class: center, middle, inverse, title-slide # 02 - grammar of graphics ## Data visualization in R ###
dr. mine çetinkaya-rundel
duke university & rstudio --- class: middle, inverse # 🔗 [bit.ly/dataviz-enar-2022](https://bit.ly/dataviz-enar-2022) To follow along with the exercises, open and make a permanent copy of the RStudio Cloud project at https://rstudio.cloud/project/3796661. --- class: middle, inverse # Data visualization --- ## Data visualization - Data visualization is the creation and study of the visual representation of data - Many tools for visualizing data -- R is one of them - Many approaches/systems within R for making data visualizations -- **ggplot2** is one of them, and that's what we're going to use --- ## ggplot2 `\(\in\)` tidyverse .pull-left[ <img src="images/ggplot2-part-of-tidyverse.png" width="80%" /> ] .pull-right[ - **ggplot2** is tidyverse's data visualization package - `gg` in "ggplot2" stands for Grammar of Graphics - Inspired by the book **Grammar of Graphics** by Leland Wilkinson ] --- ## Grammar of Graphics .pull-left-narrow[ A grammar of graphics is a tool that enables us to concisely describe the components of a graphic ] .pull-right-wide[ <img src="images/grammar-of-graphics.png" width="90%" /> ] .footnote[ Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html) ] --- ## Hello ggplot2! 
.pull-left-wide[ - `ggplot()` is the main function in ggplot2 - Plots are constructed in layers - Structure of the code for plots can be summarized as ```r ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) + geom_xxx() + other options ``` - The ggplot2 package comes with the tidyverse ```r library(tidyverse) ``` - For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/) ] --- class: middle, inverse # ggplot2 ❤️ 🐧 --- ## ggplot2 `\(\in\)` tidyverse .pull-left[ <img src="images/ggplot2-part-of-tidyverse.png" width="80%" /> ] .pull-right[ - **ggplot2** is tidyverse's data visualization package - Structure of the code for plots can be summarized as ```r ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) + geom_xxx() + other options ``` ] --- ## Data: Palmer Penguins Measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex. .pull-left-narrow[ <img src="images/penguins.png" width="80%" /> ] .pull-right-wide[ ```r library(palmerpenguins) glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` ] --- .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-9-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = 
species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Bill depth (mm)", y = "Bill length (mm)", colour = "Species") ``` ] ] --- class: middle, inverse # Coding out loud --- .midi[ > **Start with the `penguins` data frame** ] .pull-left[ ```r *ggplot(data = penguins) ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-10-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > **map bill depth to the x-axis** ] .pull-left[ ```r ggplot(data = penguins, * mapping = aes(x = bill_depth_mm)) ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-11-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > **and map bill length to the y-axis.** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, * y = bill_length_mm)) ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-12-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > **Represent each observation with a point** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm)) + * geom_point() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-13-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. 
> Represent each observation with a point > **and map species to the colour of each point.** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, * colour = species)) + geom_point() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. > **Title the plot "Bill depth and length"** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + * labs(title = "Bill depth and length") ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-15-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. > Title the plot "Bill depth and length", > **add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins"** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", * subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins") ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-16-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. 
> Title the plot "Bill depth and length", > add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", > **label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", * x = "Bill depth (mm)", y = "Bill length (mm)") ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-17-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. > Title the plot "Bill depth and length", > add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", > label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively, > **label the legend "Species"** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Bill depth (mm)", y = "Bill length (mm)", * colour = "Species") ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-18-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. 
> Title the plot "Bill depth and length", > add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", > label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively, > label the legend "Species", > **and add a caption for the data source.** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Bill depth (mm)", y = "Bill length (mm)", colour = "Species", * caption = "Source: Palmer Station LTER / palmerpenguins package") ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-19-1.png" width="100%" /> ] --- .midi[ > Start with the `penguins` data frame, > map bill depth to the x-axis > and map bill length to the y-axis. > Represent each observation with a point > and map species to the colour of each point. > Title the plot "Bill depth and length", > add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", > label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively, > label the legend "Species", > and add a caption for the data source. 
> **Finally, use a discrete colour scale that is designed to be perceived by viewers with common forms of colour blindness.** ] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Bill depth (mm)", y = "Bill length (mm)", colour = "Species", caption = "Source: Palmer Station LTER / palmerpenguins package") + * scale_colour_viridis_d() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-20-1.png" width="100%" /> ] --- .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-21-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + labs(title = "Bill depth and length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x = "Bill depth (mm)", y = "Bill length (mm)", colour = "Species", caption = "Source: Palmer Station LTER / palmerpenguins package") + scale_colour_viridis_d() ``` ] .panel[.panel-name[Narrative] .pull-left-wide[ .midi[ Start with the `penguins` data frame, map bill depth to the x-axis and map bill length to the y-axis. Represent each observation with a point and map species to the colour of each point. Title the plot "Bill depth and length", add the subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", label the x and y axes as "Bill depth (mm)" and "Bill length (mm)", respectively, label the legend "Species", and add a caption for the data source. Finally, use a discrete colour scale that is designed to be perceived by viewers with common forms of colour blindness. ] ] ] ] --- ## Argument names .tip[ You can omit the names of the first two arguments when building plots with `ggplot()`. 
] .pull-left[ ```r ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) + geom_point() + scale_colour_viridis_d() ``` ] --- class: middle, inverse # Aesthetics --- ## Aesthetics options Commonly used characteristics of plotting characters that can be **mapped to a specific variable** in the data are - `colour` - `shape` - `size` - `alpha` (transparency) --- ## Colour .pull-left[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, * colour = species)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-22-1.png" width="100%" /> ] --- ## Shape Mapped to a different variable than `colour` .pull-left[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, colour = species, * shape = island)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-23-1.png" width="100%" /> ] --- ## Shape Mapped to same variable as `colour` .pull-left[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, colour = species, * shape = species)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-24-1.png" width="100%" /> ] --- ## Size .pull-left[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, colour = species, shape = species, * size = body_mass_g)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-25-1.png" width="100%" /> ] --- ## Alpha .pull-left[ ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, colour = species, shape = species, size = body_mass_g, * alpha = flipper_length_mm)) + geom_point() + scale_colour_viridis_d() ``` ] .pull-right[ <img 
src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-26-1.png" width="100%" /> ] --- .pull-left[ **Mapping** ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, * size = body_mass_g, * alpha = flipper_length_mm)) + geom_point() ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-27-1.png" width="100%" /> ] .pull-right[ **Setting** ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + * geom_point(size = 2, alpha = 0.5) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-28-1.png" width="100%" /> ] --- ## Mapping vs. setting - **Mapping:** Determine the size, alpha, etc. of points based on the values of a variable in the data - goes into `aes()` - **Setting:** Determine the size, alpha, etc. of points **not** based on the values of a variable in the data - goes into `geom_*()` (this was `geom_point()` in the previous example, but we'll learn about other geoms soon!) --- class: middle, inverse # Faceting --- ## Faceting - Smaller plots that display different subsets of the data - Useful for exploring conditional relationships and large data --- .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-29-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_grid(species ~ island) ``` ] ] --- ## Various ways to facet .question[ In the next few slides describe what each plot displays. Think about how the code relates to the output. **Note:** The plots in the next few slides do not have proper titles, axis labels, etc. because we want you to figure out what's happening in the plots. But you should always label your plots! 
] --- ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_grid(species ~ sex) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-30-1.png" width="60%" /> --- ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_grid(sex ~ species) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-31-1.png" width="60%" /> --- ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_wrap(~ species) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-32-1.png" width="60%" /> --- ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_grid(. ~ species) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-33-1.png" width="60%" /> --- ```r ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) + geom_point() + * facet_wrap(~ species, ncol = 2) ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-34-1.png" width="60%" /> --- ## Faceting summary - `facet_grid()`: - 2d grid - `rows ~ cols` - use `.` for no split - `facet_wrap()`: 1d ribbon wrapped according to number of rows and columns specified or available plotting area --- ## Facet and color .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-35-1.png" width="60%" /> ] .panel[.panel-name[Code] ```r ggplot( penguins, aes(x = bill_depth_mm, y = bill_length_mm, * color = species)) + geom_point() + facet_grid(species ~ sex) + * scale_color_viridis_d() ``` ] ] --- ## Facet and color, no legend .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-36-1.png" width="60%" /> ] .panel[.panel-name[Code] ```r ggplot( penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) + * geom_point(show.legend = FALSE) + facet_grid(species ~ sex) + scale_color_viridis_d() ``` ] ] --- class: middle, inverse # Take a 
sad plot, and make it better --- The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below. <img src="images/staff-employment.png" width="80%" style="display: block; margin: auto;" /> --- Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are the percentage of hires of that type of faculty for each year. ```r staff <- read_csv("data/instructional-staff.csv") staff ``` ``` ## # A tibble: 5 × 12 ## faculty_type `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Full-Time Tenu… 29 27.6 25 24.8 21.8 20.3 19.3 17.8 17.2 ## 2 Full-Time Tenu… 16.1 11.4 10.2 9.6 8.9 9.2 8.8 8.2 8 ## 3 Full-Time Non-… 10.3 14.1 13.6 13.6 15.2 15.5 15 14.8 14.9 ## 4 Part-Time Facu… 24 30.4 33.1 33.2 35.5 36 37 39.3 40.5 ## 5 Graduate Stude… 20.5 16.5 18.1 18.8 18.7 19 20 19.9 19.5 ## # … with 2 more variables: `2009` <dbl>, `2011` <dbl> ``` --- ## Recreate the visualization In order to recreate this visualization we need to first reshape the data to have one variable for faculty type and one variable for year. In other words, we will convert the data from the wide format to the long format. But before we do so... .task[ If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have? 
] --- class: center, middle <img src="images/pivot.gif" width="80%" style="display: block; margin: auto;" /> --- ## `pivot_*()` functions <img src="images/tidyr-longer-wider.gif" width="60%" /> --- ## `pivot_longer()` ```r pivot_longer(data, cols, names_to = "name", values_to = "value") ``` - The first argument is `data` as usual. - The second argument, `cols`, is where you specify which columns to pivot into longer format -- in this case all columns except for the `faculty_type` - The third argument, `names_to`, is a string specifying the name of the column to create from the data stored in the column names of data -- in this case `year` - The fourth argument, `values_to`, is a string specifying the name of the column to create from the data stored in cell values, in this case `percentage` --- ## Pivot instructor data .midi[ ```r library(tidyverse) staff_long <- staff %>% pivot_longer(cols = -faculty_type, names_to = "year", values_to = "percentage") %>% mutate(percentage = as.numeric(percentage)) staff_long ``` ``` ## # A tibble: 55 × 3 ## faculty_type year percentage ## <chr> <chr> <dbl> ## 1 Full-Time Tenured Faculty 1975 29 ## 2 Full-Time Tenured Faculty 1989 27.6 ## 3 Full-Time Tenured Faculty 1993 25 ## 4 Full-Time Tenured Faculty 1995 24.8 ## 5 Full-Time Tenured Faculty 1999 21.8 ## 6 Full-Time Tenured Faculty 2001 20.3 ## 7 Full-Time Tenured Faculty 2003 19.3 ## 8 Full-Time Tenured Faculty 2005 17.8 ## 9 Full-Time Tenured Faculty 2007 17.2 ## 10 Full-Time Tenured Faculty 2009 16.8 ## # … with 45 more rows ``` ] --- .question[ This doesn't look quite right, how would you fix it? 
] .small[ ```r staff_long %>% ggplot(aes(x = percentage, y = year, color = faculty_type)) + geom_col(position = "dodge") ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-42-1.png" width="60%" /> ] --- .midi[ ```r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col(position = "dodge") ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-43-1.png" width="60%" /> ] --- ## Some improvement... .midi[ ```r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col() ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-44-1.png" width="60%" /> ] --- ## More improvement .midi[ ```r staff_long %>% ggplot(aes(x = year, y = percentage, group = faculty_type, color = faculty_type)) + geom_line() + theme_minimal() ``` <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-45-1.png" width="100%" /> ] --- ## Goal: even more improvement! .task[ I want to achieve the following look but I have no idea how! 
] <img src="images/sketch.png" width="70%" /> --- .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/instructor-lines-1.png" width="100%" /> ] .panel[.panel-name[Code] ```r library(scales) staff_long %>% * mutate( * part_time = if_else(faculty_type == "Part-Time Faculty", * "Part-Time Faculty", "Other Faculty"), * year = as.numeric(year) * ) %>% ggplot(aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) + geom_line() + * scale_color_manual(values = c("gray", "red")) + * scale_y_continuous(labels = label_percent(accuracy = 1)) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Percentage", color = NULL ) + * theme(legend.position = "bottom") ``` ] ] --- class: middle, inverse # A/B testing --- ## Data: Sale prices of houses in Duke Forest .pull-left[ - Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020 - Scraped from Zillow - Source: `openintro::duke_forest` ] .pull-right[ <img src="images/duke_forest_home.jpg" title="Home in Duke Forest" alt="Home in Duke Forest" width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## `openintro::duke_forest` ```r library(openintro) glimpse(duke_forest) ``` ``` ## Rows: 98 ## Columns: 13 ## $ address <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha… ## $ price <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, … ## $ bed <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,… ## $ bath <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,… ## $ area <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,… ## $ type <chr> "Single Family", "Single Family", "Single Family", "Single … ## $ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,… ## $ heating <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p… ## $ cooling <fct> central, central, central, central, central, 
central, centr… ## $ parking <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered… ## $ lot <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,… ## $ hoa <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ url <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-… ``` --- ## A simple visualization .panelset[ .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point(alpha = 0.7, size = 2) + geom_smooth(method = "lm", se = FALSE, size = 0.7) + labs( x = "Area (square feet)", y = "Sale price (USD)", title = "Price and area of houses in Duke Forest" ) ``` ] .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-49-1.png" width="70%" /> ] ] --- ## New variable: `decade_built` ```r duke_forest <- duke_forest %>% mutate(decade_built = (year_built %/% 10) * 10) duke_forest %>% select(year_built, decade_built) ``` ``` ## # A tibble: 98 × 2 ## year_built decade_built ## <dbl> <dbl> ## 1 1972 1970 ## 2 1969 1960 ## 3 1959 1950 ## 4 1961 1960 ## 5 2020 2020 ## 6 2014 2010 ## 7 1968 1960 ## 8 1973 1970 ## 9 1972 1970 ## 10 1964 1960 ## # … with 88 more rows ``` --- ## Distribution of `decade_built` ```r duke_forest <- duke_forest %>% mutate( decade_built = (year_built %/% 10) * 10 ) duke_forest %>% count(decade_built) ``` ``` ## # A tibble: 11 × 2 ## decade_built n ## <dbl> <int> ## 1 1920 1 ## 2 1930 2 ## 3 1940 5 ## 4 1950 26 ## 5 1960 32 ## 6 1970 11 ## 7 1980 13 ## 8 1990 1 ## 9 2000 1 ## 10 2010 5 ## 11 2020 1 ``` --- ## New variable: `decade_built_cat` ```r duke_forest <- duke_forest %>% mutate( decade_built_cat = case_when( decade_built <= 1940 ~ "1940 or before", decade_built >= 1990 ~ "1990 or after", TRUE ~ as.character(decade_built) ) ) duke_forest %>% count(decade_built_cat) ``` ``` ## # A tibble: 6 × 2 ## decade_built_cat n ## <chr> <int> ## 1 1940 or before 8 ## 2 1950 26 ## 3 1960 32 ## 4 1970 11 ## 5 1980 13 ## 6 1990 or after 8 ``` --- ## 
A slightly more complex visualization .panelset[ .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) ``` ] .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-53-1.png" width="90%" /> ] ] --- class: middle .task[ In the next two slides, the same plots are created with different "cosmetic" choices. Examine the two plots given (Plot A and Plot B), and indicate your preference by voting for one of them in the Vote tab. ] --- ## Test 1 .panelset[ .panel[.panel-name[Plot A] <img src="02-grammar-of-graphics_files/figure-html/test-1-a-1.png" width="90%" /> ] .panel[.panel-name[Plot B] <img src="02-grammar-of-graphics_files/figure-html/test-1-b-1.png" width="90%" /> ] ] --- ## Test 2 .panelset[ .panel[.panel-name[Plot A] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-54-1.png" width="90%" /> ] .panel[.panel-name[Plot B] <img src="02-grammar-of-graphics_files/figure-html/test-2-b-1.png" width="90%" /> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... 
] ] --- ## Minimal theme + viridis scale, default option .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-55-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) + * theme_minimal(base_size = 16) + * scale_color_viridis_d(end = 0.9) ``` ] ] --- ## Viridis scale, option A (magma) .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-57-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.5, size = 2, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) + * scale_color_viridis_d(end = 0.8, option = "A") ``` ] ] --- ## Dark theme + further theme customization .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-59-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest", ) + * theme_dark(base_size = 16) + * scale_color_manual(values = c("yellow", "blue", "orange", "red", "green", "white")) + * theme( * text = 
element_text(color = "red", face = "bold.italic"), * plot.background = element_rect(fill = "yellow") * ) ``` ] ] --- class: middle, inverse # What makes bad figures bad? --- ## Bad taste <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-61-1.png" width="90%" /> --- ## Data-to-ink ratio .pull-left-wide[ Tufte strongly recommends maximizing the **data-to-ink ratio** in the Visual Display of Quantitative Information (Tufte, 1983). > Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51). ] .pull-right-narrow[ <img src="images/tufte-visual-display-cover.png" title="Cover of Visual Display of Quantitative Information, Tufte (1983)." alt="Cover of Visual Display of Quantitative Information, Tufte (1983)." width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- .task[ Which of the plots has a higher data-to-ink ratio? ] .panelset[ .panel[.panel-name[Plot A] <img src="02-grammar-of-graphics_files/figure-html/mean-area-decade-a-1.png" width="70%" /> ] .panel[.panel-name[Plot B] <img src="02-grammar-of-graphics_files/figure-html/mean-area-decade-b-1.png" width="70%" /> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... ] ] --- ## Summary statistics ```r mean_area_decade <- duke_forest %>% group_by(decade_built_cat) %>% summarise(mean_area = mean(area)) mean_area_decade ``` ``` ## # A tibble: 6 × 2 ## decade_built_cat mean_area ## <chr> <dbl> ## 1 1940 or before 2072. ## 2 1950 2545. ## 3 1960 2873. ## 4 1970 3413. ## 5 1980 2889. ## 6 1990 or after 2822. 
``` --- ## Barplot .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-64-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + * geom_col() + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Scatterplot .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-66-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + * geom_point(size = 4) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Lollipop plot -- a happy medium? .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/mean-area-decade-lollipop-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + * geom_segment( * aes( * x = 0, xend = mean_area, * y = decade_built_cat, yend = decade_built_cat * ) * ) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Bad data .panelset[ .panel[.panel-name[Original] <img src="images/healy-democracy-nyt-version.png" title="A crisis of faith in democracy? New York Times." alt="A crisis of faith in democracy? New York Times." width="50%" /> ] .panel[.panel-name[Improved] <img src="images/healy-democracy-voeten-version-2.png" title="A crisis of faith in democracy? New York Times." alt="A crisis of faith in democracy? New York Times." width="50%" /> ] ] .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). 
Figures 1.8 and 1.9. ] --- ## Bad perception <img src="images/healy-perception-curves.png" title="Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland." alt="Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland." width="80%" /> .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figure 1.12. ] --- class: middle, inverse # Aesthetic mappings in ggplot2 --- ## A second look: lollipop plot .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/mean-area-decade-lollipop-layer-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + geom_segment(aes( x = 0, xend = mean_area, y = decade_built_cat, yend = decade_built_cat )) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Activity: Spot the difference I .task[ Can you spot the differences between the code here and the one provided in the previous slide? Are there any differences in the resulting plot? Work in a pair (or group) to answer. ] .panelset[ .panel[.panel-name[Plot] <img src="02-grammar-of-graphics_files/figure-html/mean-area-decade-lollipop-global-1.png" width="50%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + geom_segment(aes( xend = 0, yend = decade_built_cat )) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ]
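---

One way to check your answer to the activity, sketched under a few assumptions: the data frame below re-types the rounded means printed on the "Summary statistics" slide (so the values are approximate), and the names `p_layer`, `p_global`, `seg_layer`, and `seg_global` are illustrative and do not appear in the original slides. `layer_data()` returns the data ggplot2 computed for a given layer, so we can compare the segments directly instead of relying on our eyes.

```r
library(ggplot2)

# Rounded means copied from the printed summary table (approximate values)
mean_area_decade <- data.frame(
  decade_built_cat = c("1940 or before", "1950", "1960",
                       "1970", "1980", "1990 or after"),
  mean_area = c(2072, 2545, 2873, 3413, 2889, 2822)
)

# Version 1: every segment aesthetic is spelled out inside the layer
p_layer <- ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) +
  geom_point(size = 4) +
  geom_segment(aes(
    x = 0, xend = mean_area,
    y = decade_built_cat, yend = decade_built_cat
  ))

# Version 2: the layer inherits x and y from ggplot() and only adds the end points
p_global <- ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 0, yend = decade_built_cat))

# Layer 2 is geom_segment() in both plots
seg_layer  <- layer_data(p_layer, 2)
seg_global <- layer_data(p_global, 2)

# Compare the start and end coordinates of the segments
seg_layer[, c("x", "xend", "y", "yend")]
seg_global[, c("x", "xend", "y", "yend")]
```

If the comparison behaves as expected, the two versions describe the same segments with the start and end points swapped, so the plots should look the same even though the code differs.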
---

## Global vs. layer-specific aesthetics

- Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both.
- Within each layer, you can add, override, or remove mappings.
- If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers.

---

## Wrap up

.task[
Think back to all the plots you saw in the lecture, without flipping back through the slides. Which plot first comes to mind? Describe it in words.
]

---

.task[
Change the theme of the following plot to something else. See https://ggplot2.tidyverse.org/reference/theme.html for options. Make other improvements as you see fit.
]

<img src="02-grammar-of-graphics_files/figure-html/unnamed-chunk-75-1.png" width="60%" />
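---

If you are unsure where to start with the theme task, here is a minimal sketch of the mechanics. The code for the plot above is not shown in the slides, so this uses a stand-in plot built from the rounded means printed on the "Summary statistics" slide; the object names `p` and `p_themed` and the specific theme choices are assumptions for illustration, not the intended answer.

```r
library(ggplot2)

# Stand-in data: rounded means from the earlier summary table (approximate)
mean_area_decade <- data.frame(
  decade_built_cat = c("1940 or before", "1950", "1960",
                       "1970", "1980", "1990 or after"),
  mean_area = c(2072, 2545, 2873, 3413, 2889, 2822)
)

# Base plot, drawn with the default theme
p <- ggplot(mean_area_decade, aes(x = mean_area, y = decade_built_cat)) +
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

# Swap in a complete theme first, then adjust individual elements with theme()
p_themed <- p +
  theme_bw(base_size = 16) +
  theme(
    panel.grid.minor = element_blank(),        # drop minor grid lines
    axis.title = element_text(face = "bold")   # emphasize axis titles
  )

p_themed
```

The order matters here: a complete theme such as `theme_bw()` resets all theme elements, so individual `theme()` tweaks are usually placed after it.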