class: center, middle, inverse, title-slide

# 03 - geoms
## Data visualization in R
### dr. mine çetinkaya-rundel<br>duke university & rstudio

---

class: middle, inverse

# 🔗 [bit.ly/dataviz-enar-2022](https://bit.ly/dataviz-enar-2022)

To follow along with the exercises, open and make a permanent copy of the RStudio Cloud project at https://rstudio.cloud/project/3796661.

---

class: middle, inverse

# Setup

---

## Packages

```r
# load packages
library(tidyverse)
library(openintro)
```

---

## ggplot2 theme

```r
# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))
```

---

## Figure sizing

For more on including figures in R Markdown documents with the right size, resolution, etc., the following resources are great:

- [R for Data Science - Graphics for communication](https://r4ds.had.co.nz/graphics-for-communication.html)
- [Tips and tricks for working with images and figures in R Markdown documents](https://www.zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/)

```r
# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,     # 8"
  fig.asp = 0.618,   # the golden ratio
  fig.retina = 3,    # dpi multiplier for displaying HTML output on retina
  dpi = 300,         # higher dpi, sharper image
  out.width = "60%"
)
```

---

## Data prep: new variables

```r
duke_forest <- duke_forest %>%
  mutate(
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      TRUE ~ as.character(decade_built)
    ),
    decade_built_cat = factor(decade_built_cat, ordered = TRUE)
  )

duke_forest %>%
  select(year_built, decade_built, decade_built_cat)
```

```
## # A tibble: 98 × 3
##    year_built decade_built decade_built_cat
##         <dbl>        <dbl> <ord>
##  1       1972         1970 1970
##  2       1969         1960 1960
##  3       1959         1950 1950
##  4       1961         1960 1960
##  5       2020         2020 1990 or after
##  6       2014         2010 1990 or after
##  7       1968         1960 1960
##  8       1973         1970 1970
##  9       1972         1970 1970
## 10       1964         1960 1960
## # … with 88 more rows
```

---

## Data prep: summary table

```r
mean_area_decade <- duke_forest %>%
  group_by(decade_built_cat) %>%
  summarise(mean_area = mean(area))

mean_area_decade
```

```
## # A tibble: 6 × 2
##   decade_built_cat mean_area
##   <ord>                <dbl>
## 1 1940 or before       2072.
## 2 1950                 2545.
## 3 1960                 2873.
## 4 1970                 3413.
## 5 1980                 2889.
## 6 1990 or after        2822.
```

---

class: middle, inverse

# Geoms

---

## Geoms

- Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create
- You can think of them as "the geometric shape used to represent the data"

---

## One variable

- Discrete
  - `geom_bar()`: display distribution of discrete variable
- Continuous
  - `geom_histogram()`: bin and count continuous variable, display with bars
  - `geom_density()`: smoothed density estimate
  - `geom_dotplot()`: stack individual points into a dot plot
  - `geom_freqpoly()`: bin and count continuous variable, display with lines

---

## .hand[aside...]

Always use "typewriter text" (monospace font) when writing function names, and follow with `()`, e.g.,

- `geom_freqpoly()`
- `mean()`
- `lm()`

---

## `geom_bar()`

```r
ggplot(duke_forest, aes(x = decade_built_cat)) +
  geom_bar()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-6-1.png" width="60%" />

---

## `geom_bar()`

```r
ggplot(duke_forest, aes(y = decade_built_cat)) +
  geom_bar()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-7-1.png" width="60%" />

---

## `geom_histogram()`

```r
ggplot(duke_forest, aes(x = price)) +
  geom_histogram()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-8-1.png" width="60%" />

---

## `geom_histogram()` and `binwidth`

.panelset[
.panel[.panel-name[20K]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_histogram(binwidth = 20000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-9-1.png" width="60%" />
]
.panel[.panel-name[50K]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_histogram(binwidth = 50000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-10-1.png" width="60%" />
]
.panel[.panel-name[100K]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_histogram(binwidth = 100000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-11-1.png" width="60%" />
]
.panel[.panel-name[200K]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_histogram(binwidth = 200000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-12-1.png" width="60%" />
]
]

---

## `geom_density()`

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-13-1.png" width="60%" />

---

## `geom_density()` and bandwidth (`bw`)

.panelset[
.panel[.panel-name[1]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(bw = 1)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-14-1.png" width="60%" />
]
.panel[.panel-name[1000]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(bw = 1000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-15-1.png" width="60%" />
]
.panel[.panel-name[50000]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(bw = 50000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-16-1.png" width="60%" />
]
.panel[.panel-name[500000]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(bw = 500000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-17-1.png" width="60%" />
]
]

---

## `geom_density()` outlines

.panelset[
.panel[.panel-name[full]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(outline.type = "full")
```

<img src="03-geoms_files/figure-html/unnamed-chunk-18-1.png" width="60%" />
]
.panel[.panel-name[both]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(outline.type = "both")
```

<img src="03-geoms_files/figure-html/unnamed-chunk-19-1.png" width="60%" />
]
.panel[.panel-name[upper]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(outline.type = "upper")
```

<img src="03-geoms_files/figure-html/unnamed-chunk-20-1.png" width="60%" />
]
.panel[.panel-name[lower]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_density(outline.type = "lower")
```
<img src="03-geoms_files/figure-html/unnamed-chunk-21-1.png" width="60%" />
]
]

---

## `geom_dotplot()`

.task[
What does each point represent? How are their locations determined? What do the x and y axes represent?
]

```r
ggplot(duke_forest, aes(x = price)) +
  geom_dotplot(binwidth = 50000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-22-1.png" width="60%" />
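---

## .hand[aside...]

The questions on the `geom_dotplot()` slide come down to how dots are placed. A minimal sketch, assuming ggplot2's documented behaviour: each dot is one row of the data, dots are grouped into bins along x (the default is `method = "dotdensity"`; `method = "histodot"` uses fixed-width bins like a histogram), and the default y axis is not a meaningful count, so it is often hidden. The `prices` data frame below is a hypothetical stand-in for `duke_forest`.

```r
library(ggplot2)

# hypothetical toy data standing in for duke_forest$price
prices <- data.frame(
  price = c(210000, 230000, 240000, 410000, 415000, 690000)
)

p_dots <- ggplot(prices, aes(x = price)) +
  # fixed-width bins, as in a histogram; one dot per row of the data
  geom_dotplot(method = "histodot", binwidth = 50000) +
  # the default y axis of a dot plot is not a count, so drop it
  scale_y_continuous(NULL, breaks = NULL)

p_dots
```

Swapping `prices` for `duke_forest` should give the same kind of plot as on the slide, minus the misleading y axis.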
---

## `geom_freqpoly()`

```r
ggplot(duke_forest, aes(x = price)) +
  geom_freqpoly(binwidth = 50000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-24-1.png" width="60%" />

---

## `geom_freqpoly()` for comparisons

.panelset[
.panel[.panel-name[Histogram]

```r
ggplot(duke_forest, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-25-1.png" width="60%" />
]
.panel[.panel-name[Frequency polygon]

```r
ggplot(duke_forest, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, size = 1)
```

<img src="03-geoms_files/figure-html/unnamed-chunk-26-1.png" width="60%" />
]
]

---

## Two variables

- Both continuous
  - `geom_point()`: scatterplot
  - `geom_quantile()`: smoothed quantile regression
  - `geom_rug()`: marginal rug plots
  - `geom_smooth()`: smoothed line of best fit
  - `geom_text()`: text labels

---

## `geom_rug()`

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point() +
  geom_rug()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-27-1.png" width="60%" />

---

## `geom_rug()` on the outside

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point() +
  geom_rug(outside = TRUE) +
  coord_cartesian(clip = "off")
```

<img src="03-geoms_files/figure-html/unnamed-chunk-28-1.png" width="60%" />

---

## `geom_rug()` on the outside, but better

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point() +
  geom_rug(outside = TRUE, sides = "tr") +
  coord_cartesian(clip = "off") +
  theme(plot.margin = margin(1, 1, 1, 1, "cm"))
```

<img src="03-geoms_files/figure-html/unnamed-chunk-29-1.png" width="60%" />

---

## `geom_text()`

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_text(aes(label = bed))
```

<img src="03-geoms_files/figure-html/unnamed-chunk-30-1.png" width="60%" />

---

## `geom_text()` and more

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_text(aes(label = bed, size = bed, color = bed))
```

<img src="03-geoms_files/figure-html/unnamed-chunk-31-1.png" width="60%" />

---

## `geom_text()` and even more

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_text(
    aes(label = bed, size = bed, color = bed),
    show.legend = FALSE
  )
```

<img src="03-geoms_files/figure-html/unnamed-chunk-32-1.png" width="60%" />

---

## Two variables

- Show distribution
  - `geom_bin2d()`: bin into rectangles and count
  - `geom_density2d()`: smoothed 2d density estimate
  - `geom_hex()`: bin into hexagons and count

---

## `geom_hex()`

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_hex()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-33-1.png" width="60%" />

---

## `geom_hex()` and warnings

- Requires installing the [**hexbin**](https://cran.r-project.org/web/packages/hexbin/index.html) package separately!

```r
install.packages("hexbin")
```

- Otherwise you might see

```
Warning: Computation failed in `stat_binhex()`
```

---

## Two variables

- At least one discrete
  - `geom_count()`: count the number of points at distinct locations
  - `geom_jitter()`: randomly jitter overlapping points
- One continuous, one discrete
  - `geom_col()`: a bar chart of pre-computed summaries
  - `geom_boxplot()`: boxplots
  - `geom_violin()`: show density of values in each group

---

## `geom_jitter()`

.task[
How are the following three plots different?
]

.panelset[
.panel[.panel-name[Plot A]

```r
ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_point()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-35-1.png" width="60%" />
]
.panel[.panel-name[Plot B]

```r
ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-36-1.png" width="60%" />
]
.panel[.panel-name[Plot C]

```r
ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-37-1.png" width="60%" />
]
]
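---

## .hand[aside...]

By default `geom_jitter()` adds random noise in both directions, which for a plot of price against bedrooms would also nudge the prices. A minimal sketch, assuming ggplot2's documented `width` and `height` arguments, that spreads points sideways only; the `homes` data frame is a hypothetical stand-in for `duke_forest`.

```r
library(ggplot2)

# hypothetical toy data: a discrete x (bedrooms) and a continuous y (price)
homes <- data.frame(
  bed   = c(2, 3, 3, 3, 4, 4, 5),
  price = c(250000, 310000, 310000, 450000, 520000, 520000, 800000)
)

p_jitter <- ggplot(homes, aes(x = bed, y = price)) +
  # move points sideways by at most 0.2 bedrooms; leave price untouched
  geom_jitter(width = 0.2, height = 0)

p_jitter
```

Setting `height = 0` keeps every point at its true price, so only the discrete axis is perturbed.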
---

## `geom_jitter()` and `set.seed()`

.panelset[
.panel[.panel-name[Plot A]

```r
set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-39-1.png" width="60%" />
]
.panel[.panel-name[Plot B]

```r
set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-40-1.png" width="60%" />
]
]

---

## Two variables

- One time, one continuous
  - `geom_area()`: area plot
  - `geom_line()`: line plot
  - `geom_step()`: step plot
- Display uncertainty
  - `geom_crossbar()`: vertical bar with center
  - `geom_errorbar()`: error bars
  - `geom_linerange()`: vertical line
  - `geom_pointrange()`: vertical line with center
- Spatial
  - `geom_map()`: fast version of `geom_polygon()` for map data (more on this later...)

---

## Average price per year built

```r
mean_price_year <- duke_forest %>%
  group_by(year_built) %>%
  summarise(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  )

mean_price_year
```

```
## # A tibble: 44 × 4
##    year_built     n mean_price sd_price
##         <dbl> <int>      <dbl>    <dbl>
##  1       1923     1     285000      NA
##  2       1934     1     600000      NA
##  3       1938     1     265000      NA
##  4       1940     1     105000      NA
##  5       1941     2     432500   28284.
##  6       1945     2     525000  530330.
##  7       1951     2     567500  258094.
##  8       1952     2     531250  469165.
##  9       1953     2     575000   35355.
## 10       1954     4     600000   33912.
## # … with 34 more rows
```

---

## `geom_line()`

```r
ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_line()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-42-1.png" width="60%" />

---

## `geom_area()`

```r
ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_area()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-43-1.png" width="60%" />

---

## `geom_step()`

```r
ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_step()
```

<img src="03-geoms_files/figure-html/unnamed-chunk-44-1.png" width="60%" />

---

## `geom_errorbar()`

.task[
Describe how this plot is constructed and what the points and the lines (error bars) correspond to.
]

.panelset[
.panel[.panel-name[Code]

```r
ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price))
```
]
.panel[.panel-name[Plot]
<img src="03-geoms_files/figure-html/unnamed-chunk-45-1.png" width="60%" />
]
]
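---

## .hand[aside...]

In the error bar plot, each point is a per-year mean and each bar spans one standard deviation on either side of it. Years with a single sale have `sd_price = NA`, so no bar can be drawn for them. A minimal sketch of that construction on hypothetical toy data (`sales`, `price_summary`, and `p_err` are illustrative names, not part of the original slides), drawing bars only where an sd exists:

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data: three build years, one of them with a single sale
sales <- tibble(
  year_built = c(1950, 1950, 1960, 1960, 1960, 1970),
  price      = c(400000, 500000, 300000, 350000, 400000, 600000)
)

price_summary <- sales %>%
  group_by(year_built) %>%
  summarise(
    n          = n(),
    mean_price = mean(price),
    sd_price   = sd(price)   # NA when a year has only one sale
  ) %>%
  mutate(
    ymin = mean_price - sd_price,
    ymax = mean_price + sd_price
  )

p_err <- ggplot(price_summary, aes(x = year_built, y = mean_price)) +
  geom_point() +
  # draw bars only for years where an sd could be computed
  geom_errorbar(
    data = filter(price_summary, n > 1),
    aes(ymin = ymin, ymax = ymax),
    width = 2
  )

p_err
```

Filtering to `n > 1` makes the missing bars an explicit choice rather than a side effect of `NA` values.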
---

## Let's clean things up a bit!

Meet your new best friend, the [**scales**](https://scales.r-lib.org/) package!

```r
library(scales)
```

---

## Let's clean things up a bit!

.panelset[
.panel[.panel-name[Code]

```r
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.6, size = 2, color = "#012169") +
  scale_x_continuous(labels = label_number(big.mark = ",")) +
  scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Sale prices of homes in Duke Forest",
    subtitle = "As of November 2020",
    caption = "Source: Zillow.com"
  )
```
]
.panel[.panel-name[Plot]
<img src="03-geoms_files/figure-html/unnamed-chunk-48-1.png" width="60%" />
]
]

---

.task[
Find a new geom that we haven't introduced so far and use it to visualize the `duke_forest` data.
]
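---

## .hand[aside...]

The `label_number()` and `label_dollar()` calls used in the cleaned-up plot do not format anything themselves: each returns a function, and that function is what `labels =` expects. A minimal sketch for previewing axis labels on plain numbers before plotting; `fmt_area` and `fmt_price` are hypothetical names chosen for illustration.

```r
library(scales)

# label_*() functions return a formatting function ...
fmt_area  <- label_number(big.mark = ",")
fmt_price <- label_dollar(scale = 1/1000, suffix = "K")

# ... which can be called on plain numbers to preview axis labels
fmt_area(c(1500, 2500, 6000))
fmt_price(c(250000, 500000, 1500000))
```

Trying the formatter on a handful of values first is a quick way to check `scale` and `suffix` before they reach the plot.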