class: center, middle, inverse, title-slide # 02
teaching the tidyverse ## 🧹 tidy up your teaching!
🔗
bit.ly/teach-ds-wsc
###
dr. mine çetinkaya-rundel
dr. colin rundel ### 23 june 2021 --- class: middle, inverse # What, why, how? --- class: middle # What is the tidyverse? --- ## What is the tidyverse? The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. - **ggplot2** - data visualisation - **dplyr** - data manipulation - **tidyr** - tidy data - **readr** - read rectangular data - **purrr** - functional programming - **tibble** - modern data frames - **stringr** - string manipulation - **forcats** - factors - and many more ... --- ## Tidy data <img src="img/tidy-data-frame.png" title="tidy data diagram" alt="tidy data diagram" width="614" /> 1. Each variable must have its own column. 1. Each observation must have its own row. 1. Each value must have its own cell. .footnote[ Source: R for Data Science. Grolemund and Wickham. ] --- ## Tidy data + Tidyverse references .pull-left[ <img src="img/tidy-papers.png" title="tidy papers screenshot" alt="tidy papers screenshot" width="458" /> ] .pull-right[ - Wickham (2014). **Tidy data.** Journal of Statistical Software, 59(10), 1-23. - Wickham et al. (2019). **Welcome to the Tidyverse.** Journal of Open Source Software, 4(43), 1686. ] --- ## Pipe operator (`magrittr`) > I want to find my keys, then start my car, then drive to work, then park my car. -- - Nested ```r park(drive(start_car(find("keys")), to = "work")) ``` -- - **Piped** ```r find("keys") %>% start_car() %>% drive(to = "work") %>% park() ``` --- ## `magrittr` vs native pipe As of R 4.1.0 there is now a native pipe operator in R (`|>`) which is very similar to magrittr's (`%>%`). For teaching purposes we would strongly recommend using magrittr for the foreseeable future. - `|>` only supports piping to the first argument (no support for `.`) - For most use cases, package dependencies are easier than R version dependencies --- class: middle, center # Why tidyverse? 
--- ## Recoding a binary variable .pull-left[ ### Base R ```r mtcars$transmission <- ifelse( mtcars$am == 0, "automatic", "manual" ) ``` ] .pull-right[ ### Tidyverse ```r mtcars <- mtcars %>% mutate( transmission = case_when( am == 0 ~ "automatic", am == 1 ~ "manual" ) ) ``` ] --- ## Recoding a multi-level variable .pull-left[ ### Base R ```r mtcars$gear_char <- ifelse( mtcars$gear == 3, "three", ifelse( mtcars$gear == 4, "four", "five" ) ) ``` ] .pull-right[ ### Tidyverse ```r mtcars <- mtcars %>% mutate( gear_char = case_when( gear == 3 ~ "three", gear == 4 ~ "four", gear == 5 ~ "five" ) ) ``` ] --- ## Visualising multiple variables ### Tidyverse .small[ ```r ggplot( mtcars, aes(x = disp, y = mpg, color = transmission) ) + geom_point() ``` <img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-9-1.png" width="100%" /> ] --- ## Visualising even more variables ### Tidyverse .small[ ```r ggplot( mtcars, aes(x = disp, y = mpg, color = transmission) ) + geom_point() + facet_wrap(~ cyl) ``` <img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-10-1.png" width="100%" /> ] --- ### Base R .small[ ```r mtcars$trans_color <- ifelse(mtcars$transmission == "automatic", "green", "blue") mtcars_cyl4 = mtcars[mtcars$cyl == 4, ] mtcars_cyl6 = mtcars[mtcars$cyl == 6, ] mtcars_cyl8 = mtcars[mtcars$cyl == 8, ] par(mfrow = c(1, 3), mar = c(2.5, 2.5, 2, 0), mgp = c(1.5, 0.5, 0)) plot(mpg ~ disp, data = mtcars_cyl4, col = trans_color, main = "Cyl 4") plot(mpg ~ disp, data = mtcars_cyl6, col = trans_color, main = "Cyl 6") plot(mpg ~ disp, data = mtcars_cyl8, col = trans_color, main = "Cyl 8") legend("topright", legend = c("automatic", "manual"), pch = 1, col = c("green", "blue")) ``` <img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-11-1.png" width="100%" /> ] --- ## Benefits of starting with the tidyverse - More (human) readable syntax - More consistent syntax - Ease of multivariate visualizations - Data tidying/rectangling without advanced programming - Growth 
opportunities: - dplyr -> SQL / Spark / etc - purrr -> functional programming - modeling -> tidymodels --- class: middle # How tidyverse? --- .discussion[ How do you start your lessons? Why? - `library(tidyverse)` - `library(ggplot2)`, `library(dplyr)`, etc. ] --- ### .pink[ Sample slide ] ## ggplot2 `\(\in\)` tidyverse .pull-left[ <img src="img/ggplot2-part-of-tidyverse.png" width="80%" /> ] .pull-right[ - **ggplot2** is tidyverse's data visualization package - The `gg` in "ggplot2" stands for Grammar of Graphics - It is inspired by the book **The Grammar of Graphics** by Leland Wilkinson ] --- class: middle # Start with ggplot2 --- ## Why start with ggplot2? -- 1. Students come in with an intuition for interpreting data visualizations and do not need much instruction to do so. - Focus the majority of class time initially on syntax and leave interpretations to students. - Later on the scale tips: spend more class time on concepts and interpreting results, and less on syntax. -- 1. It can be easier for students to detect mistakes in visualizations than in wrangling or modeling. --- ## What next? It depends on the course and subject matter, but generally data munging with dplyr is a good next step. Some general guidance: - Start with a small subset of verbs (e.g. `select()`, `filter()`, `mutate()`) - Aim to quickly get to `group_by()` and `summarize()` as this is where the action is. - Connecting munging back to data visualization tends to be more motivating than generating numerical summaries. - Data cleaning provides opportunities to introduce additional packages (e.g. `stringr`, `forcats`) --- class: middle, inverse # Teaching the tidyverse in 2021 --- class: middle # Reshaping data --- ## Instructional staff employment trends The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. 
[This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below. <img src="img/staff-employment.png" title="staff employment figure" alt="staff employment figure" width="50%" style="display: block; margin: auto;" /> --- ## Data Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are percentage of hires of that type of faculty for each year. .small[ ```r (staff <- read_csv("data/instructional-staff.csv")) ``` ``` ## # A tibble: 5 x 12 ## faculty_type `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Full-Time Tenu… 29 27.6 25 24.8 21.8 20.3 19.3 17.8 17.2 ## 2 Full-Time Tenu… 16.1 11.4 10.2 9.6 8.9 9.2 8.8 8.2 8 ## 3 Full-Time Non-… 10.3 14.1 13.6 13.6 15.2 15.5 15 14.8 14.9 ## 4 Part-Time Facu… 24 30.4 33.1 33.2 35.5 36 37 39.3 40.5 ## 5 Graduate Stude… 20.5 16.5 18.1 18.8 18.7 19 20 19.9 19.5 ## # … with 2 more variables: 2009 <dbl>, 2011 <dbl> ``` ] --- ## Recreate the visualization - In order to recreate this visualization we need to first reshape the data: - one variable for faculty type - one variable for year - Convert the data from the wide format to long format -- .discussion[ How would you approach this problem? - `gather()`/`spread()` - `pivot_wider()`/ `pivot_longer()` - Something else? ] --- class: center, middle <img src="img/pivot.gif" title="pivot friends meme" alt="pivot friends meme" width="80%" style="display: block; margin: auto;" /> --- ## `pivot_*()` functions <img src="img/tidyr-longer-wider.gif" title="pivot function animation" alt="pivot function animation" /> --- But before we do so... 
**Question:** If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have? --- ## Pivot staff data .small[ ```r (staff_long <- staff %>% pivot_longer( cols = -faculty_type, # columns to pivot names_to = "year", # name of new column for variable names values_to = "percentage" # name of new column for values ) %>% mutate( percentage = as.numeric(percentage) ) ) ``` ``` ## # A tibble: 55 x 3 ## faculty_type year percentage ## <chr> <chr> <dbl> ## 1 Full-Time Tenured Faculty 1975 29 ## 2 Full-Time Tenured Faculty 1989 27.6 ## 3 Full-Time Tenured Faculty 1993 25 ## 4 Full-Time Tenured Faculty 1995 24.8 ## 5 Full-Time Tenured Faculty 1999 21.8 ## 6 Full-Time Tenured Faculty 2001 20.3 ## 7 Full-Time Tenured Faculty 2003 19.3 ## 8 Full-Time Tenured Faculty 2005 17.8 ## 9 Full-Time Tenured Faculty 2007 17.2 ## 10 Full-Time Tenured Faculty 2009 16.8 ## # … with 45 more rows ``` ] --- ## Meh .midi[ ```r ggplot(staff_long, aes(x = percentage, y = year, fill = faculty_type)) + geom_col(position = "dodge") ``` <!-- --> ] --- ## Some improvement... .midi[ ```r ggplot(staff_long, aes(x = percentage, y = year, fill = faculty_type)) + geom_col() ``` <!-- --> ] --- ## More improvement <!-- --> --- .midi[ ```r staff_long %>% mutate( part_time = if_else(faculty_type == "Part-Time Faculty", "Part-Time Faculty", "Other Faculty"), year = as.numeric(year) ) %>% ggplot( aes(x = year, y = percentage/100, group = faculty_type, color = part_time) ) + geom_line() + scale_color_manual(values = c("gray", "red")) + scale_y_continuous(labels = label_percent(accuracy = 1)) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Percentage", color = NULL ) + theme(legend.position = "bottom") ``` ] --- class: middle # When to purrr? --- ## Data manipulation with purrr (or not?) 
- purrr is a package for functional programming with the tidyverse - If you picked up the tidyverse >2 years ago, purrr was commonly used for data science tasks that involve iteration - In 2021, it's possible to do many of these data science tasks with dplyr and tidyr, and these approaches are often more accessible to new learners -- .discussion[ How familiar are you with the purrr package? Have you taught purrr in your data science courses? ] --- ## Ex 1. Flattening JSON files We have data on LEGO sales and some information on the buyers in JSON format. We want to convert it into a tidy data frame. .tiny[ ```r sales <- read_rds("data/lego_sales.rds") jsonlite::toJSON(sales[1], pretty = TRUE) ``` ``` ## [ ## { ## "gender": ["Female"], ## "first_name": ["Kimberly"], ## "last_name": ["Beckstead"], ## "age": [24], ## "phone_number": ["216-555-2549"], ## "purchases": [ ## { ## "SetID": [24701], ## "Number": ["76062"], ## "Theme": ["DC Comics Super Heroes"], ## "Subtheme": ["Mighty Micros"], ## "Year": [2016], ## "Name": ["Robin vs. Bane"], ## "Pieces": [77], ## "USPrice": [9.99], ## "ImageURL": ["http://images.brickset.com/sets/images/76062-1.jpg"], ## "Quantity": [1] ## } ## ] ## } ## ] ``` ] --- ## purrr solution .small[ ```r sales %>% purrr::map_dfr( function(l) { purchases <- purrr::map_dfr(l$purchases, ~.) 
l$purchases <- NULL bind_cols(as_tibble(l), purchases) } ) ``` ] --- ## purrr solution .small[ ``` ## # A tibble: 620 x 15 ## gender first_name last_name age phone_number SetID Number Theme Subtheme ## <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr> ## 1 Female Kimberly Beckstead 24 216-555-2549 24701 76062 DC Co… "Mighty … ## 2 Male Neel Garvin 35 819-555-3189 25626 70595 Ninja… "Rise of… ## 3 Male Neel Garvin 35 819-555-3189 24665 21031 Archi… "" ## 4 Female Chelsea Bouchard 41 <NA> 24695 31048 Creat… "" ## 5 Female Chelsea Bouchard 41 <NA> 25626 70595 Ninja… "Rise of… ## 6 Female Chelsea Bouchard 41 <NA> 24721 10831 Duplo "" ## 7 Female Bryanna Welsh 19 <NA> 24797 75138 Star … "Episode… ## 8 Female Bryanna Welsh 19 <NA> 24701 76062 DC Co… "Mighty … ## 9 Male Caleb Garcia-Wi… 37 907-555-9236 24730 41115 Frien… "" ## 10 Male Caleb Garcia-Wi… 37 907-555-9236 25611 21127 Minec… "Minifig… ## # … with 610 more rows, and 6 more variables: Year <int>, Name <chr>, ## # Pieces <int>, USPrice <dbl>, ImageURL <chr>, Quantity <dbl> ``` ] --- ## A tidyr solution .small[ ```r tibble(sales = sales) %>% unnest_wider(sales) %>% unnest_longer(purchases) %>% unnest_wider(purchases) ``` ``` ## # A tibble: 620 x 15 ## gender first_name last_name age phone_number SetID Number Theme Subtheme ## <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr> ## 1 Female Kimberly Beckstead 24 216-555-2549 24701 76062 DC Co… "Mighty … ## 2 Male Neel Garvin 35 819-555-3189 25626 70595 Ninja… "Rise of… ## 3 Male Neel Garvin 35 819-555-3189 24665 21031 Archi… "" ## 4 Female Chelsea Bouchard 41 <NA> 24695 31048 Creat… "" ## 5 Female Chelsea Bouchard 41 <NA> 25626 70595 Ninja… "Rise of… ## 6 Female Chelsea Bouchard 41 <NA> 24721 10831 Duplo "" ## 7 Female Bryanna Welsh 19 <NA> 24797 75138 Star … "Episode… ## 8 Female Bryanna Welsh 19 <NA> 24701 76062 DC Co… "Mighty … ## 9 Male Caleb Garcia-Wi… 37 907-555-9236 24730 41115 Frien… "" ## 10 Male Caleb Garcia-Wi… 37 907-555-9236 25611 21127 Minec… 
"Minifig… ## # … with 610 more rows, and 6 more variables: Year <int>, Name <chr>, ## # Pieces <int>, USPrice <dbl>, ImageURL <chr>, Quantity <dbl> ``` ] --- ## tidyr solution (Step 1) .small[ ```r tibble(sales = sales) ``` ``` ## # A tibble: 250 x 1 ## sales ## <list> ## 1 <named list [6]> ## 2 <named list [6]> ## 3 <named list [5]> ## 4 <named list [5]> ## 5 <named list [6]> ## 6 <named list [6]> ## 7 <named list [6]> ## 8 <named list [6]> ## 9 <named list [6]> ## 10 <named list [6]> ## # … with 240 more rows ``` ] --- ## tidyr solution (Step 2) .small[ ```r tibble(sales = sales) %>% unnest_wider(sales) ``` ``` ## # A tibble: 250 x 6 ## gender first_name last_name age phone_number purchases ## <chr> <chr> <chr> <dbl> <chr> <list> ## 1 Female Kimberly Beckstead 24 216-555-2549 <list [1]> ## 2 Male Neel Garvin 35 819-555-3189 <list [2]> ## 3 Female Chelsea Bouchard 41 <NA> <list [3]> ## 4 Female Bryanna Welsh 19 <NA> <list [2]> ## 5 Male Caleb Garcia-Wideman 37 907-555-9236 <list [2]> ## 6 Male Chase Fortenberry 19 205-555-3704 <list [2]> ## 7 Male Kevin Cruz 20 947-555-7946 <list [1]> ## 8 Male Connor Brown 36 516-555-4310 <list [3]> ## 9 Female Toni Borison 40 284-555-4560 <list [2]> ## 10 Male Daniel Hurst 44 251-555-0845 <list [2]> ## # … with 240 more rows ``` ] --- ## tidyr solution (Step 3) .small[ ```r tibble(sales = sales) %>% unnest_wider(sales) %>% unnest_longer(purchases) ``` ``` ## # A tibble: 620 x 6 ## gender first_name last_name age phone_number purchases ## <chr> <chr> <chr> <dbl> <chr> <list> ## 1 Female Kimberly Beckstead 24 216-555-2549 <named list [10]> ## 2 Male Neel Garvin 35 819-555-3189 <named list [10]> ## 3 Male Neel Garvin 35 819-555-3189 <named list [10]> ## 4 Female Chelsea Bouchard 41 <NA> <named list [10]> ## 5 Female Chelsea Bouchard 41 <NA> <named list [10]> ## 6 Female Chelsea Bouchard 41 <NA> <named list [10]> ## 7 Female Bryanna Welsh 19 <NA> <named list [10]> ## 8 Female Bryanna Welsh 19 <NA> <named list [10]> ## 9 Male Caleb 
Garcia-Wideman 37 907-555-9236 <named list [10]> ## 10 Male Caleb Garcia-Wideman 37 907-555-9236 <named list [10]> ## # … with 610 more rows ``` ] --- ## tidyr solution (Step 4) .small[ ```r tibble(sales = sales) %>% unnest_wider(sales) %>% unnest_longer(purchases) %>% unnest_wider(purchases) ``` ``` ## # A tibble: 620 x 15 ## gender first_name last_name age phone_number SetID Number Theme Subtheme ## <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr> ## 1 Female Kimberly Beckstead 24 216-555-2549 24701 76062 DC Co… "Mighty … ## 2 Male Neel Garvin 35 819-555-3189 25626 70595 Ninja… "Rise of… ## 3 Male Neel Garvin 35 819-555-3189 24665 21031 Archi… "" ## 4 Female Chelsea Bouchard 41 <NA> 24695 31048 Creat… "" ## 5 Female Chelsea Bouchard 41 <NA> 25626 70595 Ninja… "Rise of… ## 6 Female Chelsea Bouchard 41 <NA> 24721 10831 Duplo "" ## 7 Female Bryanna Welsh 19 <NA> 24797 75138 Star … "Episode… ## 8 Female Bryanna Welsh 19 <NA> 24701 76062 DC Co… "Mighty … ## 9 Male Caleb Garcia-Wi… 37 907-555-9236 24730 41115 Frien… "" ## 10 Male Caleb Garcia-Wi… 37 907-555-9236 25611 21127 Minec… "Minifig… ## # … with 610 more rows, and 6 more variables: Year <int>, Name <chr>, ## # Pieces <int>, USPrice <dbl>, ImageURL <chr>, Quantity <dbl> ``` ] --- ## dplyr improvements Another common use case for purrr has been working across the rows and/or columns of a data frame. Much of this functionality is now available directly in dplyr via the `across()` and `rowwise()` functions. 
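As a rough illustration of these two functions, here is a minimal sketch using the built-in `mtcars` data. The choice of columns and the names `col_means`, `row_max`, and `larger` are illustrative assumptions, not part of the original slides.

```r
library(dplyr)

# Column-wise: apply the same summary function to several columns at once,
# a task that might previously have been written with purrr::map_*()
col_means <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, disp, hp), mean))

# Row-wise: compute a value within each row, using c_across() to
# collect the selected columns of that row into a vector
row_max <- mtcars %>%
  rowwise() %>%
  mutate(larger = max(c_across(c(mpg, qsec)))) %>%
  ungroup()

col_means
row_max
```

Neither pipeline needs an explicit loop or anonymous function, which may be part of what makes this style easier to introduce to new learners.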
Additional details and examples are available in the vignettes: - [column-wise operations vignette](https://dplyr.tidyverse.org/articles/colwise.html) - [row-wise operations vignette](https://dplyr.tidyverse.org/articles/rowwise.html) and the dplyr 1.0.0 release blog posts: - [working across columns](https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/) - [working within rows](https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/) --- class: middle # Resources --- ## Recommended reading - Keep up to date with the [tidyverse blog](https://www.tidyverse.org/blog/) *for packages you teach* - Four part blog series: Teaching the Tidyverse from 2020 - [Part 1: Getting started](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/) - [Part 2: Data visualisation](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-2-data-visualisation/) - [Part 3: Data wrangling and tidying](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-3-data-wrangling-and-tidying/) - [Part 4: When to purrr?](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-4-when-to-purrr/) --- ## The larger tidy ecosystem .hand[Just to name a few...] - [janitor](https://garthtarr.github.io/meatR/janitor.html) - [kableExtra](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html) - [patchwork](https://patchwork.data-imaginist.com/) - [gghighlight](https://cran.r-project.org/web/packages/gghighlight/vignettes/gghighlight.html) - [tidybayes](https://mjskay.github.io/tidybayes/)