02 teaching the tidyverse

# 02 <br> teaching the tidyverse
## 🧹 tidy up your teaching! <br> 🔗 <a href="https://bit.ly/teach-ds-wsc">bit.ly/teach-ds-wsc</a>
### <br> dr. mine çetinkaya-rundel <br> dr. colin rundel
### 23 june 2021

---

# What, why, how?

---

# What is the tidyverse?

---

## What is the tidyverse?

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

- **ggplot2** - data visualisation
- **dplyr** - data manipulation
- **tidyr** - tidy data
- **readr** - read rectangular data
- **purrr** - functional programming
- **tibble** - modern data frames
- **stringr** - string manipulation
- **forcats** - factors
- and many more ...

---

## Tidy data

1. Each variable must have its own column.

1. Each observation must have its own row.

1. Each value must have its own cell.
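
A minimal sketch of the three rules, using a small made-up `scores_wide` table (the data and names are hypothetical, not from the workshop materials):

```r
library(tibble)
library(tidyr)

# Hypothetical data: one column per year, so the variable "year"
# is hidden in the column names (not tidy).
scores_wide <- tibble(
  student = c("A", "B"),
  `2020`  = c(81, 74),
  `2021`  = c(85, 79)
)

# Tidy version: one column per variable (student, year, score),
# one row per student-year observation, one value per cell.
scores_tidy <- pivot_longer(
  scores_wide,
  cols      = -student,
  names_to  = "year",
  values_to = "score"
)

scores_tidy
```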

---

## Tidy data + Tidyverse references

.pull-left[
<img src="img/tidy-papers.png" title="tidy papers screenshot" alt="tidy papers screenshot" width="458" />
]
.pull-right[
- Wickham (2014). **Tidy data.** Journal of Statistical Software, 59(10), 1-23.

- Wickham et al. (2019). **Welcome to the Tidyverse.** Journal of Open Source Software, 4(43), 1686.
]

---

## Pipe operator (`magrittr`)

> I want to find my keys, then start my car, then drive to work, then park my car.

- Nested

```r
park(drive(start_car(find("keys")), to = "work"))
```

- **Piped**

```r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---

## `magrittr` vs native pipe

As of R 4.1.0 there is now a native pipe operator in R (`|>`) which is very similar to magrittr's (`%>%`).

For teaching purposes we would strongly recommend using magrittr for the foreseeable future.

- `|>` only supports piping to the first argument (no support for `.`)

- For most use cases, package dependencies are easier than R version dependencies
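
A short sketch of the first point, assuming R >= 4.1.0 and using `lm()` on `mtcars` purely as an illustration. The anonymous-function wrapper is one possible workaround for the native pipe, not the only one:

```r
library(magrittr)

# magrittr: `.` marks where the left-hand side goes, so it can be
# passed to an argument other than the first.
fit_magrittr <- mtcars %>% lm(mpg ~ disp, data = .)

# Native pipe: the left-hand side always becomes the first argument,
# so here it is routed through an anonymous function instead.
fit_native <- mtcars |> (\(d) lm(mpg ~ disp, data = d))()

# Both calls fit the same model.
all.equal(coef(fit_magrittr), coef(fit_native))
```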

---

# Why tidyverse?

---

## Recoding a binary variable

.pull-left[
### Base R

```r
mtcars$transmission <-
  ifelse(
    mtcars$am == 0,
    "automatic",
    "manual"
  )
```
]
.pull-right[
### Tidyverse

```r
mtcars <- mtcars %>%
  mutate(
    transmission = case_when(
      am == 0 ~ "automatic",
      am == 1 ~ "manual"
    )
  )
```
]

---

## Recoding a multi-level variable

.pull-left[
### Base R

```r
mtcars$gear_char <-
  ifelse(
    mtcars$gear == 3,
    "three",
    ifelse(
      mtcars$gear == 4,
      "four",
      "five"
    )
  )
```
]
.pull-right[
### Tidyverse

```r
mtcars <- mtcars %>%
  mutate(
    gear_char = case_when(
      gear == 3 ~ "three",
      gear == 4 ~ "four",
      gear == 5 ~ "five"
    )
  )
```
]
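
One behavioural difference worth showing students: the two approaches treat unexpected values differently. A small sketch with a hypothetical `gear` vector that includes a value (6) not present in `mtcars`:

```r
library(dplyr)

# Hypothetical input with a value that no condition covers.
gear <- c(3, 4, 5, 6)

# Nested ifelse(): anything that is not 3 or 4 falls through to "five".
base_result <- ifelse(gear == 3, "three", ifelse(gear == 4, "four", "five"))

# case_when(): values that match no condition become NA.
tidy_result <- case_when(
  gear == 3 ~ "three",
  gear == 4 ~ "four",
  gear == 5 ~ "five"
)

base_result
tidy_result
```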

---

## Visualising multiple variables

### Tidyverse

```r
ggplot(
  mtcars,
  aes(x = disp, y = mpg, color = transmission)
) +
  geom_point()
```

<img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-9-1.png" width="100%" />

---

## Visualising even more variables

### Tidyverse

```r
ggplot(
  mtcars,
  aes(x = disp, y = mpg, color = transmission)
) +
  geom_point() +
  facet_wrap(~ cyl)
```

<img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-10-1.png" width="100%" />

---

### Base R

```r
mtcars$trans_color <- ifelse(mtcars$transmission == "automatic", "green", "blue")
mtcars_cyl4 = mtcars[mtcars$cyl == 4, ]
mtcars_cyl6 = mtcars[mtcars$cyl == 6, ]
mtcars_cyl8 = mtcars[mtcars$cyl == 8, ]
par(mfrow = c(1, 3), mar = c(2.5, 2.5, 2, 0), mgp = c(1.5, 0.5, 0))
plot(mpg ~ disp, data = mtcars_cyl4, col = trans_color, main = "Cyl 4")
plot(mpg ~ disp, data = mtcars_cyl6, col = trans_color, main = "Cyl 6")
plot(mpg ~ disp, data = mtcars_cyl8, col = trans_color, main = "Cyl 8")
legend("topright", legend = c("automatic", "manual"), pch = 1, col = c("green", "blue"))
```

<img src="02-teach-tidyverse_files/figure-html/unnamed-chunk-11-1.png" width="100%" />

---

## Benefits of starting with the tidyverse

- More (human) readable syntax

- More consistent syntax

- Ease of multivariate visualizations

- Data tidying/rectangling without advanced programming

- Growth opportunities:
  - dplyr -> SQL / Spark / etc
  - purrr -> functional programming
  - modeling -> tidymodels

---

# How tidyverse?

---

.discussion[
How do you start your lessons? Why?
- `library(tidyverse)` 
- `library(ggplot2)`, `library(dplyr)`, etc.
]

---

### .pink[ Sample slide ]

## ggplot2 `$\in$` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" />
]
.pull-right[
- **ggplot2** is tidyverse's data visualization package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- It is inspired by the book **Grammar of Graphics** by Leland Wilkinson
]

---

# Start with ggplot2

---

## Why start with ggplot2?

1. Students come in with enough intuition to interpret data visualizations without needing much instruction.
   - Initially, focus the majority of class time on syntax and leave interpretations to students.
   - Later on, the scale tips: spend more class time on concepts and interpreting results, and less on syntax.

1. It can be easier for students to detect mistakes in visualizations than in wrangling or modeling.

---

## What next?

It depends on the course and subject matter, but generally data munging with dplyr is a good next step.

Some general guidance:
- Start with a small subset of verbs (e.g. `select()`, `filter()`, `mutate()`)

- Aim to quickly get to `group_by()` and `summarize()` as this is where the action is.

- Connecting munging back to data visualization tends to be more motivating than generating numerical summaries.

- Data cleaning provides opportunities to introduce additional packages (e.g. `stringr`, `forcats`)
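
One way this progression might look in class, sketched with `mtcars` (the choice of columns and the derived `transmission` label are illustrative assumptions):

```r
library(dplyr)
library(ggplot2)

# Start with a few verbs, then get to group_by() + summarize().
mpg_by_cyl <- mtcars %>%
  select(mpg, cyl, am) %>%
  filter(!is.na(mpg)) %>%
  mutate(transmission = if_else(am == 0, "automatic", "manual")) %>%
  group_by(cyl, transmission) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

# Connect the summary back to a visualization.
ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg, fill = transmission)) +
  geom_col(position = "dodge")
```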

---

# Teaching the tidyverse in 2021

---

# Reshaping data

---

## Instructional staff employment trends

The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below.

---

## Data

Each row in this dataset represents a faculty type, and the columns are the 
years for which we have data. The values are the percentages of hires of that 
faculty type in each year.

```r
(staff <- read_csv("data/instructional-staff.csv"))
```

```
## # A tibble: 5 x 12
##   faculty_type    `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007`
##   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Full-Time Tenu…   29     27.6   25     24.8   21.8   20.3   19.3   17.8   17.2
## 2 Full-Time Tenu…   16.1   11.4   10.2    9.6    8.9    9.2    8.8    8.2    8  
## 3 Full-Time Non-…   10.3   14.1   13.6   13.6   15.2   15.5   15     14.8   14.9
## 4 Part-Time Facu…   24     30.4   33.1   33.2   35.5   36     37     39.3   40.5
## 5 Graduate Stude…   20.5   16.5   18.1   18.8   18.7   19     20     19.9   19.5
## # … with 2 more variables: 2009 <dbl>, 2011 <dbl>
```

---

## Recreate the visualization

- In order to recreate this visualization we need to first reshape the data:
  - one variable for faculty type 
  - one variable for year
  
- Convert the data from the wide format to long format

- `gather()`/`spread()`
- `pivot_wider()`/`pivot_longer()`
- Something else?

---

## `pivot_*()` functions

---

But before we do so...

**Question:** If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have?

---

## Pivot staff data

```r
(staff_long <- staff %>%
  pivot_longer(
    cols = -faculty_type,    # columns to pivot
    names_to = "year",       # name of new column for variable names
    values_to = "percentage" # name of new column for values
  ) %>%
  mutate(
    percentage = as.numeric(percentage)
  )
)
```

```
## # A tibble: 55 x 3
##    faculty_type              year  percentage
##    <chr>                     <chr>      <dbl>
##  1 Full-Time Tenured Faculty 1975        29  
##  2 Full-Time Tenured Faculty 1989        27.6
##  3 Full-Time Tenured Faculty 1993        25  
##  4 Full-Time Tenured Faculty 1995        24.8
##  5 Full-Time Tenured Faculty 1999        21.8
##  6 Full-Time Tenured Faculty 2001        20.3
##  7 Full-Time Tenured Faculty 2003        19.3
##  8 Full-Time Tenured Faculty 2005        17.8
##  9 Full-Time Tenured Faculty 2007        17.2
## 10 Full-Time Tenured Faculty 2009        16.8
## # … with 45 more rows
```
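
A possible follow-up sketch on a two-row stand-in for the staff data (the `toy_*` names are hypothetical). It assumes tidyr >= 1.1.0 for `names_transform`, which converts `year` while pivoting, and uses `pivot_wider()` to show that the reshape can be reversed:

```r
library(tibble)
library(tidyr)

# Toy subset: two faculty types, two years.
toy_wide <- tibble(
  faculty_type = c("Full-Time Tenured Faculty", "Part-Time Faculty"),
  `1975` = c(29, 24),
  `1989` = c(27.6, 30.4)
)

# Wide -> long, converting the year labels to integers along the way.
toy_long <- pivot_longer(
  toy_wide,
  cols            = -faculty_type,
  names_to        = "year",
  names_transform = list(year = as.integer),
  values_to       = "percentage"
)

# Long -> wide again: pivot_wider() is the inverse operation.
toy_back <- pivot_wider(
  toy_long,
  names_from  = year,
  values_from = percentage
)

toy_long
toy_back
```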

---

## Meh

```r
ggplot(staff_long, aes(x = percentage, y = year, fill = faculty_type)) +
  geom_col(position = "dodge")
```

![](02-teach-tidyverse_files/figure-html/unnamed-chunk-17-1.png)

---

## Some improvement...

```r
ggplot(staff_long, aes(x = percentage, y = year, fill = faculty_type)) +
  geom_col()
```

![](02-teach-tidyverse_files/figure-html/unnamed-chunk-18-1.png)

---

## More improvement

![](02-teach-tidyverse_files/figure-html/staff-lines-2-1.png)

---

```r
staff_long %>%
  mutate( 
    part_time = if_else(faculty_type == "Part-Time Faculty",
                        "Part-Time Faculty", "Other Faculty"),
    year = as.numeric(year)
  ) %>% 
  ggplot(
    aes(x = year, y = percentage/100, group = faculty_type, color = part_time)
  ) +
  geom_line() +
  scale_color_manual(values = c("gray", "red")) + 
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) + 
  theme_minimal() +
  labs(
    title = "Instructional staff employment trends",
    x = "Year", y = "Percentage", color = NULL
  ) +
  theme(legend.position = "bottom")
```

---

# When to purrr?

---

## Data manipulation with purrr (or not?)

- purrr is a package for functional programming with the tidyverse

- If you picked up the tidyverse >2 years ago, purrr was commonly used for data science tasks that involve iteration

- In 2021, it's possible to do many of these data science tasks with dplyr and tidyr, and these approaches are often more approachable for new learners

.discussion[
How familiar are you with the purrr package? Have you taught purrr in your data science courses?
]

---

## Ex 1. Flattening JSON files

We have data on LEGO sales and some information on the buyers in JSON format. We want to convert it into a tidy data frame.

```r
sales <- read_rds("data/lego_sales.rds")
jsonlite::toJSON(sales[1], pretty = TRUE)
```

```
## [
##   {
##     "gender": ["Female"],
##     "first_name": ["Kimberly"],
##     "last_name": ["Beckstead"],
##     "age": [24],
##     "phone_number": ["216-555-2549"],
##     "purchases": [
##       {
##         "SetID": [24701],
##         "Number": ["76062"],
##         "Theme": ["DC Comics Super Heroes"],
##         "Subtheme": ["Mighty Micros"],
##         "Year": [2016],
##         "Name": ["Robin vs. Bane"],
##         "Pieces": [77],
##         "USPrice": [9.99],
##         "ImageURL": ["http://images.brickset.com/sets/images/76062-1.jpg"],
##         "Quantity": [1]
##       }
##     ]
##   }
## ]
```

---

## purrr solution

```r
sales %>%
  purrr::map_dfr(
    function(l) {
      purchases <- purrr::map_dfr(l$purchases, ~.)
      l$purchases <- NULL
      
      bind_cols(as_tibble(l), purchases)
    }
  )
```

---

## purrr solution

```
## # A tibble: 620 x 15
##    gender first_name last_name    age phone_number SetID Number Theme  Subtheme 
##    <chr>  <chr>      <chr>      <dbl> <chr>        <int> <chr>  <chr>  <chr>    
##  1 Female Kimberly   Beckstead     24 216-555-2549 24701 76062  DC Co… "Mighty …
##  2 Male   Neel       Garvin        35 819-555-3189 25626 70595  Ninja… "Rise of…
##  3 Male   Neel       Garvin        35 819-555-3189 24665 21031  Archi… ""       
##  4 Female Chelsea    Bouchard      41 <NA>         24695 31048  Creat… ""       
##  5 Female Chelsea    Bouchard      41 <NA>         25626 70595  Ninja… "Rise of…
##  6 Female Chelsea    Bouchard      41 <NA>         24721 10831  Duplo  ""       
##  7 Female Bryanna    Welsh         19 <NA>         24797 75138  Star … "Episode…
##  8 Female Bryanna    Welsh         19 <NA>         24701 76062  DC Co… "Mighty …
##  9 Male   Caleb      Garcia-Wi…    37 907-555-9236 24730 41115  Frien… ""       
## 10 Male   Caleb      Garcia-Wi…    37 907-555-9236 25611 21127  Minec… "Minifig…
## # … with 610 more rows, and 6 more variables: Year <int>, Name <chr>,
## #   Pieces <int>, USPrice <dbl>, ImageURL <chr>, Quantity <dbl>
```

---

## A tidyr solution

```r
tibble(sales = sales) %>%
  unnest_wider(sales) %>%
  unnest_longer(purchases) %>%
  unnest_wider(purchases)
```

---

## tidyr solution (Step 1)

```r
tibble(sales = sales)
```

```
## # A tibble: 250 x 1
##    sales           
##    <list>          
##  1 <named list [6]>
##  2 <named list [6]>
##  3 <named list [5]>
##  4 <named list [5]>
##  5 <named list [6]>
##  6 <named list [6]>
##  7 <named list [6]>
##  8 <named list [6]>
##  9 <named list [6]>
## 10 <named list [6]>
## # … with 240 more rows
```

---

## tidyr solution (Step 2)

```r
tibble(sales = sales) %>%
  unnest_wider(sales)
```

```
## # A tibble: 250 x 6
##    gender first_name last_name        age phone_number purchases 
##    <chr>  <chr>      <chr>          <dbl> <chr>        <list>    
##  1 Female Kimberly   Beckstead         24 216-555-2549 <list [1]>
##  2 Male   Neel       Garvin            35 819-555-3189 <list [2]>
##  3 Female Chelsea    Bouchard          41 <NA>         <list [3]>
##  4 Female Bryanna    Welsh             19 <NA>         <list [2]>
##  5 Male   Caleb      Garcia-Wideman    37 907-555-9236 <list [2]>
##  6 Male   Chase      Fortenberry       19 205-555-3704 <list [2]>
##  7 Male   Kevin      Cruz              20 947-555-7946 <list [1]>
##  8 Male   Connor     Brown             36 516-555-4310 <list [3]>
##  9 Female Toni       Borison           40 284-555-4560 <list [2]>
## 10 Male   Daniel     Hurst             44 251-555-0845 <list [2]>
## # … with 240 more rows
```

---

## tidyr solution (Step 3)

```r
tibble(sales = sales) %>%
  unnest_wider(sales) %>%
  unnest_longer(purchases)
```

```
## # A tibble: 620 x 6
##    gender first_name last_name        age phone_number purchases        
##    <chr>  <chr>      <chr>          <dbl> <chr>        <list>           
##  1 Female Kimberly   Beckstead         24 216-555-2549 <named list [10]>
##  2 Male   Neel       Garvin            35 819-555-3189 <named list [10]>
##  3 Male   Neel       Garvin            35 819-555-3189 <named list [10]>
##  4 Female Chelsea    Bouchard          41 <NA>         <named list [10]>
##  5 Female Chelsea    Bouchard          41 <NA>         <named list [10]>
##  6 Female Chelsea    Bouchard          41 <NA>         <named list [10]>
##  7 Female Bryanna    Welsh             19 <NA>         <named list [10]>
##  8 Female Bryanna    Welsh             19 <NA>         <named list [10]>
##  9 Male   Caleb      Garcia-Wideman    37 907-555-9236 <named list [10]>
## 10 Male   Caleb      Garcia-Wideman    37 907-555-9236 <named list [10]>
## # … with 610 more rows
```

---

## tidyr solution (Step 4)

```r
tibble(sales = sales) %>%
  unnest_wider(sales) %>%
  unnest_longer(purchases) %>%
  unnest_wider(purchases)
```
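
Since `data/lego_sales.rds` may not be at hand, the same four steps can be tried on a small made-up list. In this sketch `toy_sales` and its fields are hypothetical and only mimic the shape of `sales`:

```r
library(tibble)
library(tidyr)

# Hypothetical nested list: one element per buyer, each with a
# list of purchases.
toy_sales <- list(
  list(
    first_name = "Ann", age = 30,
    purchases = list(
      list(Name = "Set A", Quantity = 1),
      list(Name = "Set B", Quantity = 2)
    )
  ),
  list(
    first_name = "Bob", age = 41,
    purchases = list(
      list(Name = "Set C", Quantity = 1)
    )
  )
)

toy_flat <- tibble(sales = toy_sales) %>%
  unnest_wider(sales) %>%       # one column per buyer field
  unnest_longer(purchases) %>%  # one row per purchase
  unnest_wider(purchases)       # one column per purchase field

toy_flat
```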

---

## dplyr improvements

Another common use case for purrr has been working across rows and/or columns of a data frame.

Much of this functionality is now available directly in dplyr via the `across()` and `rowwise()` functions. Additional details and examples are available in the vignettes:
- [column-wise operations vignette](https://dplyr.tidyverse.org/articles/colwise.html)
- [row-wise operations vignette](https://dplyr.tidyverse.org/articles/rowwise.html)

and the dplyr 1.0.0 release blog posts:

- [working across columns](https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/)
- [working within rows](https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/)
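
A brief sketch of both functions on `mtcars`, assuming dplyr >= 1.0.0 (the column choice is arbitrary):

```r
library(dplyr)

# across(): apply the same summary to several columns at once.
col_means <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, disp, hp), mean), .groups = "drop")

# rowwise() + c_across(): compute a summary within each row.
row_max <- mtcars %>%
  rowwise() %>%
  mutate(max_val = max(c_across(c(mpg, disp, hp)))) %>%
  ungroup()

col_means
row_max
```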

---

# Resources

---

## Recommended reading

- Keep up to date with the [tidyverse blog](https://www.tidyverse.org/blog/) *for packages you teach*

- Four-part blog series: Teaching the Tidyverse from 2020
  - [Part 1: Getting started](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/)
  - [Part 2: Data visualisation](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-2-data-visualisation/)
  - [Part 3: Data wrangling and tidying](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-3-data-wrangling-and-tidying/)
  - [Part 4: When to purrr?](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-4-when-to-purrr/)

---

## The larger tidy ecosystem

- [janitor](https://garthtarr.github.io/meatR/janitor.html)

- [kableExtra](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)

- [patchwork](https://patchwork.data-imaginist.com/)

- [gghighlight](https://cran.r-project.org/web/packages/gghighlight/vignettes/gghighlight.html)

- [tidybayes](https://mjskay.github.io/tidybayes/)