welcoming learners

to data science

with the tidyverse

mine çetinkaya-rundel
duke university + posit

bit.ly/tidyperspective-dds

introduction

setting the scene

about me



Focus:

Data science for new learners



Philosophy:

Let them eat cake (first)!

about data science education



Assumption 1:

Teach authentic tools



Assumption 2:

Teach R as the authentic tool

takeaway



The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.

and that pathway starts with…


Introduction to Data Science

sta199-f22-1.github.io

List of topics in STA 199: Hello world, Exploring data (visualize, wrangle, import), Data science ethics (misrepresentation, data privacy, algorithmic bias), Making rigorous conclusions (model, predict, infer), Looking further.

principles of the tidyverse

tidyverse

meta R package that loads eight core packages when invoked and also bundles numerous other packages that share a design philosophy, common grammar, and data structures

library(tidyverse)
── Attaching packages ───────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2
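
The attach message above lists only the core packages. As a small sketch (assuming the tidyverse package is installed), `tidyverse_packages()` returns the names of the packages the meta-package bundles, which makes the difference between "core" and "bundled" concrete. The exact list depends on the installed version.

```r
# List the packages bundled with the tidyverse meta-package.
# Assumes the tidyverse is installed; the exact names depend on the version.
bundled <- tidyverse::tidyverse_packages(include_self = FALSE)

# The eight core packages from the attach message above are part of the bundle
core <- c("ggplot2", "tibble", "tidyr", "readr",
          "purrr", "dplyr", "stringr", "forcats")
all(core %in% bundled)

# The remaining entries are bundled but are not among those eight
setdiff(bundled, core)
```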


Data science cycle: import, tidy, transform, visualize, model, communicate. Packages readr and tibble are for import. Packages tidyr and purrr are for tidy and transform. Packages dplyr, stringr, forcats, and lubridate are for transform. Package ggplot2 is for visualize.

examples: two “simple” tasks

grouped summary statistics:

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

Homeownership   Average loan amount   Number of applicants
Mortgage        $18,132               4,778
Own             $15,665               1,350
Rent            $14,396               3,848

multivariable data visualizations:

Create side-by-side box plots that show the relationship between loan amount and application type based on homeownership.

teaching with the tidyverse

task 1 - step 1

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans
# A tibble: 9,976 × 3
  loan_amount homeownership application_type
        <int> <chr>         <fct>           
1       28000 Mortgage      individual      
2        5000 Rent          individual      
3        2000 Rent          individual      
4       21600 Rent          individual      
5       23000 Rent          joint           
6        5000 Own           individual      
# … with 9,970 more rows

task 1 - step 2

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership)
# A tibble: 9,976 × 3
# Groups:   homeownership [3]
  loan_amount homeownership application_type
        <int> <chr>         <fct>           
1       28000 Mortgage      individual      
2        5000 Rent          individual      
3        2000 Rent          individual      
4       21600 Rent          individual      
5       23000 Rent          joint           
6        5000 Own           individual      
# … with 9,970 more rows

task 1 - step 3

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount)
    )
# A tibble: 3 × 2
  homeownership avg_loan_amount
  <chr>                   <dbl>
1 Mortgage               18132.
2 Own                    15665.
3 Rent                   14396.

task 1 - step 4

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    )
# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

task 1 - step 5

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    ) |>
  arrange(desc(avg_loan_amount))
# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

task 1 with the tidyverse

[input] data frame

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    ) |>
  arrange(desc(avg_loan_amount))
# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

[output] data frame

  • always start with a data frame and end with a data frame
  • variables are always accessed from within data frames
  • more verbose (than some other approaches), but also more expressive and extensible
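
To illustrate the "expressive and extensible" point, here is a minimal sketch that adds one more summary statistic to the pipeline. It uses a tiny made-up stand-in for `loans` (`loans_toy`, with hypothetical values, not the data from the talk) so that it runs on its own. Both the input and the output are data frames.

```r
library(dplyr)

# Hypothetical toy data standing in for `loans`; values are made up.
loans_toy <- tibble(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

loans_summary <- loans_toy |>
  group_by(homeownership) |>
  summarize(
    avg_loan_amount    = mean(loan_amount),
    median_loan_amount = median(loan_amount),  # the one added line
    n_applicants       = n()
  ) |>
  arrange(desc(avg_loan_amount))

loans_summary
```

Adding the median needs one extra line inside `summarize()`; the grouping and sorting steps stay as they were.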

task 1 with aggregate()

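# Count applicants per homeownership category
ns <- aggregate(
  loan_amount ~ homeownership, 
  data = loans, FUN = length
  )
names(ns)[2] <- "n_applicants"

# Average loan amount per homeownership category
avgs <- aggregate(
  loan_amount ~ homeownership, 
  data = loans, FUN = mean
  )
names(avgs)[2] <- "avg_loan_amount"

# Combine the two results and sort by average loan amount
result <- merge(ns, avgs)
result[order(result$avg_loan_amount, 
             decreasing = TRUE), ]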
  homeownership n_applicants avg_loan_amount
1      Mortgage         4778        18132.45
2           Own         1350        15665.44
3          Rent         3848        14396.44

challenges: need to introduce

  • formula syntax
  • passing functions as arguments
  • merging datasets
  • square bracket notation for accessing rows

task 1 with tapply()

sort(
  tapply(loans$loan_amount, 
         loans$homeownership, 
         mean),
  decreasing = TRUE
  )
Mortgage      Own     Rent 
18132.45 15665.44 14396.44 

challenges: need to introduce

  • passing functions as arguments
  • distinguishing between the various apply() functions
  • ending up with a new data structure (array)
  • reading nested functions
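
The "new data structure" challenge can be made concrete with a small sketch. Using made-up toy data (`loans_toy` is hypothetical), `tapply()` returns a named one-dimensional array rather than a data frame, so an extra conversion step is needed before the result can feed into data-frame-based tools.

```r
# Hypothetical toy data standing in for `loans`; values are made up.
loans_toy <- data.frame(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

avgs <- tapply(loans_toy$loan_amount, loans_toy$homeownership, mean)

is.data.frame(avgs)  # the result is not a data frame
is.array(avgs)       # it is a named one-dimensional array

# One way to get back to a data frame, sorted by average loan amount
avgs_df <- data.frame(
  homeownership   = names(avgs),
  avg_loan_amount = as.vector(avgs)
)
avgs_df <- avgs_df[order(avgs_df$avg_loan_amount, decreasing = TRUE), ]
avgs_df
```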

task 2 - step 1

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans)

task 2 - step 2

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type))

task 2 - step 3

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount))

task 2 - step 4

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount)) +
  geom_boxplot()

task 2 - step 5

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~ homeownership)
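
The same incremental pattern extends beyond the five steps above. As a sketch on made-up toy data (`loans_toy` is hypothetical, not the data from the talk), one more `+` adds axis labels and a title without touching the earlier lines.

```r
library(ggplot2)

# Hypothetical toy data standing in for `loans`; values are made up.
loans_toy <- data.frame(
  loan_amount      = c(28000, 5000, 2000, 21600, 23000, 5000, 12000, 9000),
  homeownership    = c("Mortgage", "Rent", "Rent", "Rent",
                       "Rent", "Own", "Mortgage", "Own"),
  application_type = c("individual", "individual", "individual", "individual",
                       "joint", "individual", "joint", "joint")
)

p <- ggplot(loans_toy,
            aes(x = application_type,
                y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~ homeownership) +
  labs(                          # the one added component
    x = "Application type",
    y = "Loan amount (USD)",
    title = "Loan amount by application type and homeownership"
  )

p
```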

task 2 with boxplot()

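# Homeownership categories, one panel per category
# (named ownership_levels to avoid masking base R's levels())
ownership_levels <- sort(unique(loans$homeownership))

# Split the data by hand, one subset per category
loans1 <- loans[loans$homeownership == ownership_levels[1], ]
loans2 <- loans[loans$homeownership == ownership_levels[2], ]
loans3 <- loans[loans$homeownership == ownership_levels[3], ]

# Arrange the three plots in one row
par(mfrow = c(1, 3))

boxplot(loan_amount ~ application_type, 
        data = loans1, main = ownership_levels[1])
boxplot(loan_amount ~ application_type, 
        data = loans2, main = ownership_levels[2])
boxplot(loan_amount ~ application_type, 
        data = loans3, main = ownership_levels[3])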

task 2 with boxplot()

we could keep going, but…

tools designed for specific tasks vs. general tools

On one side, Lego city sets; on the other side, a Lego base plate and loose classic Lego pieces.

final thoughts

pedagogical strengths of the tidyverse

Concept                Description
Consistency            Syntax, function interfaces, argument names, and argument orders follow consistent patterns
Mixability             Ability to use base R and other functions within the tidyverse
Scalability            Unified approach to data wrangling and visualization works for datasets of a wide range of types and sizes
User-centered design   Function interfaces designed and improved with users in mind
Readability            Interfaces that are designed to produce readable code
Community              Large, active, welcoming community of users and resources
Transferability        Data manipulation verbs inherit from SQL’s query syntax
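
As a small illustration of the mixability row, the sketch below uses base R functions (`ifelse()`, `median()`) inside dplyr verbs, and then continues with base R subsetting on the result. The data are made up (`loans_toy` is hypothetical, not the data from the talk).

```r
library(dplyr)

# Hypothetical toy data standing in for `loans`; values are made up.
loans_toy <- tibble(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

loans_mixed <- loans_toy |>
  mutate(
    # base R ifelse() inside a dplyr verb
    loan_size = ifelse(loan_amount >= 20000, "large", "small")
  ) |>
  group_by(loan_size) |>
  summarize(
    # base R median() inside summarize()
    median_loan_amount = median(loan_amount)
  )

# The result is still a data frame, so base R indexing works on it too
loans_mixed[loans_mixed$loan_size == "large", ]
```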

keeping up with the tidyverse

  • Blog posts highlight updates, along with the reasoning behind them and worked examples

  • Lifecycle stages and badges

    Lifecycle stages of tidyverse functions and packages: experimental, stable, deprecated, superseded.

building a curriculum

the curriculum we’ve built @ duke statsci

  • STA 199: Introduction to Data Science

  • courses:

    • STA 198: Introduction to Global Health Data Science
    • STA 210: Regression Analysis
    • STA 323: Statistical Computing
    • STA 440: Case Studies
  • programs:

    • Inter-departmental major in Data Science (with CS)
    • Data Science concentration for the StatSci major
  • and more…

learn / teach the tidyverse

learn the tidyverse

tidyverse.org


teach the tidyverse

datasciencebox.org


further reading

+ collaborators

  • Johanna Hardin, Pomona College
  • Benjamin S. Baumer, Smith College
  • Amelia McNamara, University of St Thomas
  • Nicholas J. Horton, Amherst College
  • Colin W. Rundel, Duke University

Screenshot of the paper titled "An educator's perspective of the tidyverse" from the journal (TISE) website. Shows the title of the paper, the names and affiliations of authors, and part of the abstract.

thank you!

bit.ly/tidyperspective-dds