welcoming learners

to data science

with the tidyverse

mine çetinkaya-rundel
duke university + posit

bit.ly/tidyperspective-dds

@minebocek

introduction

setting the scene

about me

Female teacher icon

Focus:

Data science for new learners

Cake icon

Philosophy:

Let them eat cake (first)!

about data science education

Code icon

Assumption 1:

Teach authentic tools

Code icon with R logo

Assumption 2:

Teach R as the authentic tool

takeaway

The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.

and that pathway starts with…

Introduction to Data Science

sta199-f22-1.github.io

List of topics in STA 199: Hello world, Exploring data (visualize, wrangle, import), Data science ethics (misrepresentation, data privacy, algorithmic bias), Making rigorous conclusions (model, predict, infer), Looking further.

principles of the tidyverse

tidyverse

meta R package that loads eight core packages when invoked and also bundles numerous other packages that share a design philosophy, common grammar, and data structures

library(tidyverse)

── Attaching packages ───────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2

Data science cycle: import, tidy, transform, visualize, model, communicate. Packages readr and tibble are for import. Packages tidyr and purrr are for tidy and transform. Packages dplyr, stringr, forcats, and lubridate are for transform. Package ggplot2 is for visualize.

examples: two “simple” tasks

grouped summary statistics:

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

Homeownership	Average loan amount	Number of applicants
Mortgage	$18,132	4,778
Own	$15,665	1,350
Rent	$14,396	3,848

multivariable data visualizations:

Create side-by-side box plots that show the relationship between loan amount and application type based on homeownership.

teaching with the tidyverse

task 1 - step 1

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans

# A tibble: 9,976 × 3
  loan_amount homeownership application_type
        <int> <chr>         <fct>           
1       28000 Mortgage      individual      
2        5000 Rent          individual      
3        2000 Rent          individual      
4       21600 Rent          individual      
5       23000 Rent          joint           
6        5000 Own           individual      
# … with 9,970 more rows

task 1 - step 2

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership)

# A tibble: 9,976 × 3
# Groups:   homeownership [3]
  loan_amount homeownership application_type
        <int> <chr>         <fct>           
1       28000 Mortgage      individual      
2        5000 Rent          individual      
3        2000 Rent          individual      
4       21600 Rent          individual      
5       23000 Rent          joint           
6        5000 Own           individual      
# … with 9,970 more rows

task 1 - step 3

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount)
    )

# A tibble: 3 × 2
  homeownership avg_loan_amount
  <chr>                   <dbl>
1 Mortgage               18132.
2 Own                    15665.
3 Rent                   14396.

task 1 - step 4

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    )

# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

task 1 - step 5

Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    ) |>
  arrange(desc(avg_loan_amount))

# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

task 1 with the tidyverse

[input] data frame

loans |>
  group_by(homeownership) |> 
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
    ) |>
  arrange(desc(avg_loan_amount))

# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848

[output] data frame

always start with a data frame and end with a data frame
variables are always accessed from within data frames
more verbose (than some other approaches), but also more expressive and extensible
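
As a rough illustration of the "expressive and extensible" point, here is a minimal sketch on a small made-up data frame (`toy_loans` and its values are hypothetical, not the `loans` data used in the slides). Adding one more summary takes one more line inside `summarize()`, and the result is still a data frame.

```r
library(dplyr)

# Hypothetical toy data standing in for the loans data frame
toy_loans <- tibble(
  loan_amount   = c(10000, 20000, 6000, 8000, 12000),
  homeownership = c("Mortgage", "Mortgage", "Rent", "Rent", "Own")
)

toy_summary <- toy_loans |>
  group_by(homeownership) |>
  summarize(
    avg_loan_amount    = mean(loan_amount),
    median_loan_amount = median(loan_amount),  # the one added line
    n_applicants       = n()
  ) |>
  arrange(desc(avg_loan_amount))

toy_summary
```

The pipeline keeps the same shape as in task 1: data frame in, data frame out, with each step readable on its own line.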

task 1 with `aggregate()`

ns <- aggregate(
  loan_amount ~ homeownership, 
  data = loans, FUN = length
  )
names(ns)[2] <- "n_applicants"

avgs <- aggregate(
  loan_amount ~ homeownership, 
  data = loans, FUN = mean
  )
names(avgs)[2] <- "avg_loan_amount"

result <- merge(ns, avgs)
result[order(result$avg_loan_amount, 
             decreasing = TRUE), ]

  homeownership n_applicants avg_loan_amount
1      Mortgage         4778        18132.45
2           Own         1350        15665.44
3          Rent         3848        14396.44

challenges: need to introduce

formula syntax
passing functions as arguments
merging datasets
square bracket notation for accessing rows
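
One might ask why the `aggregate()` version needs two calls and a merge. The sketch below, on hypothetical toy data (`toy_loans` is made up for illustration), tries to compute both statistics in a single call. It runs, but the two statistics end up inside a matrix column rather than as separate columns, which is yet another structure a new learner would have to be taught.

```r
# Hypothetical toy data standing in for the loans data frame
toy_loans <- data.frame(
  loan_amount   = c(10000, 20000, 6000, 8000, 12000),
  homeownership = c("Mortgage", "Mortgage", "Rent", "Rent", "Own")
)

# Pass a function that returns two values per group
both <- aggregate(
  loan_amount ~ homeownership,
  data = toy_loans,
  FUN = function(x) c(avg = mean(x), n = length(x))
)

ncol(both)                   # two columns, not three
is.matrix(both$loan_amount)  # the statistics sit in a matrix column
```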

task 1 with `tapply()`

sort(
  tapply(loans$loan_amount, 
         loans$homeownership, 
         mean),
  decreasing = TRUE
  )

Mortgage      Own     Rent 
18132.45 15665.44 14396.44

challenges: need to introduce

passing functions as arguments
distinguishing between the various apply() functions
ending up with a new data structure (array)
reading nested functions
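
To make the "new data structure" challenge concrete, here is a small sketch on hypothetical toy data (`toy_loans` is made up for illustration). It shows that `tapply()` returns a named array rather than a data frame, and that getting back to a data frame takes an extra explicit step.

```r
# Hypothetical toy data standing in for the loans data frame
toy_loans <- data.frame(
  loan_amount   = c(10000, 20000, 6000, 8000, 12000),
  homeownership = c("Mortgage", "Mortgage", "Rent", "Rent", "Own")
)

avgs <- tapply(toy_loans$loan_amount, toy_loans$homeownership, mean)

is.data.frame(avgs)  # FALSE
is.array(avgs)       # TRUE: a one-dimensional named array

# Converting back to a data frame has to be done by hand
avgs_df <- data.frame(
  homeownership   = names(avgs),
  avg_loan_amount = as.vector(avgs)
)
avgs_df
```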

task 2 - step 1

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans)

task 2 - step 2

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type))

task 2 - step 3

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount))

task 2 - step 4

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount)) +
  geom_boxplot()

task 2 - step 5

Create side-by-side box plots that show the relationship between loan amount and application type, faceted by homeownership.

ggplot(loans, 
       aes(x = application_type,
           y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~ homeownership)

task 2 with `boxplot()`

levels <- sort(unique(loans$homeownership))

loans1 <- loans[loans$homeownership == levels[1],]
loans2 <- loans[loans$homeownership == levels[2],]
loans3 <- loans[loans$homeownership == levels[3],]

par(mfrow = c(1, 3))

boxplot(loan_amount ~ application_type, 
        data = loans1, main = levels[1])
boxplot(loan_amount ~ application_type, 
        data = loans2, main = levels[2])
boxplot(loan_amount ~ application_type, 
        data = loans3, main = levels[3])
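
The three near-identical blocks above could be folded into a loop. The sketch below does this on hypothetical toy data (`toy_loans` is made up for illustration). It removes the copy-paste, but only by introducing `split()`, list indexing, and iteration, which are further concepts a new learner would need first.

```r
# Hypothetical toy data standing in for the loans data frame
toy_loans <- data.frame(
  loan_amount      = c(10000, 20000, 6000, 8000, 12000, 9000),
  application_type = factor(c("individual", "joint", "individual",
                              "joint", "individual", "joint")),
  homeownership    = c("Mortgage", "Mortgage", "Rent", "Rent", "Own", "Own")
)

# One data frame per homeownership level, in sorted order
by_ownership <- split(toy_loans, toy_loans$homeownership)

# One panel per level, side by side; remember the old layout to restore it
old_par <- par(mfrow = c(1, length(by_ownership)))
for (level in names(by_ownership)) {
  boxplot(loan_amount ~ application_type,
          data = by_ownership[[level]], main = level)
}
par(old_par)
```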

we could keep going, but…

tools designed for specific tasks vs. general tools

On one side, Lego City sets; on the other side, a Lego base plate and loose classic Lego pieces.

final thoughts

pedagogical strengths of the tidyverse

Concept	Description
Consistency	Syntax, function interfaces, argument names, and orders follow patterns
Mixability	Ability to use base R and other functions within the tidyverse
Scalability	Unified approach to data wrangling and visualization works for datasets of a wide range of types and sizes
User-centered design	Function interfaces designed and improved with users in mind
Readability	Interfaces that are designed to produce readable code
Community	Large, active, welcoming community of users and resources
Transferability	Data manipulation verbs inherit from SQL’s query syntax
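
As a small illustration of the "Mixability" row, the sketch below uses base R functions (`round()` and `max()`) inside dplyr verbs. The data frame `toy_loans` and the column names `loan_thousands` and `max_loan_thousands` are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for the loans data frame
toy_loans <- tibble(
  loan_amount   = c(10000, 20000, 6000, 8000, 12000),
  homeownership = c("Mortgage", "Mortgage", "Rent", "Rent", "Own")
)

mixed <- toy_loans |>
  mutate(loan_thousands = round(loan_amount / 1000)) |>  # base R round()
  group_by(homeownership) |>
  summarize(max_loan_thousands = max(loan_thousands))    # base R max()

mixed
```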

keeping up with the tidyverse

Blog posts highlight updates, along with the reasoning behind them and worked examples
Lifecycle stages and badges

building a curriculum

Start with library(tidyverse)
Teach by learning goals, not packages

the curriculum we’ve built @ duke statsci

STA 199: Introduction to Data Science
courses:
- STA 198: Introduction to Global Health Data Science
- STA 210: Regression Analysis
- STA 323: Statistical Computing
- STA 440: Case Studies
programs:
- Inter-departmental major in Data Science (with CS)
- Data Science concentration for the StatSci major
and more…

learn / teach the tidyverse

learn the tidyverse

tidyverse.org

Tidyverse hex logo

teach the tidyverse

datasciencebox.org

Data science in a box hex logo

thank you!

bit.ly/tidyperspective-dds
