introduction

collaborators

Johanna Hardin, Pomona College
Benjamin S. Baumer, Smith College
Amelia McNamara, University of St Thomas
Nicholas J. Horton, Amherst College
Colin W. Rundel, Duke University

setting the scene

Assumption 1:

Teach authentic tools

Assumption 2:

Teach R as the authentic tool

takeaway

The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.

principles of the tidyverse

tidyverse

meta R package that attaches eight core packages when loaded (nine as of tidyverse 2.0.0, which added lubridate) and installs numerous other packages alongside them
tidyverse packages share a design philosophy, common grammar, and data structures

setup

Data: Thousands of loans made through Lending Club, a peer-to-peer lending platform. The data are available in the openintro package as loans_full_schema and are used here with a few modifications.

library(tidyverse)
library(openintro)

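loans <- loans_full_schema |>
  mutate(
    # Convert the homeownership labels to title case
    homeownership = str_to_title(homeownership),
    # Collapse the count of public record bankruptcies into a Yes/No indicator
    bankruptcy = if_else(public_record_bankrupt >= 1, "Yes", "No")
  ) |>
  # Drop loans with a reported annual income below 10
  filter(annual_income >= 10) |>
  # Keep six variables
  select(
    loan_amount, homeownership, bankruptcy,
    application_type, annual_income, interest_rate
  )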

start with a data frame

loans

# A tibble: 9,976 × 6
  loan_amount homeownership bankruptcy application_type annual_income interest…¹
        <int> <chr>         <chr>      <fct>                    <dbl>      <dbl>
1       28000 Mortgage      No         individual               90000      14.1 
2        5000 Rent          Yes        individual               40000      12.6 
3        2000 Rent          No         individual               40000      17.1 
4       21600 Rent          No         individual               30000       6.72
5       23000 Rent          No         joint                    35000      14.1 
6        5000 Own           No         individual               34000       6.72
# … with 9,970 more rows, and abbreviated variable name ¹interest_rate

tidy data

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
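
As a minimal sketch of the three principles above, the following reshapes a small untidy table into tidy form. The data frame `untidy`, its borrower labels, and its values are hypothetical and are not part of the loans data; the example assumes the tidyr and tibble packages are installed.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: the year variable is spread across column names,
# so each row holds two observations (one per year)
untidy <- tibble(
  borrower = c("A", "B"),
  `2021` = c(1000, 2000),
  `2022` = c(1500, 2500)
)

# Tidy version: one column per variable (borrower, year, loan_amount)
# and one row per observation (a borrower in a given year)
tidy <- untidy |>
  pivot_longer(
    cols = c(`2021`, `2022`),
    names_to = "year",
    values_to = "loan_amount"
  )

tidy
```

After reshaping, `year` is an ordinary variable that can be filtered, grouped, or plotted like any other column.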

task: calculate a summary statistic

Calculate the mean loan amount.

mean(loan_amount)

Error in mean(loan_amount): object 'loan_amount' not found

accessing a variable

Approach 1: With attach():

attach(loans)
mean(loan_amount)

[1] 16357.53

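Not recommended. Suppose you were working concurrently with another data frame, car_loans, that also has a variable called loan_amount. With both attached, the one attached last masks the other, so mean(loan_amount) may not refer to the data you intended.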
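
The masking problem can be sketched with two small data frames. Both `loans_demo` and `car_loans` below are hypothetical stand-ins with made-up values, used only to illustrate how attach() resolves a name that appears in more than one attached data frame.

```r
# Hypothetical data frames that share a variable name
loans_demo <- data.frame(loan_amount = c(1000, 2000, 3000))
car_loans  <- data.frame(loan_amount = c(10, 20, 30))

attach(loans_demo)
attach(car_loans)  # R reports that loan_amount is masked from loans_demo

# The bare name now resolves to the most recently attached data frame
masked_mean <- mean(loan_amount)
masked_mean

# Clean up the search path
detach(car_loans)
detach(loans_demo)
```

Here `masked_mean` is the mean of `car_loans$loan_amount`, not of `loans_demo$loan_amount`, even though the call itself gives no indication of which data frame was used.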

Approach 2: Using $:

mean(loans$loan_amount)

[1] 16357.53

Approach 3: Using with():

with(loans, mean(loan_amount))

[1] 16357.53

Approach 4: The tidyverse approach:

loans |>
  summarise(mean_loan_amount = mean(loan_amount))

# A tibble: 1 × 1
  mean_loan_amount
             <dbl>
1           16358.

More verbose
But also more expressive and extensible
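
One way to see the extensibility is that the same summarise() call grows to several statistics and to grouped summaries without changing shape. The sketch below uses a hypothetical `toy_loans` tibble with made-up values rather than the full loans data, and assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the loans data frame
toy_loans <- tibble(
  homeownership = c("Rent", "Rent", "Own", "Mortgage", "Mortgage"),
  loan_amount   = c(5000, 2000, 5000, 28000, 12000)
)

# Same pattern as before, extended with a grouping step and more summaries
by_ownership <- toy_loans |>
  group_by(homeownership) |>
  summarise(
    mean_loan_amount   = mean(loan_amount),
    median_loan_amount = median(loan_amount),
    n_loans            = n()
  )

by_ownership
```

The result is itself a data frame with one row per homeownership group, so it can be piped directly into further steps such as arrange() or a plot.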