Assumption 1: Teach authentic tools
Assumption 2: Teach R as the authentic tool
The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.
Data: Thousands of loans made through Lending Club, a peer-to-peer lending platform. The data are available in the openintro package and are used here with a few modifications.
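library(tidyverse)
library(openintro)

loans <- loans_full_schema %>%
  # tidy up labels and derive a Yes/No bankruptcy indicator
  mutate(
    homeownership = str_to_title(homeownership),
    bankruptcy = if_else(public_record_bankrupt >= 1, "Yes", "No")
  ) %>%
  # keep applicants with an annual income of at least 10
  filter(annual_income >= 10) %>%
  # keep only the variables used in the examples
  select(
    loan_amount, homeownership, bankruptcy,
    application_type, annual_income, interest_rate
  )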
# A tibble: 9,976 × 6
loan_amount homeownership bankruptcy application_type annual_income interest…¹
<int> <chr> <chr> <fct> <dbl> <dbl>
1 28000 Mortgage No individual 90000 14.1
2 5000 Rent Yes individual 40000 12.6
3 2000 Rent No individual 40000 17.1
4 21600 Rent No individual 30000 6.72
5 23000 Rent No joint 35000 14.1
6 5000 Own No individual 34000 6.72
# … with 9,970 more rows, and abbreviated variable name ¹interest_rate
Calculate the mean loan amount.
mean(loan_amount)
Error in mean(loan_amount): object 'loan_amount' not found
Approach 1: With attach():
Not recommended. What if you were working concurrently with another data frame called car_loans that also had a variable called loan_amount in it?
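A minimal sketch of that hazard, under stated assumptions: loans_head is a stand-in built from the loan amounts in the six rows printed earlier, and the car_loans values are made up purely for illustration.

```r
# Stand-in data: loan amounts from the six printed rows of loans
loans_head <- data.frame(loan_amount = c(28000, 5000, 2000, 21600, 23000, 5000))
# Hypothetical second data frame (values invented for this sketch)
car_loans <- data.frame(loan_amount = c(15000, 9000))

attach(loans_head)
mean_after_first <- mean(loan_amount)   # resolved inside loans_head

attach(car_loans)                       # R reports that loan_amount is now masked
mean_after_second <- mean(loan_amount)  # silently resolved inside car_loans instead

# Clean up the search path
detach(car_loans)
detach(loans_head)
```

The same expression, mean(loan_amount), gives two different answers depending on which data frame was attached last, which is the ambiguity the question above points at.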
Approach 2: Using $:
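A sketch of what the $ route might look like, assuming a stand-in data frame loans_head built from the loan amounts in the six rows printed earlier. The with() variant is included only because the text mentions with() as an additional helper function; it is not necessarily the author's exact code.

```r
# Stand-in data: loan amounts from the six printed rows of loans
loans_head <- data.frame(loan_amount = c(28000, 5000, 2000, 21600, 23000, 5000))

# Pull the column out as a vector with $, then average it
mean_dollar <- mean(loans_head$loan_amount)

# Evaluate the expression inside the data frame using with()
mean_with <- with(loans_head, mean(loan_amount))
```

Both calls return the same number; the difference is in how the variable is reached, either extracted as a standalone vector or evaluated inside the data frame through an extra function.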
Approach 4: The tidyverse approach:
# A tibble: 1 × 1
mean_loan_amount
<dbl>
1 16358.
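Output of this shape is what a summarize() call produces. The sketch below assumes a stand-in data frame loans_head built from the loan amounts in the six rows printed earlier, and takes the column name mean_loan_amount from the printed result; on the full loans data the value would be the one shown above rather than the one computed here.

```r
library(dplyr)

# Stand-in data: loan amounts from the six printed rows of loans
loans_head <- data.frame(loan_amount = c(28000, 5000, 2000, 21600, 23000, 5000))

# The data frame goes in, the variable is referenced by bare name,
# and a one-row data frame comes out
result <- loans_head %>%
  summarize(mean_loan_amount = mean(loan_amount))
```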
tidyverse functions take a data argument that allows them to localize computations inside the specified data frame.
This does not muddy the concept of what is in the current environment: variables are always accessed from within a data frame, never as a standalone vector, and without an additional function (like with()) or quotation marks.
Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.
Homeownership | Average loan amount | Number of applicants |
---|---|---|
Mortgage | $18,132 | 4,778 |
Own | $15,665 | 1,350 |
Rent | $14,396 | 3,848 |
[input] data frame
# A tibble: 9,976 × 6
# Groups: homeownership [3]
loan_amount homeownership bankruptcy application_type annual_income interest…¹
<int> <chr> <chr> <fct> <dbl> <dbl>
1 28000 Mortgage No individual 90000 14.1
2 5000 Rent Yes individual 40000 12.6
3 2000 Rent No individual 40000 17.1
4 21600 Rent No individual 30000 6.72
5 23000 Rent No joint 35000 14.1
6 5000 Own No individual 34000 6.72
# … with 9,970 more rows, and abbreviated variable name ¹interest_rate
data frame [output]
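The "# Groups: homeownership [3]" header suggests this is the result of a group_by() step. A small sketch, assuming a stand-in data frame loans_head built from two columns of the six printed rows, shows that grouping leaves the rows untouched and only records grouping metadata:

```r
library(dplyr)

# Stand-in data: two columns from the six printed rows of loans
loans_head <- data.frame(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

grouped <- loans_head %>%
  group_by(homeownership)

still_df    <- is.data.frame(grouped)  # the output is still a data frame
grouping    <- group_vars(grouped)     # name of the grouping variable
group_count <- n_groups(grouped)       # number of distinct groups
same_rows   <- nrow(grouped) == nrow(loans_head)
```

Because the output is still a data frame, it can be piped straight into the next verb.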
[input] data frame
loans %>%
  group_by(homeownership) %>%
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
  ) %>%
  arrange(desc(avg_loan_amount))
# A tibble: 3 × 3
homeownership avg_loan_amount n_applicants
<chr> <dbl> <int>
1 Mortgage 18132. 4778
2 Own 15665. 1350
3 Rent 14396. 3848
[output] data frame
aggregate() brings in several additional concepts:
formula syntax
passing functions as arguments
merging datasets
square bracket notation for accessing rows
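One way the same task could be written with aggregate(), touching each of the concepts listed above. This is a sketch under stated assumptions, not necessarily the code from the slides: loans_head is a stand-in built from two columns of the six printed rows, and the output column names are borrowed from the tidyverse version.

```r
# Stand-in data: two columns from the six printed rows of loans
loans_head <- data.frame(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

# Formula syntax plus a function passed as an argument
avg <- aggregate(loan_amount ~ homeownership, data = loans_head, FUN = mean)
cnt <- aggregate(loan_amount ~ homeownership, data = loans_head, FUN = length)
names(avg)[2] <- "avg_loan_amount"
names(cnt)[2] <- "n_applicants"

# Merge the two summaries into one data frame
res <- merge(avg, cnt, by = "homeownership")

# Square bracket notation to reorder rows by descending average
res <- res[order(res$avg_loan_amount, decreasing = TRUE), ]
```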
tapply()
Mortgage Own Rent
18132.45 15665.44 14396.44
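Output like the above is consistent with a tapply() call along the following lines. The sketch assumes a stand-in data frame loans_head built from two columns of the six printed rows, so its numbers differ from the full-data averages shown.

```r
# Stand-in data: two columns from the six printed rows of loans
loans_head <- data.frame(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

# Values, grouping vector, and the function to apply within each group
avg_by_home <- tapply(loans_head$loan_amount, loans_head$homeownership, mean)

is_arr <- is.array(avg_by_home)       # TRUE: the result is a named array
is_df  <- is.data.frame(avg_by_home)  # FALSE: not a data frame
```

The result holds the group averages only; the counts and the ordering would need further steps.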
Not so good:
the apply() functions
a different output structure (array)
Many more comparative examples are in the paper.
We are all converts to the tidyverse and have made a conscious choice to use it in our research and our teaching. We each learned R without the tidyverse and have all spent quite a few years teaching without it at a variety of levels from undergraduate introductory statistics courses to graduate statistical computing courses. This paper is a synthesis of the reasons supporting our tidyverse choice, along with benefits and challenges associated with teaching statistics with the tidyverse.