welcoming learners
to data science
with the tidyverse
mine çetinkaya-rundel
duke university + posit
Focus:
Data science for new learners
Philosophy:
Let them eat cake (first)!
Assumption 1:
Teach authentic tools
Assumption 2:
Teach R as the authentic tool
The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.
Introduction to Data Science
The tidyverse is a meta R package that loads eight core packages when invoked and also bundles numerous other packages that share a design philosophy, common grammar, and data structures.
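As a minimal sketch of what "meta package" means in practice (assuming the tidyverse package is installed), attaching it once makes the core packages available, and `tidyverse_packages()` lists the names of everything it bundles:

```r
# Attaching the meta package attaches the core packages in one step
library(tidyverse)

# Names of all packages bundled under the tidyverse umbrella,
# including those that are installed but not attached by default
pkgs <- tidyverse_packages()
print(pkgs)
```

The non-core packages still need their own `library()` call before use.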
Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.
Homeownership | Average loan amount | Number of applicants |
---|---|---|
Mortgage | $18,132 | 4,778 |
Own | $15,665 | 1,350 |
Rent | $14,396 | 3,848 |
Create side-by-side box plots that show the relationship between loan amount and application type based on homeownership.
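One possible tidyverse approach to this task is sketched below with ggplot2, using `facet_wrap()` to produce one panel per homeownership level. This is an illustration, not necessarily the code shown in the talk. `loans_demo` is a hypothetical stand-in for the `loans` data with made-up values, included only so the sketch runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for `loans`: same column names, made-up values
loans_demo <- data.frame(
  loan_amount = c(28000, 5000, 2000, 21600, 23000, 5000, 12000, 30000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent",
                    "Rent", "Own", "Own", "Mortgage"),
  application_type = c("individual", "individual", "individual", "individual",
                       "joint", "individual", "joint", "joint")
)

# One box per application type, one panel per homeownership level
p <- ggplot(loans_demo, aes(x = application_type, y = loan_amount)) +
  geom_boxplot() +
  facet_wrap(~homeownership) +
  labs(x = "Application type", y = "Loan amount")

p
```

With the real data, replacing `loans_demo` with `loans` should be the only change needed, since the column names match.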
# A tibble: 9,976 × 3
# Groups:   homeownership [3]
  loan_amount homeownership application_type
        <int> <chr>         <fct>
1       28000 Mortgage      individual
2        5000 Rent          individual
3        2000 Rent          individual
4       21600 Rent          individual
5       23000 Rent          joint
6        5000 Own           individual
# … with 9,970 more rows
Based on the applicants’ home ownership status, compute the average loan amount and the number of applicants. Display the results in descending order of average loan amount.
[input] data frame
loans |>
  group_by(homeownership) |>
  summarize(
    avg_loan_amount = mean(loan_amount),
    n_applicants = n()
  ) |>
  arrange(desc(avg_loan_amount))
# A tibble: 3 × 3
  homeownership avg_loan_amount n_applicants
  <chr>                   <dbl>        <int>
1 Mortgage               18132.         4778
2 Own                    15665.         1350
3 Rent                   14396.         3848
[output] data frame
aggregate()
ns <- aggregate(
  loan_amount ~ homeownership,
  data = loans, FUN = length
)
names(ns)[2] <- "n_applicants"
avgs <- aggregate(
  loan_amount ~ homeownership,
  data = loans, FUN = mean
)
names(avgs)[2] <- "avg_loan_amount"
result <- merge(ns, avgs)
result[order(result$avg_loan_amount,
             decreasing = TRUE), ]
  homeownership n_applicants avg_loan_amount
1      Mortgage         4778        18132.45
2           Own         1350        15665.44
3          Rent         3848        14396.44
challenges: need to introduce
tapply()
challenges: need to introduce the apply() functions and the array data structure
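To illustrate the `tapply()` route and why it raises these teaching challenges, here is a minimal base R sketch. It is not the code from the talk. `loans_demo` is a hypothetical data frame holding only the six rows printed in the tibble preview, so the numbers differ from the full-data results.

```r
# Hypothetical stand-in for `loans`: the six rows shown in the preview
loans_demo <- data.frame(
  loan_amount = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

# Each call passes a function as an argument and returns a
# one-dimensional array named by group, not a data frame
avgs <- tapply(loans_demo$loan_amount, loans_demo$homeownership, mean)
ns <- tapply(loans_demo$loan_amount, loans_demo$homeownership, length)

# The arrays have to be reassembled into a data frame by hand
result <- data.frame(
  homeownership = names(avgs),
  avg_loan_amount = as.vector(avgs),
  n_applicants = as.vector(ns)
)

# Square-bracket indexing is needed to sort in descending order
result <- result[order(result$avg_loan_amount, decreasing = TRUE), ]
print(result)
```

Compared with the `group_by()` and `summarize()` pipeline, a learner here has to handle functions as arguments, a new data structure, and manual reassembly before getting to a sorted table.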
boxplot()
levels <- sort(unique(loans$homeownership))
loans1 <- loans[loans$homeownership == levels[1], ]
loans2 <- loans[loans$homeownership == levels[2], ]
loans3 <- loans[loans$homeownership == levels[3], ]
par(mfrow = c(1, 3))
boxplot(loan_amount ~ application_type,
        data = loans1, main = levels[1])
boxplot(loan_amount ~ application_type,
        data = loans2, main = levels[2])
boxplot(loan_amount ~ application_type,
        data = loans3, main = levels[3])
Concept | Description |
---|---|
Consistency | Syntax, function interfaces, argument names, and orders follow patterns |
Mixability | Ability to use base R and other functions within the tidyverse |
Scalability | Unified approach to data wrangling and visualization works for datasets of a wide range of types and sizes |
User-centered design | Function interfaces designed and improved with users in mind |
Readability | Interfaces that are designed to produce readable code |
Community | Large, active, welcoming community of users and resources |
Transferability | Data manipulation verbs inherit from SQL’s query syntax |
Blog posts highlight updates, along with the reasoning behind them and worked examples
Lifecycle stages and badges
Start with library(tidyverse)
Teach by learning goals, not packages
STA 199: Introduction to Data Science
courses:
programs:
and more…
learn the tidyverse
teach the tidyverse