class: center, middle, inverse, title-slide # Tidy up your models ## bit.ly/tidymodels-uscots21 ###
Mine Çetinkaya-Rundel and Debbie Yuster ### USCOTS 2021 --- layout: true <div class="my-footer"> <span> <a href="https://bit.ly/tidymodels-uscots21" target="_blank">bit.ly/tidymodels-uscots21</a> </span> </div> --- class: middle, inverse ## .larger[.dark-blue[.hand[Welcome]]] --- class: middle <img src="images/logo.png" width="50%" style="display: block; margin: auto;" /> .center[ .large[ One link for all materials [bit.ly/tidymodels-uscots21](https://bit.ly/tidymodels-uscots21) ]] --- .center[ .pull-left[ <img src="images/mine.png" width="60%" style="display: block; margin: auto;" /> Mine Çetinkaya-Rundel <br> Duke University, RStudio ] .pull-right[ <img src="images/debbie.png" width="60%" style="display: block; margin: auto;" /> Debbie Yuster <br> Ramapo College ] ] .center[ <img src="images/julia.png" width="11%" style="display: block; margin: auto;" /> Julia Silge RStudio ] --- class: middle, inverse ## .larger[.dark-blue[.hand[Hello tidymodels]]] --- class: middle # High level view: tidymodels --- class: middle .pull-left[ <img src="images/hello-tidymodels.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ The **tidymodels** framework is a collection of packages for modeling and machine learning using tidyverse principles.
```r install.packages("tidymodels") ``` ] --- .small[ ```r library(tidymodels) ``` ``` ## Registered S3 method overwritten by 'tune': ## method from ## required_pkgs.model_spec parsnip ``` ``` ## ── Attaching packages ──────────────── tidymodels 0.1.3 ── ``` ``` ## ✔ broom 0.7.8 ✔ rsample 0.1.0 ## ✔ dials 0.0.9 ✔ tune 0.1.5 ## ✔ infer 0.5.4 ✔ workflows 0.2.2 ## ✔ modeldata 0.1.0 ✔ workflowsets 0.0.2 ## ✔ parsnip 0.1.6 ✔ yardstick 0.0.8 ## ✔ recipes 0.1.16 ``` ``` ## ── Conflicts ─────────────────── tidymodels_conflicts() ── ## x recipes::check() masks devtools::check() ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x purrr::is_null() masks testthat::is_null() ## x dplyr::lag() masks stats::lag() ## x rsample::matches() masks dplyr::matches(), tidyr::matches(), testthat::matches() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ## • Use tidymodels_prefer() to resolve common conflicts. ``` ] --- <img src="images/tidymodels.png" width="95%" style="display: block; margin: auto;" /> --- ## Plan 1. First, let's talk about **syntax** 2. Then, a little bit on **motivation** 3.
And finally **pedagogy** --- class: middle # Syntax --- ## Data: Paris Paintings - Source: Printed catalogs of 28 auction sales in Paris, 1764–1780 - 3,393 paintings, their prices, and descriptive details from sales catalogs - Over 60 variables .small[ ```r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) pp ``` ``` ## # A tibble: 3,393 x 61 ## name sale lot position dealer year origin_author ## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 L1764-2 L1764 2 0.0328 L 1764 F ## 2 L1764-3 L1764 3 0.0492 L 1764 I ## 3 L1764-4 L1764 4 0.0656 L 1764 X ## 4 L1764-5a L1764 5 0.0820 L 1764 F ## 5 L1764-5b L1764 5 0.0820 L 1764 F ## 6 L1764-6 L1764 6 0.0984 L 1764 X ## # … with 3,387 more rows, and 54 more variables: ## # origin_cat <chr>, school_pntg <chr>, ## # diff_origin <dbl>, logprice <dbl>, price <dbl>, ... ``` ] --- ## Observation: Depart pour la chasse .pull-left[ <img src="images/auction-catalogue.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/depart-pour-la-chasse.png" width="58%" style="display: block; margin: auto;" /> .small[ Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background.
] ] --- ## Observation: Depart pour la chasse ```r pp %>% filter(name == "R1777-89a") %>% glimpse() ``` .small[ .pull-left[ ``` ## Rows: 1 ## Columns: 61 ## $ name <chr> "R1777-89a" ## $ sale <chr> "R1777" ## $ lot <chr> "89" ## $ position <dbl> 0.3755274 ## $ dealer <chr> "R" ## $ year <dbl> 1777 ## $ origin_author <chr> "D/FL" ## $ origin_cat <chr> "D/FL" ## $ school_pntg <chr> "D/FL" ## $ diff_origin <dbl> 0 ## $ logprice <dbl> 8.575462 ## $ price <dbl> 5300 ## $ count <dbl> 1 ## $ subject <chr> "D\u008epart pour la chasse" ## $ authorstandard <chr> "Wouwerman, Philips" ## $ artistliving <dbl> 0 ## $ authorstyle <chr> NA ## $ author <chr> "Philippe Wouwermans" ## $ winningbidder <chr> "Langlier, Jacques for Poullaiβ¦ ## $ winningbiddertype <chr> "DC" ## $ endbuyer <chr> "C" ... ``` ] .pull-right[ ``` ... ## $ Interm <dbl> 1 ## $ type_intermed <chr> "D" ## $ Height_in <dbl> 17.25 ## $ Width_in <dbl> 23 ## $ Surface_Rect <dbl> 396.75 ## $ Diam_in <dbl> NA ## $ Surface_Rnd <dbl> NA ## $ Shape <chr> "squ_rect" ## $ Surface <dbl> 396.75 ## $ material <chr> "bois" ## $ mat <chr> "b" ## $ materialCat <chr> "wood" ## $ quantity <dbl> 1 ## $ nfigures <dbl> 0 ## $ engraved <dbl> 0 ## $ original <dbl> 0 ## $ prevcoll <dbl> 1 ## $ othartist <dbl> 0 ## $ paired <dbl> 1 ## $ figures <dbl> 0 ## $ finished <dbl> 0 ... ``` ] ] --- ## Linear regression, two ways - **Goal:** Predict height of painting (`Height_in`) from width (`Width_in`) - **Approach:** 1. Using base R 2. Using tidymodels --- ## Approach 1. Base R - Fit: ```r base_lm <- lm(Height_in ~ Width_in, data = pp) ``` -- - Summarize: ```r summary(base_lm) ``` ``` ... ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ... ``` --- ## Approach 2. 
Tidymodels - Fit: ```r tidy_lm <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` -- - Summarize: ```r tidy(tidy_lm) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` --- class: middle .large[ .hand[Let's rewind and redo, step-by-step, with tidymodels] ] --- ## Step 1: Specify model ```r linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ``` --- ## Step 2: Set model fitting *engine* ```r linear_reg() %>% set_engine("lm") # lm: linear model ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 3: Fit model & estimate parameters ... using **formula syntax** ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` ``` ## parsnip model object ## ## Fit time: 1ms ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` --- ## Step 4: Summarize model output ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` --- ## Linear regression, two ways ### Side-by-side comparison .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% summary() ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -86.714 -4.384 -2.422 3.169 85.084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 8.304 on 3133 degrees of freedom ## (258 observations deleted due to missingness) ## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6828 ## F-statistic: 6749 on 1 and 3133 DF, p-value: < 2.2e-16 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## A note on summarizing model output - Models fit with base R can be summarized with `tidy()` as well - So you don't have to use the full tidymodels pipeline to benefit from the *tidy* output .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Tidy regression output Achieved with functions from the `broom` package: - `tidy`: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*. - `glance`: Constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard errors that are computed once for the entire model*. - `augment`: Adds columns to the original data that was modeled. This includes predictions and residuals. 
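---

## Using all three together

The three `broom` verbs above can be applied to the same fitted model; a minimal sketch, reusing the `base_lm` fit from earlier:

```r
library(broom)

base_lm <- lm(Height_in ~ Width_in, data = pp)

tidy(base_lm)     # one row per term: estimate, std.error, statistic, p.value
glance(base_lm)   # one row per model: r.squared, adj.r.squared, sigma, AIC, ...
augment(base_lm)  # one row per observation: adds .fitted, .resid, etc.
```

Each call returns a tibble, so the output drops directly into `dplyr` or `ggplot2` pipelines.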
--- ## Comparison: `glance()` **They're the same!** .pull-left[ Base R fit + `glance()`: .xsmall[ ```r glance(base_lm) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 1 ## # β¦ with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` ```r glance(base_lm)$r.squared ``` ``` ## [1] 0.6829468 ``` ]] .pull-right[ Tidymodels fit + `glance()`: .xsmall[ ```r glance(tidy_lm) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 1 ## # β¦ with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` ```r glance(tidy_lm)$r.squared ``` ``` ## [1] 0.6829468 ``` ]] --- ## Comparison: `augment()` **They're not exactly the same!** .pull-left[ Base R fit + `augment()`: .small[ ```r base_lm_aug <- augment(base_lm) base_lm_aug ``` ``` ## # A tibble: 3,135 x 9 ## .rownames Height_in Width_in .fitted .resid .hat ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 10.3 0.000399 ## 2 2 18 14 14.6 3.45 0.000396 ## 3 3 13 16 16.1 -3.11 0.000361 ## 4 4 14 18 17.7 -3.68 0.000337 ## 5 5 14 18 17.7 -3.68 0.000337 ## 6 6 7 10 11.4 -4.43 0.000498 ## # β¦ with 3,129 more rows, and 3 more variables: ## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl> ``` ]] .pull-right[ Tidymodels fit + `augment()`: .small[ ```r tidy_lm_aug <- augment(tidy_lm$fit) tidy_lm_aug ``` ``` ## # A tibble: 3,135 x 9 ## .rownames Height_in Width_in .fitted .resid .hat ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 10.3 0.000399 ## 2 2 18 14 14.6 3.45 0.000396 ## 3 3 13 16 16.1 -3.11 0.000361 ## 4 4 14 18 17.7 -3.68 0.000337 ## 5 5 14 18 17.7 -3.68 0.000337 ## 6 6 7 10 11.4 -4.43 0.000498 ## # β¦ with 3,129 more rows, and 3 more variables: ## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl> 
``` ]] --- class: middle # Motivation --- ## Provide similar interfaces to models - Q: How do you define the number of trees when fitting a random forest model? -- - A: Depends on the package: `randomForest::randomForest()` uses `ntree`, `ranger::ranger()` uses `num.trees`, Spark's `sparklyr::ml_random_forest()` uses `num_trees`. -- - A, with tidymodels: ```r rf_mod <- rand_forest(trees = 2000) ``` --- ## Help users avoid pitfalls <br> > *"Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices made during the development and/or selection of the models. Whenever possible, [tidymodels] software, documentation, and other materials attempt to prevent these and other pitfalls."* [[Source](https://www.tmwr.org/index.html)] <br> -- - Tidymodels pipelines start with splitting data into training and testing sets, and facilitate keeping them separate - Tidymodels is opinionated about which functions can be applied to testing data --- class: middle # Pedagogy --- ## For simple `lm`, it might be hard to see the advantage .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% summary() ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -86.714 -4.384 -2.422 3.169 85.084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ## ## Residual standard error: 8.304 on 3133 degrees of freedom ## (258 observations deleted due to missingness) ## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6828 ## F-statistic: 6749 on 1 and 3133 DF, p-value: < 2.2e-16 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Though easier to see the advantage in the summary step .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Pedagogical advantages more clear as you move on to different models .pull-left[ ```r linear_reg() %>% set_engine("lm") %>% set_mode("regression") logistic_reg() %>% set_engine("glm") %>% set_mode("classification") rand_forest() %>% set_engine("ranger") %>% set_mode("regression") decision_tree() %>% set_engine("rpart") %>% set_mode("regression") ... 
``` ] -- .pull-right[ ```r svm_linear() %>% set_engine("LiblineaR") %>% set_mode("regression") ``` ``` ## Linear Support Vector Machine Specification (regression) ## ## Computational engine: LiblineaR ``` ```r svm_linear() %>% set_engine("LiblineaR") %>% set_mode("classification") ``` ``` ## Linear Support Vector Machine Specification (classification) ## ## Computational engine: LiblineaR ``` ] --- ## Build on dplyr pipelines to build recipes .pull-left[ ```r email %>% select(-from, -sent_email) %>% mutate( day_of_week = lubridate::wday(time), month = lubridate::month(time) ) %>% select(-time) %>% mutate( cc = cut(cc, breaks = c(0, 1)), attach = cut(attach, breaks = c(0, 1)), dollar = cut(dollar, breaks = c(0, 1)) ) %>% mutate( inherit = cut(inherit, breaks = c(0, 1, 5, 10, 20)), password = cut(password, breaks = c(0, 1, 5, 10, 20)) ) %>% ... ``` ] -- .pull-right[ ```r recipe(spam ~ ., data = email) %>% step_rm(from, sent_email) %>% step_date( time, features = c("dow", "month") ) %>% step_rm(time) %>% step_cut( cc, attach, dollar, breaks = c(0, 1) ) %>% step_cut( inherit, password, breaks = c(0, 1, 5, 10, 20) ) %>% ... ``` ] --- class: middle, inverse ## .larger[.dark-blue[.hand[Case studies]]] --- class: middle # Case study 1: Predicting book weights .hand[Keeping it simple to begin!] --- ## Data: Book weight and volume The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback (a.k.a. 
hardcover) .pull-left[ - Volume - cubic centimeters - Area - square centimeters - Weight - grams ] .pull-right[ .small[ ``` ## Loading required package: lattice ``` ``` ## # A tibble: 15 x 4 ## volume area weight cover ## <dbl> <dbl> <dbl> <fct> ## 1 885 382 800 hb ## 2 1016 468 950 hb ## 3 1125 387 1050 hb ## 4 239 371 350 hb ## 5 701 371 750 hb ## 6 641 367 600 hb ## 7 1228 396 1075 hb ## 8 412 0 250 pb ## 9 953 0 700 pb ## 10 929 0 650 pb ## 11 1492 0 975 pb ## 12 419 0 350 pb ## 13 1010 0 950 pb ## 14 595 0 425 pb ## 15 1034 0 725 pb ``` ] ] .footnote[ .small[ These books are from the bookshelf of J. H. Maindonald at Australian National University. ] ] --- ## Book weight vs. volume <img src="tidy-up-models_files/figure-html/unnamed-chunk-34-1.png" width="75%" style="display: block; margin: auto;" /> --- ## Book weight vs. volume and cover <img src="tidy-up-models_files/figure-html/unnamed-chunk-35-1.png" width="75%" style="display: block; margin: auto;" /> --- class: middle .your-turn[ **Your turn:** Let's fit a simple multiple linear regression model! - Go to RStudio Cloud: [bit.ly/tidymodels-cloud](https://bit.ly/tidymodels-cloud) - Join the space and click on **Projects** on top - **Start** the project titled **Books** - Open the R Markdown document called `books.Rmd` and follow the instructions ]
10:00
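---

## Case study 1: a possible fit

For reference after the activity, a minimal sketch of the kind of model `books.Rmd` builds, using the same `linear_reg()` syntax as before (assumes the `allbacks` data shown earlier is available, e.g. via the DAAG package):

```r
library(tidymodels)
library(DAAG)  # assumed source of the `allbacks` data

books_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks)

tidy(books_fit)    # coefficient estimates for volume and cover
glance(books_fit)  # model-level fit statistics
```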
--- class: middle # Case study 2: Predicting spam .hand[Leveraging (much of) the tidymodels framework to continue...] --- ## Spam filters .pull-left-narrow[ - Data from 3921 emails and 21 variables on them - Outcome: whether the email is spam or not - Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc. ] .pull-right-wide[ .small[ ```r library(openintro) glimpse(email) ``` ``` ## Rows: 3,921 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,β¦ ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦ ## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,β¦ ## $ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,β¦ ## $ time <dttm> 2012-01-01 01:16:41, 2012-01-01 02β¦ ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,β¦ ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,β¦ ## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ winner <fct> no, no, no, no, no, no, no, no, no,β¦ ## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0,β¦ ## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.23β¦ ## $ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 23β¦ ## $ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,β¦ ## $ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,β¦ ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, β¦ ## $ number <fct> big, small, small, small, none, nonβ¦ ``` ] ] --- class: middle .large[ .hand[first we warm the students up to the data...] ] --- .question[ Would you expect longer or shorter emails to be spam? 
] -- .pull-left[ ``` ## Picking joint bandwidth of 1.18 ``` <img src="tidy-up-models_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 2 x 2 ## spam mean_num_char ## <fct> <dbl> ## 1 0 11.3 ## 2 1 5.44 ``` ] --- .question[ Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not? ] -- <img src="tidy-up-models_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .large[ .hand[then we introduce logistic regression and generalized linear models (GLMs) in general, build a few logistic regression models with one or two predictors, interpret results...] ] --- class: middle .your-turn[ **Demo:** Let's fit a logistic regression model! - Go to RStudio Cloud: [bit.ly/tidymodels-cloud](https://bit.ly/tidymodels-cloud) - Join the space (if you haven't yet done so) and click on **Projects** on top - **Start** the project titled **Spam** - Open the R Markdown document called `spam.Rmd` and follow along! 
] --- ## Predicting spam .panelset[ .panel[.panel-name[Output] .small[ ``` ## # A tibble: 22 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -9.09e+1 9.80e+3 -0.00928 9.93e- 1 ## 2 to_multiple1 -2.68e+0 3.27e-1 -8.21 2.25e-16 ## 3 from1 -2.19e+1 9.80e+3 -0.00224 9.98e- 1 ## 4 cc 1.88e-2 2.20e-2 0.855 3.93e- 1 ## 5 sent_email1 -2.07e+1 3.87e+2 -0.0536 9.57e- 1 ## 6 time 8.48e-8 2.85e-8 2.98 2.92e- 3 ## 7 image -1.78e+0 5.95e-1 -3.00 2.73e- 3 ## 8 attach 7.35e-1 1.44e-1 5.09 3.61e- 7 ## 9 dollar -6.85e-2 2.64e-2 -2.59 9.64e- 3 ## 10 winneryes 2.07e+0 3.65e-1 5.67 1.41e- 8 ## 11 inherit 3.15e-1 1.56e-1 2.02 4.32e- 2 ## 12 viagra 2.84e+0 2.22e+3 0.00128 9.99e- 1 ## 13 password -8.54e-1 2.97e-1 -2.88 4.03e- 3 ## 14 num_char 5.06e-2 2.38e-2 2.13 3.35e- 2 ## 15 line_breaks -5.49e-3 1.35e-3 -4.06 4.91e- 5 ## 16 format1 -6.14e-1 1.49e-1 -4.14 3.53e- 5 ## 17 re_subj1 -1.64e+0 3.86e-1 -4.25 2.16e- 5 ## 18 exclaim_subj 1.42e-1 2.43e-1 0.585 5.58e- 1 ## 19 urgent_subj1 3.88e+0 1.32e+0 2.95 3.18e- 3 ## 20 exclaim_mess 1.08e-2 1.81e-3 5.98 2.23e- 9 ## 21 numbersmall -1.19e+0 1.54e-1 -7.74 9.62e-15 ## 22 numberbig -2.95e-1 2.20e-1 -1.34 1.79e- 1 ``` ] ] .panel[.panel-name[Code] ```r logistic_reg() %>% set_engine("glm") %>% fit(spam ~ ., data = email, family = "binomial") %>% tidy() %>% print(n = 22) ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 ## occurred ``` ] ] --- ## Prediction - The mechanics of prediction is **easy**: - Plug in values of predictors to the model equation - Calculate the predicted value of the response variable, `\(\hat{y}\)` -- - Getting it right is **hard**! - There is no guarantee the model estimates you have are correct - Or that your model will perform as well with new data as it did with your sample data --- ## Spending our data - Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc. 
- Doing all of this on the entire data we have available can lead to **overfitting** - Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we've done so far) --- ## Splitting data - **Training set:** - Sandbox for model building - Spend most of your time using the training set to develop the model - Majority of the data (usually 80%) - **Testing set:** - Held in reserve to determine efficacy of one or two chosen models - Critical to look at it only once; otherwise it becomes part of the modeling process - Remainder of the data (usually 20%) --- ## Performing the split ```r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(1116) # Put 80% of the data into the training set email_split <- initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data <- training(email_split) test_data <- testing(email_split) ``` --- ## Peek at the split .small[ .pull-left[ ```r glimpse(train_data) ``` ``` ## Rows: 3,136 ## Columns: 21 ## $ spam <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,… ## $ to_multiple <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,… ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ cc <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ sent_email <fct> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ time <dttm> 2012-01-25 17:46:55, 2012-01-03 00… ## $ image <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ attach <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ dollar <dbl> 10, 0, 0, 0, 0, 0, 13, 0, 0, 0, 2, … ## $ winner <fct> no, no, no, no, no, no, no, yes, no… ## $ inherit <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ num_char <dbl> 23.308, 1.162, 4.732, 42.238, 1.228… ## $ line_breaks <int> 477, 2, 127, 712, 30, 674, 367, 226… ## $ format <fct> 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 1,β¦ ## $ re_subj <fct> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 12, 0, 2, 2, 2, 31, 2, 0, 0, 1, 0, β¦ ## $ number <fct> small, none, big, big, small, smallβ¦ ``` ] .pull-right[ ```r glimpse(test_data) ``` ``` ## Rows: 785 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ to_multiple <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,β¦ ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦ ## $ cc <int> 0, 1, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ sent_email <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ time <dttm> 2012-01-01 12:55:06, 2012-01-01 14β¦ ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ dollar <dbl> 0, 0, 5, 0, 0, 0, 0, 5, 4, 0, 0, 0,β¦ ## $ winner <fct> no, no, no, no, no, no, no, no, no,β¦ ## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,β¦ ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ password <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,β¦ ## $ num_char <dbl> 4.837, 15.075, 18.037, 45.842, 11.4β¦ ## $ line_breaks <int> 193, 354, 345, 881, 125, 24, 296, 1β¦ ## $ format <fct> 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0,β¦ ## $ re_subj <fct> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,β¦ ## $ exclaim_subj <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 1, 10, 20, 5, 2, 0, 0, 0, 6, 0, 0, β¦ ## $ number <fct> big, small, small, big, small, noneβ¦ ``` ] ] --- ## Feature engineering - We prefer simple models when possible, but **parsimony** does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity -- - Variables that go into the model and how they are represented are just as critical to success of the model -- - **Feature engineering** allows us to get creative with our predictors in an effort to make 
them more useful for our model (to increase its predictive performance) --- ## A simple approach: `mutate()` ```r library(lubridate) train_data %>% mutate( date = date(time), dow = wday(time), month = month(time) ) %>% select(time, date, dow, month) %>% sample_n(size = 5) # shuffle to show a variety ``` ``` ## # A tibble: 5 x 4 ## time date dow month ## <dttm> <date> <dbl> <dbl> ## 1 2012-03-15 14:51:35 2012-03-15 5 3 ## 2 2012-03-03 09:24:02 2012-03-03 7 3 ## 3 2012-01-18 11:55:23 2012-01-18 4 1 ## 4 2012-02-24 23:08:59 2012-02-24 6 2 ## 5 2012-01-11 08:18:51 2012-01-11 4 1 ``` --- ## Modeling workflow, revisited - Create a **recipe** for feature engineering steps to be applied to the training data -- - Fit the model to the training data after these steps have been applied -- - Using the model estimates from the training data, predict outcomes for the test data -- - Evaluate the performance of the model on the test data --- ## You gotta keep them separated! - We might not want to use the data we have as is - We might want to "engineer features", omit certain variables, etc. - And we need a way of doing this in a way that can be applied to both training and testing data, ideally without making us do extra work - Enter **recipes** to help with this! 
--- ## Initiate a recipe ```r email_rec <- recipe( spam ~ ., # formula data = train_data # data to use for cataloguing names and types of variables ) summary(email_rec) ``` .xsmall[ ``` ## # A tibble: 21 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 to_multiple nominal predictor original ## 2 from nominal predictor original ## 3 cc numeric predictor original ## 4 sent_email nominal predictor original ## 5 time date predictor original ## 6 image numeric predictor original ## 7 attach numeric predictor original ## 8 dollar numeric predictor original ## 9 winner nominal predictor original ## 10 inherit numeric predictor original ## 11 viagra numeric predictor original ## 12 password numeric predictor original ## 13 num_char numeric predictor original ## 14 line_breaks numeric predictor original ## 15 format nominal predictor original ## 16 re_subj nominal predictor original ## 17 exclaim_subj numeric predictor original ## 18 urgent_subj nominal predictor original ## 19 exclaim_mess numeric predictor original ## 20 number nominal predictor original ## 21 spam nominal outcome original ``` ] --- ## Remove certain variables ```r email_rec <- email_rec %>% step_rm(from, sent_email) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ``` ] --- ## Feature engineer date ```r email_rec <- email_rec %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ``` ] --- ## Discretize numeric variables ```r email_rec <- email_rec %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete 
terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ``` ] --- ## Create dummy variables ```r email_rec <- email_rec %>% step_dummy(all_nominal(), -all_outcomes()) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ``` ] --- ## Remove zero variance variables Variables that contain only a single value ```r email_rec <- email_rec %>% step_zv(all_predictors()) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ## Zero variance filter on all_predictors() ``` ] --- ## All in one place ```r email_rec <- recipe(spam ~ ., data = email) %>% step_rm(from, sent_email) %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors()) ``` --- ## What's next? - We have our data split into testing and training - We know the recipe steps we want to apply to our testing and training data - We now need to define the model to fit and collect model results - Lots of bookkeeping is needed to make sure we're not training our model on testing data - Enter **workflows** to help with this! 
---

## Define model

```r
email_mod <- logistic_reg() %>%
  set_engine("glm")

email_mod
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## Define workflow

**Workflows** bring together models and recipes so that they can be easily applied to both the training and test data.

```r
email_wflow <- workflow() %>%
  add_model(email_mod) %>%
  add_recipe(email_rec)
```

.small[
```
## ── Workflow ──────────────────────────────────────────────
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ──────────────────────────────────────────
## 7 Recipe Steps
## 
## • step_rm()
## • step_date()
## • step_rm()
## • step_cut()
## • step_cut()
## • step_dummy()
## • step_zv()
## 
## ── Model ─────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```
]

---

## Fit model to training data

```r
email_fit <- email_wflow %>%
  fit(data = train_data)
```

---

.small[
```r
tidy(email_fit) %>%
  print(n = 31)
```

```
## # A tibble: 31 x 5
##    term               estimate std.error statistic  p.value
##    <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)        -0.892    2.49e-1    -3.58   3.37e- 4
##  2 image              -1.65     9.34e-1    -1.76   7.77e- 2
##  3 viagra              2.28     1.82e+2     0.0125 9.90e- 1
##  4 num_char            0.0470   2.44e-2     1.93   5.36e- 2
##  5 line_breaks        -0.00510  1.38e-3    -3.69   2.28e- 4
##  6 exclaim_subj       -0.204    2.77e-1    -0.736  4.62e- 1
##  7 exclaim_mess        0.00885  1.86e-3     4.75   1.99e- 6
##  8 to_multiple_X1     -2.60     3.54e-1    -7.35   2.06e-13
##  9 cc_X.1.68.         -0.312    4.90e-1    -0.638  5.24e- 1
## 10 attach_X.1.21.      2.05     3.68e-1     5.58   2.45e- 8
## 11 dollar_X.1.64.      0.214    2.17e-1     0.988  3.23e- 1
## 12 winner_yes          2.18     4.28e-1     5.08   3.68e- 7
## 13 inherit_X.1.5.     -9.21     7.65e+2    -0.0120 9.90e- 1
## 14 inherit_X.5.10.     2.51     1.44e+0     1.74   8.12e- 2
## 15 password_X.1.5.    -1.71     7.48e-1    -2.28   2.24e- 2
## 16 password_X.5.10.  -12.5      4.75e+2    -0.0263 9.79e- 1
## 17 password_X.10.2…  -13.7      8.14e+2    -0.0168 9.87e- 1
## 18 password_X.20.2…  -13.9      1.03e+3    -0.0135 9.89e- 1
## 19 format_X1          -0.916    1.59e-1    -5.77   7.79e- 9
## 20 re_subj_X1         -2.90     4.37e-1    -6.65   2.95e-11
## 21 urgent_subj_X1      3.52     1.08e+0     3.25   1.15e- 3
## 22 number_small       -0.895    1.67e-1    -5.35   8.75e- 8
## 23 number_big         -0.199    2.50e-1    -0.797  4.25e- 1
## 24 time_dow_Mon        0.0441   2.96e-1     0.149  8.82e- 1
## 25 time_dow_Tue        0.371    2.67e-1     1.39   1.64e- 1
## 26 time_dow_Wed       -0.133    2.72e-1    -0.488  6.26e- 1
## 27 time_dow_Thu        0.0392   2.77e-1     0.141  8.88e- 1
## 28 time_dow_Fri        0.0488   2.80e-1     0.174  8.62e- 1
## 29 time_dow_Sat        0.253    2.98e-1     0.849  3.96e- 1
## 30 time_month_Feb      0.784    1.80e-1     4.35   1.37e- 5
## 31 time_month_Mar      0.541    1.81e-1     2.99   2.79e- 3
```
]

---

## Make predictions for test data

```r
email_pred <- predict(email_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

email_pred
```

```
## # A tibble: 785 x 23
##   .pred_0 .pred_1 spam  to_multiple from     cc sent_email
##     <dbl>   <dbl> <fct> <fct>       <fct> <int> <fct>     
## 1   0.995 4.70e-3 0     1           1         0 1         
## 2   0.999 1.34e-3 0     0           1         1 1         
## 3   0.967 3.28e-2 0     0           1         0 0         
## 4   0.999 7.76e-4 0     0           1         1 0         
## 5   0.994 6.42e-3 0     0           1         4 0         
## 6   0.860 1.40e-1 0     0           1         0 0         
## # … with 779 more rows, and 16 more variables:
## #   time <dttm>, image <dbl>, attach <dbl>, dollar <dbl>,
## #   winner <fct>, inherit <dbl>, viagra <dbl>,
## #   password <dbl>, num_char <dbl>, line_breaks <int>,
## #   format <fct>, re_subj <fct>, exclaim_subj <dbl>,
## #   urgent_subj <fct>, exclaim_mess <dbl>, number <fct>
```

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_curve(
    truth = spam,
    .pred_1,
    event_level = "second"
  ) %>%
  autoplot()
```
]
.pull-right[
<img src="tidy-up-models_files/figure-html/unnamed-chunk-64-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.857
```
]
.pull-right[
<img src="tidy-up-models_files/figure-html/unnamed-chunk-66-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle

.large[
.hand[and we could keep going with performing cross validation, and much more...]
]

---
class: middle, inverse

## .larger[.dark-blue[.hand[Wrap up]]]

---
class: middle

# What about inference?

---

## Suppose...

Estimate the difference between the average evaluation score of male and female faculty.

.midi[
```r
library(openintro)
evals %>%
  select(score, rank, ethnicity, gender, bty_avg)
```

```
## # A tibble: 463 x 5
##   score rank         ethnicity    gender bty_avg
##   <dbl> <fct>        <fct>        <fct>    <dbl>
## 1   4.7 tenure track minority     female       5
## 2   4.1 tenure track minority     female       5
## 3   3.9 tenure track minority     female       5
## 4   4.8 tenure track minority     female       5
## 5   4.6 tenured      not minority male         3
## 6   4.3 tenured      not minority male         3
## # … with 457 more rows
```
]

---

.pull-left[
.midi[
```r
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  summarise(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )
```

```
## # A tibble: 1 x 2
##        l     u
##    <dbl> <dbl>
## 1 0.0445 0.243
```
]
]
.pull-right[
.midi[
```r
t.test(evals$score ~ evals$gender)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  evals$score by evals$gender
## t = -2.7507, df = 398.7, p-value = 0.006218
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -0.24264375 -0.04037194
## sample estimates:
## mean in group female   mean in group male 
##             4.092821             4.234328
```
]
]

---

## infer `\(\in\)` tidymodels

.pull-left[
The objective of **infer** is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
]
.pull-right[
<img src="images/infer.png" title="Hex sticker for the infer package." alt="Hex sticker for the infer package." width="60%" style="display: block; margin: auto;" />
]

---

.midi[
```r
evals %>%
  specify(score ~ gender)
```

```
## Response: score (numeric)
## Explanatory: gender (factor)
## # A tibble: 463 x 2
##   score gender
##   <dbl> <fct> 
## 1   4.7 female
## 2   4.1 female
## 3   3.9 female
## 4   4.8 female
## 5   4.6 male  
## 6   4.3 male  
## # … with 457 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap")
```

```
## Response: score (numeric)
## Explanatory: gender (factor)
## # A tibble: 463,000 x 3
## # Groups:   replicate [1,000]
##   replicate score gender
##       <int> <dbl> <fct> 
## 1         1   4   female
## 2         1   3.1 male  
## 3         1   5   male  
## 4         1   4.4 male  
## 5         1   3.5 female
## 6         1   4.5 female
## # … with 462,994 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female"))
```

```
## # A tibble: 1,000 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1 0.230
## 2         2 0.134
## 3         3 0.100
## 4         4 0.230
## 5         5 0.128
## 6         6 0.201
## # … with 994 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  visualise()
```

<img src="tidy-up-models_files/figure-html/infer-4-1.png" width="60%" style="display: block; margin: auto;" />
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  summarise(l = quantile(stat, 0.025), u = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##        l     u
##    <dbl> <dbl>
## 1 0.0407 0.236
```
]

---

## Learning resources

.large[
- **Get started** with tidymodels: [tidymodels.org/start](https://www.tidymodels.org/start/)
- **Learn** more and go further: [tidymodels.org/learn](https://www.tidymodels.org/learn/)
- **Tidy Modeling with R** by Max Kuhn & Julia Silge: [tmwr.org](https://www.tmwr.org/)
]

---

## Teaching resources

**Data Science in a Box** ([datasciencebox.org](https://datasciencebox.org/)) contains slides, application exercises, computing labs, and homework assignments on modelling and inference with tidymodels:

.midi[
- Slides and application exercises:
  - [Modelling data](https://datasciencebox.org/making-rigorous-conclusions.html#modelling-data)
  - [Classification and model building](https://datasciencebox.org/making-rigorous-conclusions.html#classification-and-model-building)
  - [Model validation](https://datasciencebox.org/making-rigorous-conclusions.html#model-validation)
  - [Uncertainty quantification](https://datasciencebox.org/making-rigorous-conclusions.html#uncertainty-quantification)
- [Labs](https://datasciencebox.org/making-rigorous-conclusions.html#labs-2):
  - Grading the professor: Fitting and interpreting simple linear regression models
  - Smoking while pregnant: Constructing confidence intervals, conducting hypothesis tests, and interpreting results in context of the data
- [Homework assignments](https://datasciencebox.org/making-rigorous-conclusions.html#homework-assignments-1):
  - Bike rentals in DC: Exploratory data analysis and fitting and interpreting models
  - Exploring the GSS: Fitting and interpreting models
  - Modelling the GSS: Model validation and inference
]

---
class: center, middle

.huge[
.hand[thank you!]
]

.large[
[bit.ly/tidymodels-uscots21](https://bit.ly/tidymodels-uscots21)
]
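
---

## Appendix: a cross validation sketch

The slides note that we could keep going with cross validation. A minimal sketch of how that might look, assuming the `email_wflow` and `train_data` objects from earlier and using **rsample** and **tune** (not shown in the workshop itself):

```r
set.seed(345)

# Split the training data into 5 cross validation folds
folds <- vfold_cv(train_data, v = 5)

# Fit the workflow to each fold's analysis set and
# assess on each fold's assessment set
email_fit_rs <- email_wflow %>%
  fit_resamples(folds)

# Average performance metrics across the folds
collect_metrics(email_fit_rs)
```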