class: center, middle, inverse, title-slide # Tidy up your models ## bit.ly/tidymodels-uscots21 ###
Mine Çetinkaya-Rundel and Debbie Yuster ### USCOTS 2021 --- layout: true <div class="my-footer"> <span> <a href="https://bit.ly/tidymodels-uscots21" target="_blank">bit.ly/tidymodels-uscots21</a> </span> </div> --- class: middle, inverse ## .larger[.dark-blue[.hand[Welcome]]] --- class: middle <img src="images/logo.png" width="50%" style="display: block; margin: auto;" /> .center[ .large[ One link for all materials [bit.ly/tidymodels-uscots21](https://bit.ly/tidymodels-uscots21) ]] --- .center[ .pull-left[ <img src="images/mine.png" width="60%" style="display: block; margin: auto;" /> Mine Çetinkaya-Rundel <br> Duke University, RStudio ] .pull-right[ <img src="images/debbie.png" width="60%" style="display: block; margin: auto;" /> Debbie Yuster <br> Ramapo College ] ] .center[ <img src="images/julia.png" width="11%" style="display: block; margin: auto;" /> Julia Silge RStudio ] --- class: middle, inverse ## .larger[.dark-blue[.hand[Hello tidymodels]]] --- class: middle # High level view: tidymodels --- class: middle .pull-left[ <img src="images/hello-tidymodels.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ The **tidymodels** framework is a collection of packages for modeling and machine learning using tidyverse principles.
```r install.packages("tidymodels") ``` ] --- .small[ ```r library(tidymodels) ``` ``` ## Registered S3 method overwritten by 'tune': ## method from ## required_pkgs.model_spec parsnip ``` ``` ## ── Attaching packages ──────────────── tidymodels 0.1.3 ── ``` ``` ## ✔ broom 0.7.8 ✔ rsample 0.1.0 ## ✔ dials 0.0.9 ✔ tune 0.1.5 ## ✔ infer 0.5.4 ✔ workflows 0.2.2 ## ✔ modeldata 0.1.0 ✔ workflowsets 0.0.2 ## ✔ parsnip 0.1.6 ✔ yardstick 0.0.8 ## ✔ recipes 0.1.16 ``` ``` ## ── Conflicts ─────────────────── tidymodels_conflicts() ── ## x recipes::check() masks devtools::check() ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x purrr::is_null() masks testthat::is_null() ## x dplyr::lag() masks stats::lag() ## x rsample::matches() masks dplyr::matches(), tidyr::matches(), testthat::matches() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ## • Use tidymodels_prefer() to resolve common conflicts. ``` ] --- <img src="images/tidymodels.png" width="95%" style="display: block; margin: auto;" /> --- ## Plan 1. First, let's talk about **syntax** 2. Then, a little bit on **motivation** 3.
And finally **pedagogy** --- class: middle # Syntax --- ## Data: Paris Paintings - Source: Printed catalogs of 28 auction sales in Paris, 1764–1780 - 3,393 paintings, their prices, and descriptive details from sales catalogs - Over 60 variables .small[ ```r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) pp ``` ``` ## # A tibble: 3,393 x 61 ## name sale lot position dealer year origin_author ## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 L1764-2 L1764 2 0.0328 L 1764 F ## 2 L1764-3 L1764 3 0.0492 L 1764 I ## 3 L1764-4 L1764 4 0.0656 L 1764 X ## 4 L1764-5a L1764 5 0.0820 L 1764 F ## 5 L1764-5b L1764 5 0.0820 L 1764 F ## 6 L1764-6 L1764 6 0.0984 L 1764 X ## # … with 3,387 more rows, and 54 more variables: ## # origin_cat <chr>, school_pntg <chr>, ## # diff_origin <dbl>, logprice <dbl>, price <dbl>, ... ``` ] --- ## Observation: Depart pour la chasse .pull-left[ <img src="images/auction-catalogue.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/depart-pour-la-chasse.png" width="58%" style="display: block; margin: auto;" /> .small[ Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background.
] ] --- ## Observation: Depart pour la chasse ```r pp %>% filter(name == "R1777-89a") %>% glimpse() ``` .small[ .pull-left[ ``` ## Rows: 1 ## Columns: 61 ## $ name <chr> "R1777-89a" ## $ sale <chr> "R1777" ## $ lot <chr> "89" ## $ position <dbl> 0.3755274 ## $ dealer <chr> "R" ## $ year <dbl> 1777 ## $ origin_author <chr> "D/FL" ## $ origin_cat <chr> "D/FL" ## $ school_pntg <chr> "D/FL" ## $ diff_origin <dbl> 0 ## $ logprice <dbl> 8.575462 ## $ price <dbl> 5300 ## $ count <dbl> 1 ## $ subject <chr> "D\u008epart pour la chasse" ## $ authorstandard <chr> "Wouwerman, Philips" ## $ artistliving <dbl> 0 ## $ authorstyle <chr> NA ## $ author <chr> "Philippe Wouwermans" ## $ winningbidder <chr> "Langlier, Jacques for Poullaiβ¦ ## $ winningbiddertype <chr> "DC" ## $ endbuyer <chr> "C" ... ``` ] .pull-right[ ``` ... ## $ Interm <dbl> 1 ## $ type_intermed <chr> "D" ## $ Height_in <dbl> 17.25 ## $ Width_in <dbl> 23 ## $ Surface_Rect <dbl> 396.75 ## $ Diam_in <dbl> NA ## $ Surface_Rnd <dbl> NA ## $ Shape <chr> "squ_rect" ## $ Surface <dbl> 396.75 ## $ material <chr> "bois" ## $ mat <chr> "b" ## $ materialCat <chr> "wood" ## $ quantity <dbl> 1 ## $ nfigures <dbl> 0 ## $ engraved <dbl> 0 ## $ original <dbl> 0 ## $ prevcoll <dbl> 1 ## $ othartist <dbl> 0 ## $ paired <dbl> 1 ## $ figures <dbl> 0 ## $ finished <dbl> 0 ... ``` ] ] --- ## Linear regression, two ways - **Goal:** Predict height of painting (`Height_in`) from width (`Width_in`) - **Approach:** 1. Using base R 2. Using tidymodels --- ## Approach 1. Base R - Fit: ```r base_lm <- lm(Height_in ~ Width_in, data = pp) ``` -- - Summarize: ```r summary(base_lm) ``` ``` ... ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ... ``` --- ## Approach 2. 
Tidymodels - Fit: ```r tidy_lm <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` -- - Summarize: ```r tidy(tidy_lm) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` --- class: middle .large[ .hand[Let's rewind and redo, step-by-step, with tidymodels] ] --- ## Step 1: Specify model ```r linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ``` --- ## Step 2: Set model fitting *engine* ```r linear_reg() %>% set_engine("lm") # lm: linear model ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 3: Fit model & estimate parameters ... using **formula syntax** ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` ``` ## parsnip model object ## ## Fit time: 1ms ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` --- ## Step 4: Summarize model output ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` --- ## Linear regression, two ways ### Side-by-side comparison .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% summary() ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -86.714 -4.384 -2.422 3.169 85.084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 8.304 on 3133 degrees of freedom ## (258 observations deleted due to missingness) ## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6828 ## F-statistic: 6749 on 1 and 3133 DF, p-value: < 2.2e-16 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## A note on summarizing model output - Models fit with base R can be summarized with `tidy()` as well - So you don't have to use the full tidymodels pipeline to benefit from the *tidy* output .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Tidy regression output Achieved with functions from the `broom` package: - `tidy`: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*. - `glance`: Constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard errors that are computed once for the entire model*. - `augment`: Adds columns to the original data that was modeled. This includes predictions and residuals. 
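---

## Using all three together

The three `broom` verbs above can be applied to the same fitted model; a minimal sketch, reusing the `base_lm` fit from earlier:

```r
library(broom)

base_lm <- lm(Height_in ~ Width_in, data = pp)

tidy(base_lm)     # one row per term: estimate, std.error, statistic, p.value
glance(base_lm)   # one row per model: r.squared, adj.r.squared, sigma, AIC, ...
augment(base_lm)  # one row per observation: adds .fitted, .resid, etc.
```

Each call returns a tibble, so the output drops directly into `dplyr` or `ggplot2` pipelines.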
--- ## Comparison: `glance()` **They're the same!** .pull-left[ Base R fit + `glance()`: .xsmall[ ```r glance(base_lm) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 1 ## # β¦ with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` ```r glance(base_lm)$r.squared ``` ``` ## [1] 0.6829468 ``` ]] .pull-right[ Tidymodels fit + `glance()`: .xsmall[ ```r glance(tidy_lm) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 1 ## # β¦ with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` ```r glance(tidy_lm)$r.squared ``` ``` ## [1] 0.6829468 ``` ]] --- ## Comparison: `augment()` **They're not exactly the same!** .pull-left[ Base R fit + `augment()`: .small[ ```r base_lm_aug <- augment(base_lm) base_lm_aug ``` ``` ## # A tibble: 3,135 x 9 ## .rownames Height_in Width_in .fitted .resid .hat ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 10.3 0.000399 ## 2 2 18 14 14.6 3.45 0.000396 ## 3 3 13 16 16.1 -3.11 0.000361 ## 4 4 14 18 17.7 -3.68 0.000337 ## 5 5 14 18 17.7 -3.68 0.000337 ## 6 6 7 10 11.4 -4.43 0.000498 ## # β¦ with 3,129 more rows, and 3 more variables: ## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl> ``` ]] .pull-right[ Tidymodels fit + `augment()`: .small[ ```r tidy_lm_aug <- augment(tidy_lm$fit) tidy_lm_aug ``` ``` ## # A tibble: 3,135 x 9 ## .rownames Height_in Width_in .fitted .resid .hat ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 10.3 0.000399 ## 2 2 18 14 14.6 3.45 0.000396 ## 3 3 13 16 16.1 -3.11 0.000361 ## 4 4 14 18 17.7 -3.68 0.000337 ## 5 5 14 18 17.7 -3.68 0.000337 ## 6 6 7 10 11.4 -4.43 0.000498 ## # β¦ with 3,129 more rows, and 3 more variables: ## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl> 
``` ]] --- class: middle # Motivation --- ## Provide similar interfaces to models - Q: How do you define the number of trees when fitting a random forest model? -- - A: Depends on the package: `randomForest::randomForest()` uses `ntree`, `ranger::ranger()` uses `num.trees`, Spark's `sparklyr::ml_random_forest()` uses `num_trees`. -- - A, with tidymodels: ```r rf_mod <- rand_forest(trees = 2000) ``` --- ## Help users avoid pitfalls <br> > *"Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices made during the development and/or selection of the models. Whenever possible, [tidymodels] software, documentation, and other materials attempt to prevent these and other pitfalls."* [[Source](https://www.tmwr.org/index.html)] <br> -- - Tidymodels pipelines start with splitting data into training and testing sets, and facilitate keeping them separate - Tidymodels is opinionated about which functions can be applied to testing data --- class: middle # Pedagogy --- ## For simple `lm`, it might be hard to see the advantage .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% summary() ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -86.714 -4.384 -2.422 3.169 85.084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 *** ## Width_in 0.780796 0.009505 82.15 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ## ## Residual standard error: 8.304 on 3133 degrees of freedom ## (258 observations deleted due to missingness) ## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6828 ## F-statistic: 6749 on 1 and 3133 DF, p-value: < 2.2e-16 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Though easier to see the advantage in the summary step .pull-left[ Base R: .small[ ```r lm(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] .pull-right[ Tidymodels: .small[ ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` ]] --- ## Pedagogical advantages more clear as you move on to different models .pull-left[ ```r linear_reg() %>% set_engine("lm") %>% set_mode("regression") logistic_reg() %>% set_engine("glm") %>% set_mode("classification") rand_forest() %>% set_engine("ranger") %>% set_mode("regression") decision_tree() %>% set_engine("rpart") %>% set_mode("regression") ... 
``` ] -- .pull-right[ ```r svm_linear() %>% set_engine("LiblineaR") %>% set_mode("regression") ``` ``` ## Linear Support Vector Machine Specification (regression) ## ## Computational engine: LiblineaR ``` ```r svm_linear() %>% set_engine("LiblineaR") %>% set_mode("classification") ``` ``` ## Linear Support Vector Machine Specification (classification) ## ## Computational engine: LiblineaR ``` ] --- ## Build on dplyr pipelines to build recipes .pull-left[ ```r email %>% select(-from, -sent_email) %>% mutate( day_of_week = lubridate::wday(time), month = lubridate::month(time) ) %>% select(-time) %>% mutate( cc = cut(cc, breaks = c(0, 1)), attach = cut(attach, breaks = c(0, 1)), dollar = cut(dollar, breaks = c(0, 1)) ) %>% mutate( inherit = cut(inherit, breaks = c(0, 1, 5, 10, 20)), password = cut(password, breaks = c(0, 1, 5, 10, 20)) ) %>% ... ``` ] -- .pull-right[ ```r recipe(spam ~ ., data = email) %>% step_rm(from, sent_email) %>% step_date( time, features = c("dow", "month") ) %>% step_rm(time) %>% step_cut( cc, attach, dollar, breaks = c(0, 1) ) %>% step_cut( inherit, password, breaks = c(0, 1, 5, 10, 20) ) %>% ... ``` ] --- class: middle, inverse ## .larger[.dark-blue[.hand[Case studies]]] --- class: middle # Case study 1: Predicting book weights .hand[Keeping it simple to begin!] --- ## Data: Book weight and volume The `allbacks` data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback (a.k.a. 
hardcover) .pull-left[ - Volume - cubic centimeters - Area - square centimeters - Weight - grams ] .pull-right[ .small[ ``` ## Loading required package: lattice ``` ``` ## # A tibble: 15 x 4 ## volume area weight cover ## <dbl> <dbl> <dbl> <fct> ## 1 885 382 800 hb ## 2 1016 468 950 hb ## 3 1125 387 1050 hb ## 4 239 371 350 hb ## 5 701 371 750 hb ## 6 641 367 600 hb ## 7 1228 396 1075 hb ## 8 412 0 250 pb ## 9 953 0 700 pb ## 10 929 0 650 pb ## 11 1492 0 975 pb ## 12 419 0 350 pb ## 13 1010 0 950 pb ## 14 595 0 425 pb ## 15 1034 0 725 pb ``` ] ] .footnote[ .small[ These books are from the bookshelf of J. H. Maindonald at Australian National University. ] ] --- ## Book weight vs. volume <img src="tidy-up-models_files/figure-html/unnamed-chunk-34-1.png" width="75%" style="display: block; margin: auto;" /> --- ## Book weight vs. volume and cover <img src="tidy-up-models_files/figure-html/unnamed-chunk-35-1.png" width="75%" style="display: block; margin: auto;" /> --- class: middle .your-turn[ **Your turn:** Let's fit a simple multiple linear regression model! - Go to RStudio Cloud: [bit.ly/tidymodels-cloud](https://bit.ly/tidymodels-cloud) - Join the space and click on **Projects** on top - **Start** the project titled **Books** - Open the R Markdown document called `books.Rmd` and follow the instructions ]
10:00
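---

## Case study 1: a possible fit

For reference after the activity, a minimal sketch of the kind of model `books.Rmd` builds, using the same `linear_reg()` syntax as before (assumes the `allbacks` data shown earlier is available, e.g. via the DAAG package):

```r
library(tidymodels)
library(DAAG)  # assumed source of the `allbacks` data

books_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(weight ~ volume + cover, data = allbacks)

tidy(books_fit)    # coefficient estimates for volume and cover
glance(books_fit)  # model-level fit statistics
```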
--- class: middle # Case study 2: Predicting spam .hand[Leveraging (much of) the tidymodels framework to continue...] --- ## Spam filters .pull-left-narrow[ - Data from 3921 emails and 21 variables on them - Outcome: whether the email is spam or not - Predictors: number of characters, whether the email had "Re:" in the subject, time at which email was sent, number of times the word "inherit" shows up in the email, etc. ] .pull-right-wide[ .small[ ```r library(openintro) glimpse(email) ``` ``` ## Rows: 3,921 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,β¦ ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦ ## $ cc <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,β¦ ## $ sent_email <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,β¦ ## $ time <dttm> 2012-01-01 01:16:41, 2012-01-01 02β¦ ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,β¦ ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,β¦ ## $ dollar <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ winner <fct> no, no, no, no, no, no, no, no, no,β¦ ## $ inherit <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ password <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0,β¦ ## $ num_char <dbl> 11.370, 10.504, 7.773, 13.256, 1.23β¦ ## $ line_breaks <int> 202, 202, 192, 255, 29, 25, 193, 23β¦ ## $ format <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,β¦ ## $ re_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,β¦ ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, β¦ ## $ number <fct> big, small, small, small, none, nonβ¦ ``` ] ] --- class: middle .large[ .hand[first we warm the students up to the data...] ] --- .question[ Would you expect longer or shorter emails to be spam? 
] -- .pull-left[ ``` ## Picking joint bandwidth of 1.18 ``` <img src="tidy-up-models_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 2 x 2 ## spam mean_num_char ## <fct> <dbl> ## 1 0 11.3 ## 2 1 5.44 ``` ] --- .question[ Would you expect emails that have subjects starting with "Re:", "RE:", "re:", or "rE:" to be spam or not? ] -- <img src="tidy-up-models_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle .large[ .hand[then we introduce logistic regression and generalized linear models (GLMs) in general, build a few logistic regression models with one or two predictors, interpret results...] ] --- class: middle .your-turn[ **Demo:** Let's fit a logistic regression model! - Go to RStudio Cloud: [bit.ly/tidymodels-cloud](https://bit.ly/tidymodels-cloud) - Join the space (if you haven't yet done so) and click on **Projects** on top - **Start** the project titled **Spam** - Open the R Markdown document called `spam.Rmd` and follow along! 
] --- ## Predicting spam .panelset[ .panel[.panel-name[Output] .small[ ``` ## # A tibble: 22 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -9.09e+1 9.80e+3 -0.00928 9.93e- 1 ## 2 to_multiple1 -2.68e+0 3.27e-1 -8.21 2.25e-16 ## 3 from1 -2.19e+1 9.80e+3 -0.00224 9.98e- 1 ## 4 cc 1.88e-2 2.20e-2 0.855 3.93e- 1 ## 5 sent_email1 -2.07e+1 3.87e+2 -0.0536 9.57e- 1 ## 6 time 8.48e-8 2.85e-8 2.98 2.92e- 3 ## 7 image -1.78e+0 5.95e-1 -3.00 2.73e- 3 ## 8 attach 7.35e-1 1.44e-1 5.09 3.61e- 7 ## 9 dollar -6.85e-2 2.64e-2 -2.59 9.64e- 3 ## 10 winneryes 2.07e+0 3.65e-1 5.67 1.41e- 8 ## 11 inherit 3.15e-1 1.56e-1 2.02 4.32e- 2 ## 12 viagra 2.84e+0 2.22e+3 0.00128 9.99e- 1 ## 13 password -8.54e-1 2.97e-1 -2.88 4.03e- 3 ## 14 num_char 5.06e-2 2.38e-2 2.13 3.35e- 2 ## 15 line_breaks -5.49e-3 1.35e-3 -4.06 4.91e- 5 ## 16 format1 -6.14e-1 1.49e-1 -4.14 3.53e- 5 ## 17 re_subj1 -1.64e+0 3.86e-1 -4.25 2.16e- 5 ## 18 exclaim_subj 1.42e-1 2.43e-1 0.585 5.58e- 1 ## 19 urgent_subj1 3.88e+0 1.32e+0 2.95 3.18e- 3 ## 20 exclaim_mess 1.08e-2 1.81e-3 5.98 2.23e- 9 ## 21 numbersmall -1.19e+0 1.54e-1 -7.74 9.62e-15 ## 22 numberbig -2.95e-1 2.20e-1 -1.34 1.79e- 1 ``` ] ] .panel[.panel-name[Code] ```r logistic_reg() %>% set_engine("glm") %>% fit(spam ~ ., data = email, family = "binomial") %>% tidy() %>% print(n = 22) ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 ## occurred ``` ] ] --- ## Prediction - The mechanics of prediction is **easy**: - Plug in values of predictors to the model equation - Calculate the predicted value of the response variable, `\(\hat{y}\)` -- - Getting it right is **hard**! - There is no guarantee the model estimates you have are correct - Or that your model will perform as well with new data as it did with your sample data --- ## Spending our data - Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc. 
- Doing all of this on the entire data we have available can lead to **overfitting** - Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we've done so far) --- ## Splitting data - **Training set:** - Sandbox for model building - Spend most of your time using the training set to develop the model - Majority of the data (usually 80%) - **Testing set:** - Held in reserve to determine efficacy of one or two chosen models - Critical to look at it only once; otherwise it becomes part of the modeling process - Remainder of the data (usually 20%) --- ## Performing the split ```r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(1116) # Put 80% of the data into the training set email_split <- initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data <- training(email_split) test_data <- testing(email_split) ``` --- ## Peek at the split .small[ .pull-left[ ```r glimpse(train_data) ``` ``` ## Rows: 3,136 ## Columns: 21 ## $ spam <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,… ## $ to_multiple <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,… ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ cc <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ sent_email <fct> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ time <dttm> 2012-01-25 17:46:55, 2012-01-03 00… ## $ image <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ attach <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ dollar <dbl> 10, 0, 0, 0, 0, 0, 13, 0, 0, 0, 2, … ## $ winner <fct> no, no, no, no, no, no, no, yes, no… ## $ inherit <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ num_char <dbl> 23.308, 1.162, 4.732, 42.238, 1.228… ## $ line_breaks <int> 477, 2, 127, 712, 30, 674, 367, 226… ## $ format <fct> 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 1,β¦ ## $ re_subj <fct> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 12, 0, 2, 2, 2, 31, 2, 0, 0, 1, 0, β¦ ## $ number <fct> small, none, big, big, small, smallβ¦ ``` ] .pull-right[ ```r glimpse(test_data) ``` ``` ## Rows: 785 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ to_multiple <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,β¦ ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦ ## $ cc <int> 0, 1, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ sent_email <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ time <dttm> 2012-01-01 12:55:06, 2012-01-01 14β¦ ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ dollar <dbl> 0, 0, 5, 0, 0, 0, 0, 5, 4, 0, 0, 0,β¦ ## $ winner <fct> no, no, no, no, no, no, no, no, no,β¦ ## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,β¦ ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ password <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,β¦ ## $ num_char <dbl> 4.837, 15.075, 18.037, 45.842, 11.4β¦ ## $ line_breaks <int> 193, 354, 345, 881, 125, 24, 296, 1β¦ ## $ format <fct> 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0,β¦ ## $ re_subj <fct> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,β¦ ## $ exclaim_subj <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,β¦ ## $ exclaim_mess <dbl> 1, 10, 20, 5, 2, 0, 0, 0, 6, 0, 0, β¦ ## $ number <fct> big, small, small, big, small, noneβ¦ ``` ] ] --- ## Feature engineering - We prefer simple models when possible, but **parsimony** does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity -- - Variables that go into the model and how they are represented are just as critical to success of the model -- - **Feature engineering** allows us to get creative with our predictors in an effort to make 
them more useful for our model (to increase its predictive performance) --- ## A simple approach: `mutate()` ```r library(lubridate) train_data %>% mutate( date = date(time), dow = wday(time), month = month(time) ) %>% select(time, date, dow, month) %>% sample_n(size = 5) # shuffle to show a variety ``` ``` ## # A tibble: 5 x 4 ## time date dow month ## <dttm> <date> <dbl> <dbl> ## 1 2012-03-15 14:51:35 2012-03-15 5 3 ## 2 2012-03-03 09:24:02 2012-03-03 7 3 ## 3 2012-01-18 11:55:23 2012-01-18 4 1 ## 4 2012-02-24 23:08:59 2012-02-24 6 2 ## 5 2012-01-11 08:18:51 2012-01-11 4 1 ``` --- ## Modeling workflow, revisited - Create a **recipe** for feature engineering steps to be applied to the training data -- - Fit the model to the training data after these steps have been applied -- - Using the model estimates from the training data, predict outcomes for the test data -- - Evaluate the performance of the model on the test data --- ## You gotta keep them separated! - We might not want to use the data we have as is - We might want to "engineer features", omit certain variables, etc. - And we need a way of doing this in a way that can be applied to both training and testing data, ideally without making us do extra work - Enter **recipes** to help with this! 
--- ## Initiate a recipe ```r email_rec <- recipe( spam ~ ., # formula data = train_data # data to use for cataloguing names and types of variables ) summary(email_rec) ``` .xsmall[ ``` ## # A tibble: 21 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 to_multiple nominal predictor original ## 2 from nominal predictor original ## 3 cc numeric predictor original ## 4 sent_email nominal predictor original ## 5 time date predictor original ## 6 image numeric predictor original ## 7 attach numeric predictor original ## 8 dollar numeric predictor original ## 9 winner nominal predictor original ## 10 inherit numeric predictor original ## 11 viagra numeric predictor original ## 12 password numeric predictor original ## 13 num_char numeric predictor original ## 14 line_breaks numeric predictor original ## 15 format nominal predictor original ## 16 re_subj nominal predictor original ## 17 exclaim_subj numeric predictor original ## 18 urgent_subj nominal predictor original ## 19 exclaim_mess numeric predictor original ## 20 number nominal predictor original ## 21 spam nominal outcome original ``` ] --- ## Remove certain variables ```r email_rec <- email_rec %>% step_rm(from, sent_email) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ``` ] --- ## Feature engineer date ```r email_rec <- email_rec %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ``` ] --- ## Discretize numeric variables ```r email_rec <- email_rec %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete 
terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ``` ] --- ## Create dummy variables ```r email_rec <- email_rec %>% step_dummy(all_nominal(), -all_outcomes()) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ``` ] --- ## Remove zero variance variables Variables that contain only a single value ```r email_rec <- email_rec %>% step_zv(all_predictors()) ``` .small[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ## Zero variance filter on all_predictors() ``` ] --- ## All in one place ```r email_rec <- recipe(spam ~ ., data = email) %>% step_rm(from, sent_email) %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors()) ``` --- ## What's next? - We have our data split into testing and training - We know the recipe steps we want to apply to our testing and training data - We now need to define the model to fit and collect model results - Lots of bookkeeping is needed to make sure we're not training our model on testing data - Enter **workflows** to help with this! 
---

## Define model

```r
email_mod <- logistic_reg() %>%
  set_engine("glm")

email_mod
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## Define workflow

**Workflows** bring together models and recipes so that they can be easily applied to both the training and test data.

```r
email_wflow <- workflow() %>%
  add_model(email_mod) %>%
  add_recipe(email_rec)
```

.small[
```
## ── Workflow ──────────────────────────────────────────────
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ──────────────────────────────────────────
## 7 Recipe Steps
## 
## • step_rm()
## • step_date()
## • step_rm()
## • step_cut()
## • step_cut()
## • step_dummy()
## • step_zv()
## 
## ── Model ─────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```
]

---

## Fit model to training data

```r
email_fit <- email_wflow %>%
  fit(data = train_data)
```

---

.small[
```r
tidy(email_fit) %>%
  print(n = 31)
```

```
## # A tibble: 31 x 5
##    term               estimate std.error statistic  p.value
##    <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)        -0.892    2.49e-1    -3.58   3.37e- 4
##  2 image              -1.65     9.34e-1    -1.76   7.77e- 2
##  3 viagra              2.28     1.82e+2     0.0125 9.90e- 1
##  4 num_char            0.0470   2.44e-2     1.93   5.36e- 2
##  5 line_breaks        -0.00510  1.38e-3    -3.69   2.28e- 4
##  6 exclaim_subj       -0.204    2.77e-1    -0.736  4.62e- 1
##  7 exclaim_mess        0.00885  1.86e-3     4.75   1.99e- 6
##  8 to_multiple_X1     -2.60     3.54e-1    -7.35   2.06e-13
##  9 cc_X.1.68.         -0.312    4.90e-1    -0.638  5.24e- 1
## 10 attach_X.1.21.      2.05     3.68e-1     5.58   2.45e- 8
## 11 dollar_X.1.64.      0.214    2.17e-1     0.988  3.23e- 1
## 12 winner_yes          2.18     4.28e-1     5.08   3.68e- 7
## 13 inherit_X.1.5.     -9.21     7.65e+2    -0.0120 9.90e- 1
## 14 inherit_X.5.10.     2.51     1.44e+0     1.74   8.12e- 2
## 15 password_X.1.5.    -1.71     7.48e-1    -2.28   2.24e- 2
## 16 password_X.5.10.  -12.5      4.75e+2    -0.0263 9.79e- 1
## 17 password_X.10.2…  -13.7      8.14e+2    -0.0168 9.87e- 1
## 18 password_X.20.2…  -13.9      1.03e+3    -0.0135 9.89e- 1
## 19 format_X1          -0.916    1.59e-1    -5.77   7.79e- 9
## 20 re_subj_X1         -2.90     4.37e-1    -6.65   2.95e-11
## 21 urgent_subj_X1      3.52     1.08e+0     3.25   1.15e- 3
## 22 number_small       -0.895    1.67e-1    -5.35   8.75e- 8
## 23 number_big         -0.199    2.50e-1    -0.797  4.25e- 1
## 24 time_dow_Mon        0.0441   2.96e-1     0.149  8.82e- 1
## 25 time_dow_Tue        0.371    2.67e-1     1.39   1.64e- 1
## 26 time_dow_Wed       -0.133    2.72e-1    -0.488  6.26e- 1
## 27 time_dow_Thu        0.0392   2.77e-1     0.141  8.88e- 1
## 28 time_dow_Fri        0.0488   2.80e-1     0.174  8.62e- 1
## 29 time_dow_Sat        0.253    2.98e-1     0.849  3.96e- 1
## 30 time_month_Feb      0.784    1.80e-1     4.35   1.37e- 5
## 31 time_month_Mar      0.541    1.81e-1     2.99   2.79e- 3
```
]

---

## Make predictions for test data

```r
email_pred <- predict(email_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

email_pred
```

```
## # A tibble: 785 x 23
##   .pred_0 .pred_1 spam  to_multiple from     cc sent_email
##     <dbl>   <dbl> <fct> <fct>       <fct> <int> <fct>     
## 1   0.995 4.70e-3 0     1           1         0 1         
## 2   0.999 1.34e-3 0     0           1         1 1         
## 3   0.967 3.28e-2 0     0           1         0 0         
## 4   0.999 7.76e-4 0     0           1         1 0         
## 5   0.994 6.42e-3 0     0           1         4 0         
## 6   0.860 1.40e-1 0     0           1         0 0         
## # … with 779 more rows, and 16 more variables:
## #   time <dttm>, image <dbl>, attach <dbl>, dollar <dbl>,
## #   winner <fct>, inherit <dbl>, viagra <dbl>,
## #   password <dbl>, num_char <dbl>, line_breaks <int>,
## #   format <fct>, re_subj <fct>, exclaim_subj <dbl>,
## #   urgent_subj <fct>, exclaim_mess <dbl>, number <fct>
```

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_curve(
    truth = spam,
    .pred_1,
    event_level = "second"
  ) %>%
  autoplot()
```
]
.pull-right[
<img src="tidy-up-models_files/figure-html/unnamed-chunk-64-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.857
```
]
.pull-right[
<img src="tidy-up-models_files/figure-html/unnamed-chunk-66-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle

.large[
.hand[and we could keep going with performing cross validation, and much more...]
]

---
class: middle, inverse

## .larger[.dark-blue[.hand[Wrap up]]]

---
class: middle

# What about inference?

---

## Suppose...

Estimate the difference between the average evaluation score of male and female faculty.

.midi[
```r
library(openintro)
evals %>%
  select(score, rank, ethnicity, gender, bty_avg)
```

```
## # A tibble: 463 x 5
##   score rank         ethnicity    gender bty_avg
##   <dbl> <fct>        <fct>        <fct>    <dbl>
## 1   4.7 tenure track minority     female       5
## 2   4.1 tenure track minority     female       5
## 3   3.9 tenure track minority     female       5
## 4   4.8 tenure track minority     female       5
## 5   4.6 tenured      not minority male         3
## 6   4.3 tenured      not minority male         3
## # … with 457 more rows
```
]

---

.pull-left[
.midi[
```r
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  summarise(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )
```

```
## # A tibble: 1 x 2
##        l     u
##    <dbl> <dbl>
## 1 0.0445 0.243
```
]
]
.pull-right[
.midi[
```r
t.test(evals$score ~ evals$gender)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  evals$score by evals$gender
## t = -2.7507, df = 398.7, p-value = 0.006218
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -0.24264375 -0.04037194
## sample estimates:
## mean in group female   mean in group male 
##             4.092821             4.234328
```
]
]

---

## infer `\(\in\)` tidymodels

.pull-left[
The objective of **infer** is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
]
.pull-right[
<img src="images/infer.png" title="Hex sticker for the infer package." alt="Hex sticker for the infer package." width="60%" style="display: block; margin: auto;" />
]

---

.midi[
```r
evals %>%
  specify(score ~ gender)
```

```
## Response: score (numeric)
## Explanatory: gender (factor)
## # A tibble: 463 x 2
##   score gender
##   <dbl> <fct> 
## 1   4.7 female
## 2   4.1 female
## 3   3.9 female
## 4   4.8 female
## 5   4.6 male  
## 6   4.3 male  
## # … with 457 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap")
```

```
## Response: score (numeric)
## Explanatory: gender (factor)
## # A tibble: 463,000 x 3
## # Groups:   replicate [1,000]
##   replicate score gender
##       <int> <dbl> <fct> 
## 1         1   4   female
## 2         1   3.1 male  
## 3         1   5   male  
## 4         1   4.4 male  
## 5         1   3.5 female
## 6         1   4.5 female
## # … with 462,994 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female"))
```

```
## # A tibble: 1,000 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1 0.230
## 2         2 0.134
## 3         3 0.100
## 4         4 0.230
## 5         5 0.128
## 6         6 0.201
## # … with 994 more rows
```
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  visualise()
```

<img src="tidy-up-models_files/figure-html/infer-4-1.png" width="60%" style="display: block; margin: auto;" />
]

---

.midi[
```r
set.seed(1234)
evals %>%
  specify(score ~ gender) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("male", "female")) %>%
  summarise(l = quantile(stat, 0.025), u = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##        l     u
##    <dbl> <dbl>
## 1 0.0407 0.236
```
]

---

## Learning resources

.large[
- **Get started** with tidymodels: [tidymodels.org/start](https://www.tidymodels.org/start/)
- **Learn** more and go further: [tidymodels.org/learn](https://www.tidymodels.org/learn/)
- **Tidy Modeling with R** by Max Kuhn & Julia Silge: [tmwr.org](https://www.tmwr.org/)
]

---

## Teaching resources

**Data Science in a Box** ([datasciencebox.org](https://datasciencebox.org/)) contains slides, application exercises, computing labs, and homework assignments on modelling and inference with tidymodels:

.midi[
- Slides and application exercises:
  - [Modelling data](https://datasciencebox.org/making-rigorous-conclusions.html#modelling-data)
  - [Classification and model building](https://datasciencebox.org/making-rigorous-conclusions.html#classification-and-model-building)
  - [Model validation](https://datasciencebox.org/making-rigorous-conclusions.html#model-validation)
  - [Uncertainty quantification](https://datasciencebox.org/making-rigorous-conclusions.html#uncertainty-quantification)
- [Labs](https://datasciencebox.org/making-rigorous-conclusions.html#labs-2):
  - Grading the professor: Fitting and interpreting simple linear regression models
  - Smoking while pregnant: Constructing confidence intervals, conducting hypothesis tests, and interpreting results in context of the data
- [Homework assignments](https://datasciencebox.org/making-rigorous-conclusions.html#homework-assignments-1):
  - Bike rentals in DC: Exploratory data analysis and fitting and interpreting models
  - Exploring the GSS: Fitting and interpreting models
  - Modelling the GSS: Model validation and inference
]

---
class: center, middle

.huge[
.hand[thank you!]
]

.large[
[bit.ly/tidymodels-uscots21](https://bit.ly/tidymodels-uscots21)
]
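
---

## Appendix: a cross validation sketch

The slides note that we could keep going with cross validation. A minimal sketch of how that might look, assuming the `email_wflow` and `train_data` objects from earlier and using **rsample** and **tune** (not shown in the workshop itself):

```r
set.seed(345)

# Split the training data into 5 cross validation folds
folds <- vfold_cv(train_data, v = 5)

# Fit the workflow to each fold's analysis set and
# assess on each fold's assessment set
email_fit_rs <- email_wflow %>%
  fit_resamples(folds)

# Average performance metrics across the folds
collect_metrics(email_fit_rs)
```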