tidyverse to tidymodels

# tidyverse to tidymodels
### <a href="https://introds.org/">introds.org</a>
### Dr. Mine Çetinkaya-Rundel UoE | RStudio | Duke

---

## tidyverse

- The **tidyverse** is an opinionated collection of R packages designed for data science
- All packages share an underlying philosophy and a common grammar
]

---

---

## tidymodels

- The **tidymodels** framework is a collection of packages for modeling and machine learning using **tidyverse** principles.
- All packages share an underlying philosophy and a common grammar
]

---

---

## Data science cycle

---

---

# Import

---

## 🏁 Start with

---

## End with 🛑

```
## # A tibble: 207 x 6
## title date location abstract text url 
## <chr> <date> <chr> <chr> <chr> <chr> 
## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presid… https:/…
## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello.… https:/…
## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will… https:/…
## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presid… https:/…
## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will… https:/…
## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood… https:/…
## 7 Coronavi… 2021-02-23 Scottish … Statement g… "Presid… https:/…
## 8 Coronavi… 2021-02-22 St Andrew… Statement g… "\nGood… https:/…
## 9 Coronavi… 2021-02-18 St Andrew… Statement g… "\nAs u… https:/…
## 10 Coronavi… 2021-02-17 Scottish … Parliamenta… "I will… https:/…
## 11 Coronavi… 2021-02-16 Scottish … Parliamenta… "Presid… https:/…
## 12 Coronavi… 2021-02-15 St Andrew… Statement g… "\nGood… https:/…
## 13 Coronavi… 2021-02-11 St Andrew… Statement g… "\nGood… https:/…
## 14 Coronavi… 2021-02-10 Scottish … Parliamenta… "I will… https:/…
## 15 Coronavi… 2021-02-09 St Andrew… Statement g… "\nFirs… https:/…
## # … with 192 more rows
```

---

#### .center[
[www.gov.scot/collections/first-ministers-speeches](https://www.gov.scot/collections/first-ministers-speeches/)
]

---

---

## Plan: Get data from a single page

1. Scrape `title`, `date`, `location`, `abstract`, and `text` from a few COVID-19 speech pages to develop the code

2. Write a function that scrapes `title`, `date`, `location`, `abstract`, and `text` from COVID-19 speech pages

3. Scrape the `url`s of COVID-19 speeches from the main page

4. Use this function to scrape from each individual COVID-19 speech from these `url`s and create a data frame with the columns `title`, `date`, `location`, `abstract`, `text`, and `url`

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straight forward
- It's designed to work with pipelines built with `%>%`

```r
library(rvest)
```

]
.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Read page for 26 Oct speech

```r
url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/"
speech_page <- read_html(url)
```

```r
speech_page
```

```
## {html_document}
## <html dir="ltr" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ...
```
]
.pull-right[
<img src="img/fm-speech-oct-26.png" width="80%" style="display: block; margin: auto;" />
]

---

## Extract title

```r
title <- speech_page %>%
 html_node(".article-header__title") %>%
 html_text()

title
```

```
## [1] "Coronavirus (COVID-19) update: First Minister's speech 26 October"
```
]
.pull-right-narrow[
<img src="img/title.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract date

```r
library(lubridate)

speech_page %>%
    html_node(".content-data__list:nth-child(1) strong") %>%
    html_text() 
```

```
## [1] "26 Oct 2020"
```

```r
date <- speech_page %>%
 html_node(".content-data__list:nth-child(1) strong") %>%
 html_text() %>%
 dmy()
date
```

```
## [1] "2020-10-26"
```
]
.pull-right-narrow[
<img src="img/date.png" width="100%" style="display: block; margin: auto;" />
]

---

## Similarly...

extract location, abstract, and text

---

## Put it all in a data frame

```r
oct_26_speech <- tibble(
 title = title,
 date = date,
 location = location,
 abstract = abstract,
 text = text,
 url = url
)

oct_26_speech
```

```
## # A tibble: 1 x 6
## title date location abstract text url 
## <chr> <date> <chr> <chr> <lis> <chr> 
## 1 Coronaviru… 2020-10-26 St Andrew… Statement g… <chr… https://w…
```
]
.pull-right[
<img src="img/fm-speech-oct-26.png" width="75%" style="display: block; margin: auto;" />
]

---

## Plan: Get data from all pages

- Write a function that scrapes the data from a single page and returns a data frame with a single row for that page

- Obtain a list of URLs of all pages

- Map the function over the list of all URLs to obtain a data framw where each row is a single speech and the number of rows is the number of speeches

```
## # A tibble: 207 x 6
## title date location abstract text url 
## <chr> <date> <chr> <chr> <chr> <chr> 
## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presidi… https:/…
## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello. … https:/…
## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will … https:/…
## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presidi… https:/…
## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will … https:/…
## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood … https:/…
## # … with 201 more rows
```

---

## Write a function

```r
scrape_speech_scot <- function(url){
 
 speech_page <- read_html(url)
 
 title <- speech_page %>%
 html_node(".article-header__title") %>%
 html_text()
 
 date <- speech_page %>%
 html_node(".content-data__list:nth-child(1) strong") %>%
 html_text() %>%
 dmy()
 
 location <- speech_page %>%
 html_node(".content-data__list+ .content-data__list strong") %>%
 html_text()
 
 abstract <- speech_page %>%
 html_node(".leader--first-para p") %>%
 html_text()
 
 text <- speech_page %>% 
 html_nodes("#preamble p") %>%
 html_text() %>%
 glue_collapse(sep = " ") %>%
 as.character()
 
 tibble(
 title = title,
 date = date,
 location = location,
 abstract = abstract,
 text = text,
 url = url
 )
 
}
```
]

---

## Get a list of all URLs

```r
all_speeches_page_scot <- read_html("https://www.gov.scot/collections/first-ministers-speeches/")

covid_speech_urls_uk_scot <- all_speeches_page_scot %>%
 html_nodes(".collections-list a") %>%
 html_attr("href") %>%
 str_subset("covid-19") %>%
 str_c("https://www.gov.scot", .)
```

---

## Map the function over all URLs

```r
covid_speeches_scot <- map_dfr(covid_speech_urls_uk_scot, scrape_speech_scot)
```

```r
covid_speeches_scot
```

---

# Transform and visualise

---

## Filter for First minister speeches

```r
covid_speeches_scot <- covid_speeches_scot %>%
 filter(str_detect(abstract, "First Minister"))

covid_speeches_scot
```

```
## # A tibble: 204 x 6
## title date location abstract text url 
## <chr> <date> <chr> <chr> <chr> <chr> 
## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presidi… https:/…
## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello. … https:/…
## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will … https:/…
## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presidi… https:/…
## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will … https:/…
## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood … https:/…
## # … with 198 more rows
```

---

## Count number of words in each speech

```r
covid_speeches_scot <- covid_speeches_scot %>%
 rowwise() %>%
 mutate(n_words = text %>% str_count("\\w+") %>% sum()) %>%
 ungroup()

covid_speeches_scot
```

```
## # A tibble: 204 x 7
## title date location abstract text url n_words
## <chr> <date> <chr> <chr> <chr> <chr> <int>
## 1 Coronav… 2021-03-09 Scottish… Statemen… "Presi… https:… 3355
## 2 Coronav… 2021-03-05 Scottish… Parliame… "Hello… https:… 1199
## 3 Coronav… 2021-03-04 Scottish… Parliame… "I wil… https:… 561
## 4 Coronav… 2021-03-02 Scottish… Statemen… "Presi… https:… 2365
## 5 Coronav… 2021-02-25 Scottish… Statemen… "I wil… https:… 700
## 6 Coronav… 2021-02-24 St Andre… Statemen… "\nGoo… https:… 2466
## # … with 198 more rows
```

---

## Length of speech over time

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/words-over-time-1.png" width="60%" style="display: block; margin: auto;" />

]

```r
ggplot(covid_speeches_scot, aes(x = date, y = n_words)) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(x = date, y = n_words), method = lm, formula = y ~ x) +
  labs(
    title = "Length of Scotland COVID-19 speeches",
    subtitle = "Measured in number of words"
  )
```
]
]

---

## Length of speech over time, again

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/words-over-time-better-1.png" width="60%" style="display: block; margin: auto;" />
]

```r
# set color: https://www.schemecolor.com/flag-of-scotland-colors.php
*scotblue <- "#0065BF"

covid_speeches_scot %>%
  ggplot(aes(x = date, y = n_words)) +
* geom_point(alpha = 0.7, color = scotblue) +
* geom_smooth(aes(x = date, y = n_words), method = lm, formula = y ~ x,  color = "darkgray") +
  labs(
    title = "Length of Scotland COVID-19 speeches over time",
    subtitle = "Measured in number of words",
*   x = NULL,
*   y = "Number of words"
  ) +
* theme_minimal()
```
]
]

---

## tidytext

.pull-left[
- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
- Learn more at [tidytextmining.com](https://www.tidytextmining.com/)

```r
library(tidytext)
```

]
.pull-right[
<img src="img/tidytext.png" width="60%" style="display: block; margin: auto auto auto 0;" />
]

---

## Tokenize speeches by word

```
## # A tibble: 422,860 x 7
## date word title location abstract url n_words
## <date> <chr> <chr> <chr> <chr> <chr> <int>
## 1 2021-03-09 presi… Coronav… Scottish… Statement… https:… 3355
## 2 2021-03-09 offic… Coronav… Scottish… Statement… https:… 3355
## 3 2021-03-09 i Coronav… Scottish… Statement… https:… 3355
## 4 2021-03-09 will Coronav… Scottish… Statement… https:… 3355
## 5 2021-03-09 update Coronav… Scottish… Statement… https:… 3355
## 6 2021-03-09 parli… Coronav… Scottish… Statement… https:… 3355
## # … with 422,854 more rows
```
]

```r
covid_speeches_scot_words <- covid_speeches_scot %>%
 # make sure COVID-19 (and all its various spellings) don't get split
 # tidytext doesn't remove underscores
 # https://stackoverflow.com/questions/58281091/preserve-hyphenated-words-in-ngrams-analysis-with-tidytext
 mutate(
 text = str_replace_all(text, "COVID-19", "COVID_19"),
 text = str_replace_all(text, "COVID 19", "COVID_19"),
 text = str_replace_all(text, "Covid-19", "COVID_19"),
 text = str_replace_all(text, "Covid 19", "COVID_19")
 ) %>%
 unnest_tokens(word, text) %>%
 relocate(date, word)

covid_speeches_scot_words
```
]
]

---

## Find common words

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/scot-common-words-1.png" width="60%" style="display: block; margin: auto;" />

]

```r
covid_speeches_scot_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 15) %>%
  ggplot(aes(y = fct_reorder(word, n), x = n, fill = log(n))) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Commonly used words in Scotland COVID-19 briefings",
    y = NULL, x = "Frequency"
  ) +
  theme_minimal()
```
]
]

---

## Social vs. physical distancing

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/social-physical-1.png" width="60%" style="display: block; margin: auto;" />

]

```r
covid_speeches_scot %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(str_detect(bigram, "social dist|physical dist")) %>%
  mutate(soc_phys = if_else(str_detect(bigram, "social"), "S", "P")) %>%
  count(date, soc_phys) %>%
  ggplot(aes(x = date, y = n, color = soc_phys)) +
  geom_text(aes(label = soc_phys)) +
  guides(color = FALSE) +
  labs(x = "Date", y = "Frequency",
       title = "Social (S) vs. physical (P) distancing",
       subtitle = "Number of mentions over time") +
  scale_color_manual(values = c(scotblue, "darkgray")) +
  scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +
  theme_minimal()
```
]
]

---

## Vaccines

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/vaccines-1.png" width="60%" style="display: block; margin: auto;" />

]

```r
covid_speeches_scot_words %>%
  filter(str_detect(word, "[Vv]accin")) %>%
  count(date) %>%
  ggplot(aes(x = date, y = n)) +
  geom_text(aes(label = "💉", size = n), show.legend = FALSE) +
  labs(x = "Date", y = "Frequency",
       title = 'Number of times "vaccine" is mentioned in speech') +
  theme_minimal()
```
]
]

---

## Compare to UK

---

# Model

---

## Predicting UK vs. Scotland

```r
covid_speeches %>%
  count(origin)
```

```
## # A tibble: 2 x 2
## origin n
## * <fct> <int>
## 1 Scotland 207
## 2 UK 40
```

---

## Tokenize into sentences

```r
covid_speeches_sentences <- covid_speeches %>%
 unnest_tokens(sentence, text, token = "sentences")

covid_speeches_sentences %>%
  relocate(sentence)
```

```
## # A tibble: 20,445 x 4
## sentence speech_id date origin
## <chr> <chr> <date> <fct> 
## 1 presiding officer i will update pa… 1 2021-03-09 Scotl…
## 2 these changes relate to outdoor me… 1 2021-03-09 Scotl…
## 3 i will also announce a change whic… 1 2021-03-09 Scotl…
## 4 now while the changes i set out to… 1 2021-03-09 Scotl…
## 5 and they do represent gradual, but… 1 2021-03-09 Scotl…
## 6 next week, i will set out a firmer… 1 2021-03-09 Scotl…
## # … with 20,439 more rows
```

---

## Split into testing and training

```r
set.seed(1234)
covid_split <- initial_split(covid_speeches_sentences, strata = origin)
covid_train <- training(covid_split)
covid_test <- testing(covid_split)
```

---

## Specify a model

```r
lasso_mod <- logistic_reg(penalty = 0.005, mixture = 1) %>%
 set_engine("glmnet")
```

---

## Build a recipe

```r
library(textrecipes)

covid_rec <- recipe(origin ~ sentence, data = covid_train) %>%
 # tokenize into words
 step_tokenize(sentence, token = "words") %>%
 # filter out stop words
 step_stopwords(sentence) %>%
 # all the 1-grams followed by all the 2-grams followed by all the 3-grams
 step_ngram(sentence, num_tokens = 3, min_num_tokens = 1) %>%
 # keep the 500 most frequent words to avoid creating too many variables 
 step_tokenfilter(sentence, max_tokens = 500) %>%
 # calculate tf-idf
 step_tfidf(sentence)
```

---

## Build a workflow

```r
covid_wflow <- workflow() %>%
 add_model(lasso_mod) %>%
 add_recipe(covid_rec)
```

---

## Perform cross-validation

```r
set.seed(1234)
covid_folds <- vfold_cv(covid_train, v = 10, strata = origin)
```

```r
covid_folds
```

```
## # 10-fold cross-validation using stratification 
## # A tibble: 10 x 2
## splits id 
## <list> <chr> 
## 1 <split [13800/1534]> Fold01
## 2 <split [13800/1534]> Fold02
## 3 <split [13800/1534]> Fold03
## 4 <split [13800/1534]> Fold04
## 5 <split [13801/1533]> Fold05
## 6 <split [13801/1533]> Fold06
## # … with 4 more rows
```

---

## Fit resamples

```r
covid_fit_rs <- covid_wflow %>%
 fit_resamples(
 covid_folds,
 control = control_resamples(save_pred = TRUE)
 )
```

```r
covid_train_metrics <- collect_metrics(covid_fit_rs)
covid_train_metrics
```

```
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config 
## <chr> <chr> <dbl> <int> <dbl> <chr> 
## 1 accuracy binary 0.918 10 0.00234 Preprocessor1_Model1
## 2 roc_auc binary 0.784 10 0.00558 Preprocessor1_Model1
```

---

## ROC curve

```r
covid_train_pred <- collect_predictions(covid_fit_rs)

covid_train_pred %>%
  group_by(id) %>%
  roc_curve(truth = origin, .pred_Scotland) %>%
  autoplot() +
  labs(
    title = "ROC curve for Scotland & UK COVID speeches",
    subtitle = "Each resample fold is shown in a different color"
  )
```

<img src="tidyverse-tidymodels_files/figure-html/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" />
]

---

## Make predictions for test data

```r
covid_fit <- covid_wflow %>%
 fit(data = covid_train)

covid_test_pred <- predict(covid_fit, new_data = covid_test, type = "prob") %>%
 bind_cols(covid_test %>% select(origin, speech_id, sentence))

covid_test_pred
```

```
## # A tibble: 5,111 x 5
## .pred_Scotland .pred_UK origin speech_id sentence 
## <dbl> <dbl> <fct> <chr> <chr> 
## 1 0.927 0.0726 Scotla… 1 i will though be spe…
## 2 0.961 0.0387 Scotla… 1 and the total number…
## 3 0.905 0.0950 Scotla… 1 and then in two week…
## 4 0.884 0.116 Scotla… 1 current plans for 23…
## 5 0.940 0.0599 Scotla… 1 in addition, i know …
## 6 0.940 0.0599 Scotla… 1 and so i think it is…
## # … with 5,105 more rows
```

---

## ROC curve

```r
covid_test_pred %>%
  roc_curve(truth = origin, .pred_Scotland) %>%
  autoplot()
```

---

## View predictions

```r
covid_test_pred %>%
  roc_auc(truth = origin, .pred_Scotland)
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.762
```

```r
covid_test_pred %>% 
  filter(origin == "Scotland", .pred_UK > 0.5)
```

```
## # A tibble: 14 x 5
## .pred_Scotland .pred_UK origin speech_id sentence 
## <dbl> <dbl> <fct> <chr> <chr> 
## 1 0.351 0.649 Scotla… 19 we must guard agains…
## 2 0.472 0.528 Scotla… 37 and if that happens,…
## 3 0.346 0.654 Scotla… 51 we are seeing simila…
## 4 0.354 0.646 Scotla… 58 we must drive the in…
## 5 0.408 0.592 Scotla… 68 the chancellors’ com…
## 6 0.470 0.530 Scotla… 105 i will announce furt…
## # … with 8 more rows
```

---

## What decisions did we make?

without thinking about it much...

- `step_tokenfilter(sentence, max_tokens = 500)` -- why 500 for max_tokens?
- `logistic_reg(penalty = 0.005, mixture = 1)` -- why 0.005 for penalty?

---

## Tuning

### Specify model

```r
## with penalty ??
lasso_mod_tune <- logistic_reg(penalty = tune(), mixture = 1) %>%
 set_engine("glmnet") %>% 
 set_mode("classification")
```

---

## Tuning

### Build recipe

```r
covid_rec_tune <- recipe(origin ~ sentence, data = covid_train) %>%
 step_tokenize(sentence, token = "words") %>%
 step_stopwords(sentence) %>%
 step_ngram(sentence, num_tokens = 3, min_num_tokens = 1) %>%
 # keep the ?? most frequent words to avoid creating too many variables 
 step_tokenfilter(sentence, max_tokens = tune(), min_times = 5) %>%
 step_tfidf(sentence)
```

---

## Tuning

### Build workflow

```r
covid_wflow_tune <- workflow() %>%
 add_model(lasso_mod_tune) %>%
 add_recipe(covid_rec_tune)
```

---

## Tuning

### Possible set of hyperparameters

```r
param_grid <- grid_regular(
 penalty(range = c(-4, 0)),
 max_tokens(range = c(500, 2000)),
 levels = 5
)

param_grid
```

```
## # A tibble: 25 x 2
## penalty max_tokens
## <dbl> <int>
## 1 0.0001 500
## 2 0.001 500
## 3 0.01 500
## 4 0.1 500
## 5 1 500
## 6 0.0001 875
## # … with 19 more rows
```

---

## Tuning

### Train models with all possible set of hyperparameters

```r
set.seed(24)
covid_fit_rs_tune <- tune_grid(
 covid_wflow_tune,
 resamples = covid_folds,
 grid = param_grid, 
 control = control_grid(save_pred = TRUE)
)
```

---

## Tuning

### Results

```r
covid_fit_rs_tune
```

```
## Warning: This tuning result has notes. Example notes on model fitting include:
## preprocessor 2/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with'
## preprocessor 1/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with'
## preprocessor 5/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with'
```

```
## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 x 5
## splits id .metrics .notes .predictions 
## <list> <chr> <list> <list> <list> 
## 1 <split [13800… Fold01 <tibble [50… <tibble [5… <tibble [38,350…
## 2 <split [13800… Fold02 <tibble [50… <tibble [5… <tibble [38,350…
## 3 <split [13800… Fold03 <tibble [50… <tibble [5… <tibble [38,350…
## 4 <split [13800… Fold04 <tibble [50… <tibble [5… <tibble [38,350…
## 5 <split [13801… Fold05 <tibble [50… <tibble [5… <tibble [38,325…
## 6 <split [13801… Fold06 <tibble [50… <tibble [5… <tibble [38,325…
## # … with 4 more rows
```

---

## Tuning

### Select best

```r
best_roc_auc <- select_best(covid_fit_rs_tune, "roc_auc")

best_roc_auc
```

```
## # A tibble: 1 x 3
## penalty max_tokens .config 
## <dbl> <int> <chr> 
## 1 0.001 2000 Preprocessor5_Model2
```

```r
covid_wflow_final <- finalize_workflow(covid_wflow_tune, best_roc_auc)
```

---

## Variable importance

```r
library(vip)

vi_data <- covid_wflow_final %>%
 fit(covid_train) %>%
 pull_workflow_fit() %>%
 vi(lambda = best_roc_auc$penalty) %>%
 mutate(Variable = str_remove_all(Variable, "tfidf_sentence_")) %>%
 filter(Importance != 0)
```

```r
vi_data
```

```
## # A tibble: 1,088 x 3
## Variable Importance Sign 
## <chr> <dbl> <chr>
## 1 covid_secure 14.9 POS 
## 2 scotland 8.29 NEG 
## 3 coronavirus 7.34 POS 
## 4 british 7.29 POS 
## 5 r 6.59 POS 
## 6 r_number 6.55 NEG 
## # … with 1,082 more rows
```

---

## Variable importance, visualised

.panelset[
.panel[.panel-name[Plot]
<img src="tidyverse-tidymodels_files/figure-html/vip-1.png" width="60%" style="display: block; margin: auto;" />

]

```r
vi_data %>%
  mutate(Importance = abs(Importance)) %>%
  filter(Importance != 0) %>%
  group_by(Sign) %>%
  slice_head(n = 20) %>%
  ungroup() %>%
  mutate(pred_origin = if_else(Sign == "POS", "UK", "Scotland")) %>% 
  ggplot(aes(x = Importance, y = fct_reorder(Variable, Importance), fill = pred_origin )) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c(scotblue, ukred)) +
  facet_wrap(~pred_origin, scales = "free") +
  labs(y = NULL) +
  theme_minimal()
```
]
]

---

## Prediction

```r
scot_sentence$sentence
```

```
## [1] "all local authorities will - at least until easter - continue to observe the current requirement for 2m physical distancing in secondary schools."
```

```r
scot_sentence %>%
  tidytext::unnest_tokens(words, sentence) %>%
  left_join(vi_data, by = c("words" = "Variable")) %>%
  mutate(pred_origin = if_else(Sign == "NEG", "Scotland", "UK")) %>%
  print(n = 25)
```

```
## # A tibble: 21 x 7
## speech_id date origin words Importance Sign pred_origin
## <chr> <date> <fct> <chr> <dbl> <chr> <chr> 
## 1 4 2021-03-02 Scotl… all NA <NA> <NA> 
## 2 4 2021-03-02 Scotl… local 4.21 POS UK 
## 3 4 2021-03-02 Scotl… auth… 0.453 NEG Scotland 
## 4 4 2021-03-02 Scotl… will NA <NA> <NA> 
## 5 4 2021-03-02 Scotl… at NA <NA> <NA> 
## 6 4 2021-03-02 Scotl… least 1.34 POS UK 
## 7 4 2021-03-02 Scotl… until NA <NA> <NA> 
## 8 4 2021-03-02 Scotl… east… 0.0537 NEG Scotland 
## 9 4 2021-03-02 Scotl… cont… NA <NA> <NA> 
## 10 4 2021-03-02 Scotl… to NA <NA> <NA> 
## 11 4 2021-03-02 Scotl… obse… NA <NA> <NA> 
## 12 4 2021-03-02 Scotl… the NA <NA> <NA> 
## 13 4 2021-03-02 Scotl… curr… NA <NA> <NA> 
## 14 4 2021-03-02 Scotl… requ… NA <NA> <NA> 
## 15 4 2021-03-02 Scotl… for NA <NA> <NA> 
## 16 4 2021-03-02 Scotl… 2m NA <NA> <NA> 
## 17 4 2021-03-02 Scotl… phys… 0.478 NEG Scotland 
## 18 4 2021-03-02 Scotl… dist… NA <NA> <NA> 
## 19 4 2021-03-02 Scotl… in NA <NA> <NA> 
## 20 4 2021-03-02 Scotl… seco… 1.46 POS UK 
## 21 4 2021-03-02 Scotl… scho… 1.51 POS UK
```
]

---

## Prediction

```r
uk_sentence$sentence
```

```
## [1] "we have now vaccinated 1.26 million people in england, 113,000 in scotland, 49,000 in wales, and 46,000 in northern ireland."
```

```r
uk_sentence %>%
  tidytext::unnest_tokens(words, sentence) %>%
  left_join(vi_data, by = c("words" = "Variable")) %>%
  mutate(pred_origin = if_else(Sign == "NEG", "Scotland", "UK")) %>%
  print(n = 25)
```

```
## # A tibble: 20 x 7
## speech_id date origin words Importance Sign pred_origin
## <chr> <date> <fct> <chr> <dbl> <chr> <chr> 
## 1 8 2021-01-07 UK we NA <NA> <NA> 
## 2 8 2021-01-07 UK have NA <NA> <NA> 
## 3 8 2021-01-07 UK now 0.898 POS UK 
## 4 8 2021-01-07 UK vacc… 0.767 POS UK 
## 5 8 2021-01-07 UK 1.26 NA <NA> <NA> 
## 6 8 2021-01-07 UK mill… 1.72 POS UK 
## 7 8 2021-01-07 UK peop… 0.274 POS UK 
## 8 8 2021-01-07 UK in NA <NA> <NA> 
## 9 8 2021-01-07 UK engl… 3.96 POS UK 
## 10 8 2021-01-07 UK 113,… NA <NA> <NA> 
## 11 8 2021-01-07 UK in NA <NA> <NA> 
## 12 8 2021-01-07 UK scot… 8.29 NEG Scotland 
## 13 8 2021-01-07 UK 49,0… NA <NA> <NA> 
## 14 8 2021-01-07 UK in NA <NA> <NA> 
## 15 8 2021-01-07 UK wales NA <NA> <NA> 
## 16 8 2021-01-07 UK and NA <NA> <NA> 
## 17 8 2021-01-07 UK 46,0… NA <NA> <NA> 
## 18 8 2021-01-07 UK in NA <NA> <NA> 
## 19 8 2021-01-07 UK nort… 5.37 POS UK 
## 20 8 2021-01-07 UK irel… 0.202 NEG Scotland
```
]

---

## Acknowledgements & learn more

- Read the full case study, with code: https://github.com/mine-cetinkaya-rundel/fm-speeches-covid19

- Learn tidyverse: https://www.tidyverse.org/learn

- Learn tidymodels: https://www.tidymodels.org/learn

- Much of this was inspired by Julia Silge and Emil Hvitfeldt's useR tutorial:

https://emilhvitfeldt.github.io/useR2020-text-modeling-tutorial