class: center, middle, inverse, title-slide # tidyverse to tidymodels ###
introds.org
###
Dr. Mine Çetinkaya-Rundel
UoE | RStudio | Duke --- ## tidyverse .pull-left[ <img src="img/tidyverse.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ .center[.large[ [tidyverse.org](https://www.tidyverse.org/) ]] - The **tidyverse** is an opinionated collection of R packages designed for data science - All packages share an underlying philosophy and a common grammar ] --- class: middle <img src="img/tidyverse-packages.png" width="80%" style="display: block; margin: auto;" /> --- ## tidymodels .pull-left[ <img src="img/tidymodels.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ .center[.large[ [tidymodels.org](https://www.tidymodels.org/) ]] - The **tidymodels** framework is a collection of packages for modeling and machine learning using **tidyverse** principles. - All packages share an underlying philosophy and a common grammar ] --- class: middle <img src="img/tidymodels-packages.png" width="80%" style="display: block; margin: auto;" /> --- ## Data science cycle <img src="img/data-science.png" width="80%" style="display: block; margin: auto;" /> .footnote[ [R for Data Science](https://r4ds.had.co.nz/introduction.html), Grolemund and Wickham. ] --- class: middle <img src="img/fm-speech-oct-26.png" width="70%" style="display: block; margin: auto;" /> --- class: middle # Import --- ## 🏁 Start with <img src="img/fm-speeches.png" width="75%" style="display: block; margin: auto;" /> --- ## End with 🛑 ``` ## # A tibble: 207 x 6 ## title date location abstract text url ## <chr> <date> <chr> <chr> <chr> <chr> ## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presid… https:/… ## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello.… https:/… ## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will… https:/… ## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presid… https:/… ## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will… https:/… ## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood… https:/… ## 7 Coronavi… 2021-02-23 Scottish … Statement g… "Presid… https:/… ## 8 Coronavi… 2021-02-22 St Andrew… Statement g… "\nGood… https:/… ## 9 Coronavi… 2021-02-18 St Andrew… Statement g… "\nAs u… https:/… ## 10 Coronavi… 2021-02-17 Scottish … Parliamenta… "I will… https:/… ## 11 Coronavi… 2021-02-16 Scottish … Parliamenta… "Presid… https:/… ## 12 Coronavi… 2021-02-15 St Andrew… Statement g… "\nGood… https:/… ## 13 Coronavi… 2021-02-11 St Andrew… Statement g… "\nGood… https:/… ## 14 Coronavi… 2021-02-10 Scottish … Parliamenta… "I will… https:/… ## 15 Coronavi… 2021-02-09 St Andrew… Statement g… "\nFirs… https:/… ## # … with 192 more rows ``` --- #### .center[ [www.gov.scot/collections/first-ministers-speeches](https://www.gov.scot/collections/first-ministers-speeches/) ] <img src="img/fm-speeches-annotated.png" width="75%" style="display: block; margin: auto;" /> --- <img src="img/fm-speech-oct-26-annotated.png" width="65%" style="display: block; margin: auto;" /> --- ## Plan: Get data from a single page 1. Scrape `title`, `date`, `location`, `abstract`, and `text` from a few COVID-19 speech pages to develop the code 2. Write a function that scrapes `title`, `date`, `location`, `abstract`, and `text` from COVID-19 speech pages 3. Scrape the `url`s of COVID-19 speeches from the main page 4. Use this function to scrape from each individual COVID-19 speech from these `url`s and create a data frame with the columns `title`, `date`, `location`, `abstract`, `text`, and `url` --- ## rvest .pull-left[ - The **rvest** package makes basic processing and manipulation of HTML data straight forward - It's designed to work with pipelines built with `%>%` ```r library(rvest) ``` ] .pull-right[ <img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Read page for 26 Oct speech ```r url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/" speech_page <- read_html(url) ``` .pull-left[ ```r speech_page ``` ``` ## {html_document} ## <html dir="ltr" lang="en"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ... ## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ... ``` ] .pull-right[ <img src="img/fm-speech-oct-26.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Extract title .pull-left-wide[ <br><br> ```r title <- speech_page %>% html_node(".article-header__title") %>% html_text() title ``` ``` ## [1] "Coronavirus (COVID-19) update: First Minister's speech 26 October" ``` ] .pull-right-narrow[ <img src="img/title.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract date .pull-left-wide[ ```r library(lubridate) speech_page %>% html_node(".content-data__list:nth-child(1) strong") %>% html_text() ``` ``` ## [1] "26 Oct 2020" ``` ```r date <- speech_page %>% html_node(".content-data__list:nth-child(1) strong") %>% html_text() %>% dmy() date ``` ``` ## [1] "2020-10-26" ``` ] .pull-right-narrow[ <img src="img/date.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Similarly... extract location, abstract, and text --- ## Put it all in a data frame .pull-left[ ```r oct_26_speech <- tibble( title = title, date = date, location = location, abstract = abstract, text = text, url = url ) oct_26_speech ``` ``` ## # A tibble: 1 x 6 ## title date location abstract text url ## <chr> <date> <chr> <chr> <lis> <chr> ## 1 Coronaviru… 2020-10-26 St Andrew… Statement g… <chr… https://w… ``` ] .pull-right[ <img src="img/fm-speech-oct-26.png" width="75%" style="display: block; margin: auto;" /> ] --- ## Plan: Get data from all pages - Write a function that scrapes the data from a single page and returns a data frame with a single row for that page - Obtain a list of URLs of all pages - Map the function over the list of all URLs to obtain a data framw where each row is a single speech and the number of rows is the number of speeches ``` ## # A tibble: 207 x 6 ## title date location abstract text url ## <chr> <date> <chr> <chr> <chr> <chr> ## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presidi… https:/… ## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello. … https:/… ## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will … https:/… ## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presidi… https:/… ## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will … https:/… ## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood … https:/… ## # … with 201 more rows ``` --- ## Write a function .xsmall[ ```r scrape_speech_scot <- function(url){ speech_page <- read_html(url) title <- speech_page %>% html_node(".article-header__title") %>% html_text() date <- speech_page %>% html_node(".content-data__list:nth-child(1) strong") %>% html_text() %>% dmy() location <- speech_page %>% html_node(".content-data__list+ .content-data__list strong") %>% html_text() abstract <- speech_page %>% html_node(".leader--first-para p") %>% html_text() text <- speech_page %>% html_nodes("#preamble p") %>% html_text() %>% glue_collapse(sep = " ") %>% as.character() tibble( title = title, date = date, location = location, abstract = abstract, text = text, url = url ) } ``` ] --- ## Get a list of all URLs ```r all_speeches_page_scot <- read_html("https://www.gov.scot/collections/first-ministers-speeches/") covid_speech_urls_uk_scot <- all_speeches_page_scot %>% html_nodes(".collections-list a") %>% html_attr("href") %>% str_subset("covid-19") %>% str_c("https://www.gov.scot", .) ``` --- ## Map the function over all URLs ```r covid_speeches_scot <- map_dfr(covid_speech_urls_uk_scot, scrape_speech_scot) ``` ```r covid_speeches_scot ``` ``` ## # A tibble: 207 x 6 ## title date location abstract text url ## <chr> <date> <chr> <chr> <chr> <chr> ## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presidi… https:/… ## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello. … https:/… ## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will … https:/… ## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presidi… https:/… ## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will … https:/… ## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood … https:/… ## # … with 201 more rows ``` --- class: middle # Transform and visualise --- ## Filter for First minister speeches ```r covid_speeches_scot <- covid_speeches_scot %>% filter(str_detect(abstract, "First Minister")) covid_speeches_scot ``` ``` ## # A tibble: 204 x 6 ## title date location abstract text url ## <chr> <date> <chr> <chr> <chr> <chr> ## 1 Coronavi… 2021-03-09 Scottish … Statement g… "Presidi… https:/… ## 2 Coronavi… 2021-03-05 Scottish … Parliamenta… "Hello. … https:/… ## 3 Coronavi… 2021-03-04 Scottish … Parliamenta… "I will … https:/… ## 4 Coronavi… 2021-03-02 Scottish … Statement g… "Presidi… https:/… ## 5 Coronavi… 2021-02-25 Scottish … Statement g… "I will … https:/… ## 6 Coronavi… 2021-02-24 St Andrew… Statement g… "\nGood … https:/… ## # … with 198 more rows ``` --- ## Count number of words in each speech ```r covid_speeches_scot <- covid_speeches_scot %>% rowwise() %>% mutate(n_words = text %>% str_count("\\w+") %>% sum()) %>% ungroup() covid_speeches_scot ``` ``` ## # A tibble: 204 x 7 ## title date location abstract text url n_words ## <chr> <date> <chr> <chr> <chr> <chr> <int> ## 1 Coronav… 2021-03-09 Scottish… Statemen… "Presi… https:… 3355 ## 2 Coronav… 2021-03-05 Scottish… Parliame… "Hello… https:… 1199 ## 3 Coronav… 2021-03-04 Scottish… Parliame… "I wil… https:… 561 ## 4 Coronav… 2021-03-02 Scottish… Statemen… "Presi… https:… 2365 ## 5 Coronav… 2021-02-25 Scottish… Statemen… "I wil… https:… 700 ## 6 Coronav… 2021-02-24 St Andre… Statemen… "\nGoo… https:… 2466 ## # … with 198 more rows ``` --- ## Length of speech over time .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/words-over-time-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r ggplot(covid_speeches_scot, aes(x = date, y = n_words)) + geom_point(alpha = 0.7) + geom_smooth(aes(x = date, y = n_words), method = lm, formula = y ~ x) + labs( title = "Length of Scotland COVID-19 speeches", subtitle = "Measured in number of words" ) ``` ] ] --- ## Length of speech over time, again .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/words-over-time-better-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r # set color: https://www.schemecolor.com/flag-of-scotland-colors.php *scotblue <- "#0065BF" covid_speeches_scot %>% ggplot(aes(x = date, y = n_words)) + * geom_point(alpha = 0.7, color = scotblue) + * geom_smooth(aes(x = date, y = n_words), method = lm, formula = y ~ x, color = "darkgray") + labs( title = "Length of Scotland COVID-19 speeches over time", subtitle = "Measured in number of words", * x = NULL, * y = "Number of words" ) + * theme_minimal() ``` ] ] --- ## tidytext .pull-left[ - Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use - Learn more at [tidytextmining.com](https://www.tidytextmining.com/) ```r library(tidytext) ``` ] .pull-right[ <img src="img/tidytext.png" width="60%" style="display: block; margin: auto auto auto 0;" /> ] --- ## Tokenize speeches by word .panelset[ .panel[.panel-name[Plot] ``` ## # A tibble: 422,860 x 7 ## date word title location abstract url n_words ## <date> <chr> <chr> <chr> <chr> <chr> <int> ## 1 2021-03-09 presi… Coronav… Scottish… Statement… https:… 3355 ## 2 2021-03-09 offic… Coronav… Scottish… Statement… https:… 3355 ## 3 2021-03-09 i Coronav… Scottish… Statement… https:… 3355 ## 4 2021-03-09 will Coronav… Scottish… Statement… https:… 3355 ## 5 2021-03-09 update Coronav… Scottish… Statement… https:… 3355 ## 6 2021-03-09 parli… Coronav… Scottish… Statement… https:… 3355 ## # … with 422,854 more rows ``` ] .panel[.panel-name[R Code] ```r covid_speeches_scot_words <- covid_speeches_scot %>% # make sure COVID-19 (and all its various spellings) don't get split # tidytext doesn't remove underscores # https://stackoverflow.com/questions/58281091/preserve-hyphenated-words-in-ngrams-analysis-with-tidytext mutate( text = str_replace_all(text, "COVID-19", "COVID_19"), text = str_replace_all(text, "COVID 19", "COVID_19"), text = str_replace_all(text, "Covid-19", "COVID_19"), text = str_replace_all(text, "Covid 19", "COVID_19") ) %>% unnest_tokens(word, text) %>% relocate(date, word) covid_speeches_scot_words ``` ] ] --- ## Find common words .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/scot-common-words-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r covid_speeches_scot_words %>% anti_join(stop_words) %>% count(word, sort = TRUE) %>% slice_head(n = 15) %>% ggplot(aes(y = fct_reorder(word, n), x = n, fill = log(n))) + geom_col(show.legend = FALSE) + labs( title = "Commonly used words in Scotland COVID-19 briefings", y = NULL, x = "Frequency" ) + theme_minimal() ``` ] ] --- ## Social vs. physical distancing .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/social-physical-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r covid_speeches_scot %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% filter(str_detect(bigram, "social dist|physical dist")) %>% mutate(soc_phys = if_else(str_detect(bigram, "social"), "S", "P")) %>% count(date, soc_phys) %>% ggplot(aes(x = date, y = n, color = soc_phys)) + geom_text(aes(label = soc_phys)) + guides(color = FALSE) + labs(x = "Date", y = "Frequency", title = "Social (S) vs. physical (P) distancing", subtitle = "Number of mentions over time") + scale_color_manual(values = c(scotblue, "darkgray")) + scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) + theme_minimal() ``` ] ] --- ## Vaccines .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/vaccines-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r covid_speeches_scot_words %>% filter(str_detect(word, "[Vv]accin")) %>% count(date) %>% ggplot(aes(x = date, y = n)) + geom_text(aes(label = "💉", size = n), show.legend = FALSE) + labs(x = "Date", y = "Frequency", title = 'Number of times "vaccine" is mentioned in speech') + theme_minimal() ``` ] ] --- ## Compare to UK <img src="tidyverse-tidymodels_files/figure-html/unnamed-chunk-38-1.png" width="85%" style="display: block; margin: auto;" /> --- class: middle # Model --- ## Predicting UK vs. Scotland ```r covid_speeches %>% count(origin) ``` ``` ## # A tibble: 2 x 2 ## origin n ## * <fct> <int> ## 1 Scotland 207 ## 2 UK 40 ``` --- ## Tokenize into sentences ```r covid_speeches_sentences <- covid_speeches %>% unnest_tokens(sentence, text, token = "sentences") covid_speeches_sentences %>% relocate(sentence) ``` ``` ## # A tibble: 20,445 x 4 ## sentence speech_id date origin ## <chr> <chr> <date> <fct> ## 1 presiding officer i will update pa… 1 2021-03-09 Scotl… ## 2 these changes relate to outdoor me… 1 2021-03-09 Scotl… ## 3 i will also announce a change whic… 1 2021-03-09 Scotl… ## 4 now while the changes i set out to… 1 2021-03-09 Scotl… ## 5 and they do represent gradual, but… 1 2021-03-09 Scotl… ## 6 next week, i will set out a firmer… 1 2021-03-09 Scotl… ## # … with 20,439 more rows ``` --- ## Split into testing and training ```r set.seed(1234) covid_split <- initial_split(covid_speeches_sentences, strata = origin) covid_train <- training(covid_split) covid_test <- testing(covid_split) ``` --- ## Specify a model ```r lasso_mod <- logistic_reg(penalty = 0.005, mixture = 1) %>% set_engine("glmnet") ``` --- ## Build a recipe ```r library(textrecipes) covid_rec <- recipe(origin ~ sentence, data = covid_train) %>% # tokenize into words step_tokenize(sentence, token = "words") %>% # filter out stop words step_stopwords(sentence) %>% # all the 1-grams followed by all the 2-grams followed by all the 3-grams step_ngram(sentence, num_tokens = 3, min_num_tokens = 1) %>% # keep the 500 most frequent words to avoid creating too many variables step_tokenfilter(sentence, max_tokens = 500) %>% # calculate tf-idf step_tfidf(sentence) ``` --- ## Build a workflow ```r covid_wflow <- workflow() %>% add_model(lasso_mod) %>% add_recipe(covid_rec) ``` --- ## Perform cross-validation ```r set.seed(1234) covid_folds <- vfold_cv(covid_train, v = 10, strata = origin) ``` ```r covid_folds ``` ``` ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [13800/1534]> Fold01 ## 2 <split [13800/1534]> Fold02 ## 3 <split [13800/1534]> Fold03 ## 4 <split [13800/1534]> Fold04 ## 5 <split [13801/1533]> Fold05 ## 6 <split [13801/1533]> Fold06 ## # … with 4 more rows ``` --- ## Fit resamples ```r covid_fit_rs <- covid_wflow %>% fit_resamples( covid_folds, control = control_resamples(save_pred = TRUE) ) ``` ```r covid_train_metrics <- collect_metrics(covid_fit_rs) covid_train_metrics ``` ``` ## # A tibble: 2 x 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.918 10 0.00234 Preprocessor1_Model1 ## 2 roc_auc binary 0.784 10 0.00558 Preprocessor1_Model1 ``` --- ## ROC curve .small[ ```r covid_train_pred <- collect_predictions(covid_fit_rs) covid_train_pred %>% group_by(id) %>% roc_curve(truth = origin, .pred_Scotland) %>% autoplot() + labs( title = "ROC curve for Scotland & UK COVID speeches", subtitle = "Each resample fold is shown in a different color" ) ``` <img src="tidyverse-tidymodels_files/figure-html/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Make predictions for test data ```r covid_fit <- covid_wflow %>% fit(data = covid_train) covid_test_pred <- predict(covid_fit, new_data = covid_test, type = "prob") %>% bind_cols(covid_test %>% select(origin, speech_id, sentence)) covid_test_pred ``` ``` ## # A tibble: 5,111 x 5 ## .pred_Scotland .pred_UK origin speech_id sentence ## <dbl> <dbl> <fct> <chr> <chr> ## 1 0.927 0.0726 Scotla… 1 i will though be spe… ## 2 0.961 0.0387 Scotla… 1 and the total number… ## 3 0.905 0.0950 Scotla… 1 and then in two week… ## 4 0.884 0.116 Scotla… 1 current plans for 23… ## 5 0.940 0.0599 Scotla… 1 in addition, i know … ## 6 0.940 0.0599 Scotla… 1 and so i think it is… ## # … with 5,105 more rows ``` --- ## ROC curve ```r covid_test_pred %>% roc_curve(truth = origin, .pred_Scotland) %>% autoplot() ``` <img src="tidyverse-tidymodels_files/figure-html/unnamed-chunk-54-1.png" width="60%" style="display: block; margin: auto;" /> --- ## View predictions ```r covid_test_pred %>% roc_auc(truth = origin, .pred_Scotland) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.762 ``` ```r covid_test_pred %>% filter(origin == "Scotland", .pred_UK > 0.5) ``` ``` ## # A tibble: 14 x 5 ## .pred_Scotland .pred_UK origin speech_id sentence ## <dbl> <dbl> <fct> <chr> <chr> ## 1 0.351 0.649 Scotla… 19 we must guard agains… ## 2 0.472 0.528 Scotla… 37 and if that happens,… ## 3 0.346 0.654 Scotla… 51 we are seeing simila… ## 4 0.354 0.646 Scotla… 58 we must drive the in… ## 5 0.408 0.592 Scotla… 68 the chancellors’ com… ## 6 0.470 0.530 Scotla… 105 i will announce furt… ## # … with 8 more rows ``` --- ## What decisions did we make? without thinking about it much... - `step_tokenfilter(sentence, max_tokens = 500)` -- why 500 for max_tokens? - `logistic_reg(penalty = 0.005, mixture = 1)` -- why 0.005 for penalty? --- ## Tuning ### Specify model ```r ## with penalty ?? lasso_mod_tune <- logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet") %>% set_mode("classification") ``` --- ## Tuning ### Build recipe ```r covid_rec_tune <- recipe(origin ~ sentence, data = covid_train) %>% step_tokenize(sentence, token = "words") %>% step_stopwords(sentence) %>% step_ngram(sentence, num_tokens = 3, min_num_tokens = 1) %>% # keep the ?? most frequent words to avoid creating too many variables step_tokenfilter(sentence, max_tokens = tune(), min_times = 5) %>% step_tfidf(sentence) ``` --- ## Tuning ### Build workflow ```r covid_wflow_tune <- workflow() %>% add_model(lasso_mod_tune) %>% add_recipe(covid_rec_tune) ``` --- ## Tuning ### Possible set of hyperparameters ```r param_grid <- grid_regular( penalty(range = c(-4, 0)), max_tokens(range = c(500, 2000)), levels = 5 ) param_grid ``` ``` ## # A tibble: 25 x 2 ## penalty max_tokens ## <dbl> <int> ## 1 0.0001 500 ## 2 0.001 500 ## 3 0.01 500 ## 4 0.1 500 ## 5 1 500 ## 6 0.0001 875 ## # … with 19 more rows ``` --- ## Tuning ### Train models with all possible set of hyperparameters ```r set.seed(24) covid_fit_rs_tune <- tune_grid( covid_wflow_tune, resamples = covid_folds, grid = param_grid, control = control_grid(save_pred = TRUE) ) ``` --- ## Tuning ### Results ```r covid_fit_rs_tune ``` ``` ## Warning: This tuning result has notes. Example notes on model fitting include: ## preprocessor 2/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with' ## preprocessor 1/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with' ## preprocessor 5/5, model 1/1 (predictions): partial argument match of 'along' to 'along.with' ``` ``` ## # Tuning results ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [13800… Fold01 <tibble [50… <tibble [5… <tibble [38,350… ## 2 <split [13800… Fold02 <tibble [50… <tibble [5… <tibble [38,350… ## 3 <split [13800… Fold03 <tibble [50… <tibble [5… <tibble [38,350… ## 4 <split [13800… Fold04 <tibble [50… <tibble [5… <tibble [38,350… ## 5 <split [13801… Fold05 <tibble [50… <tibble [5… <tibble [38,325… ## 6 <split [13801… Fold06 <tibble [50… <tibble [5… <tibble [38,325… ## # … with 4 more rows ``` --- ## Tuning ### Select best ```r best_roc_auc <- select_best(covid_fit_rs_tune, "roc_auc") best_roc_auc ``` ``` ## # A tibble: 1 x 3 ## penalty max_tokens .config ## <dbl> <int> <chr> ## 1 0.001 2000 Preprocessor5_Model2 ``` ```r covid_wflow_final <- finalize_workflow(covid_wflow_tune, best_roc_auc) ``` --- ## Variable importance ```r library(vip) vi_data <- covid_wflow_final %>% fit(covid_train) %>% pull_workflow_fit() %>% vi(lambda = best_roc_auc$penalty) %>% mutate(Variable = str_remove_all(Variable, "tfidf_sentence_")) %>% filter(Importance != 0) ``` ```r vi_data ``` ``` ## # A tibble: 1,088 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 covid_secure 14.9 POS ## 2 scotland 8.29 NEG ## 3 coronavirus 7.34 POS ## 4 british 7.29 POS ## 5 r 6.59 POS ## 6 r_number 6.55 NEG ## # … with 1,082 more rows ``` --- ## Variable importance, visualised .panelset[ .panel[.panel-name[Plot] <img src="tidyverse-tidymodels_files/figure-html/vip-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R Code] ```r vi_data %>% mutate(Importance = abs(Importance)) %>% filter(Importance != 0) %>% group_by(Sign) %>% slice_head(n = 20) %>% ungroup() %>% mutate(pred_origin = if_else(Sign == "POS", "UK", "Scotland")) %>% ggplot(aes(x = Importance, y = fct_reorder(Variable, Importance), fill = pred_origin )) + geom_col(show.legend = FALSE) + scale_x_continuous(expand = c(0, 0)) + scale_fill_manual(values = c(scotblue, ukred)) + facet_wrap(~pred_origin, scales = "free") + labs(y = NULL) + theme_minimal() ``` ] ] --- ## Prediction .small[ ```r scot_sentence$sentence ``` ``` ## [1] "all local authorities will - at least until easter - continue to observe the current requirement for 2m physical distancing in secondary schools." ``` ```r scot_sentence %>% tidytext::unnest_tokens(words, sentence) %>% left_join(vi_data, by = c("words" = "Variable")) %>% mutate(pred_origin = if_else(Sign == "NEG", "Scotland", "UK")) %>% print(n = 25) ``` ``` ## # A tibble: 21 x 7 ## speech_id date origin words Importance Sign pred_origin ## <chr> <date> <fct> <chr> <dbl> <chr> <chr> ## 1 4 2021-03-02 Scotl… all NA <NA> <NA> ## 2 4 2021-03-02 Scotl… local 4.21 POS UK ## 3 4 2021-03-02 Scotl… auth… 0.453 NEG Scotland ## 4 4 2021-03-02 Scotl… will NA <NA> <NA> ## 5 4 2021-03-02 Scotl… at NA <NA> <NA> ## 6 4 2021-03-02 Scotl… least 1.34 POS UK ## 7 4 2021-03-02 Scotl… until NA <NA> <NA> ## 8 4 2021-03-02 Scotl… east… 0.0537 NEG Scotland ## 9 4 2021-03-02 Scotl… cont… NA <NA> <NA> ## 10 4 2021-03-02 Scotl… to NA <NA> <NA> ## 11 4 2021-03-02 Scotl… obse… NA <NA> <NA> ## 12 4 2021-03-02 Scotl… the NA <NA> <NA> ## 13 4 2021-03-02 Scotl… curr… NA <NA> <NA> ## 14 4 2021-03-02 Scotl… requ… NA <NA> <NA> ## 15 4 2021-03-02 Scotl… for NA <NA> <NA> ## 16 4 2021-03-02 Scotl… 2m NA <NA> <NA> ## 17 4 2021-03-02 Scotl… phys… 0.478 NEG Scotland ## 18 4 2021-03-02 Scotl… dist… NA <NA> <NA> ## 19 4 2021-03-02 Scotl… in NA <NA> <NA> ## 20 4 2021-03-02 Scotl… seco… 1.46 POS UK ## 21 4 2021-03-02 Scotl… scho… 1.51 POS UK ``` ] --- ## Prediction .small[ ```r uk_sentence$sentence ``` ``` ## [1] "we have now vaccinated 1.26 million people in england, 113,000 in scotland, 49,000 in wales, and 46,000 in northern ireland." ``` ```r uk_sentence %>% tidytext::unnest_tokens(words, sentence) %>% left_join(vi_data, by = c("words" = "Variable")) %>% mutate(pred_origin = if_else(Sign == "NEG", "Scotland", "UK")) %>% print(n = 25) ``` ``` ## # A tibble: 20 x 7 ## speech_id date origin words Importance Sign pred_origin ## <chr> <date> <fct> <chr> <dbl> <chr> <chr> ## 1 8 2021-01-07 UK we NA <NA> <NA> ## 2 8 2021-01-07 UK have NA <NA> <NA> ## 3 8 2021-01-07 UK now 0.898 POS UK ## 4 8 2021-01-07 UK vacc… 0.767 POS UK ## 5 8 2021-01-07 UK 1.26 NA <NA> <NA> ## 6 8 2021-01-07 UK mill… 1.72 POS UK ## 7 8 2021-01-07 UK peop… 0.274 POS UK ## 8 8 2021-01-07 UK in NA <NA> <NA> ## 9 8 2021-01-07 UK engl… 3.96 POS UK ## 10 8 2021-01-07 UK 113,… NA <NA> <NA> ## 11 8 2021-01-07 UK in NA <NA> <NA> ## 12 8 2021-01-07 UK scot… 8.29 NEG Scotland ## 13 8 2021-01-07 UK 49,0… NA <NA> <NA> ## 14 8 2021-01-07 UK in NA <NA> <NA> ## 15 8 2021-01-07 UK wales NA <NA> <NA> ## 16 8 2021-01-07 UK and NA <NA> <NA> ## 17 8 2021-01-07 UK 46,0… NA <NA> <NA> ## 18 8 2021-01-07 UK in NA <NA> <NA> ## 19 8 2021-01-07 UK nort… 5.37 POS UK ## 20 8 2021-01-07 UK irel… 0.202 NEG Scotland ``` ] --- ## Acknowledgements & learn more - Read the full case study, with code: https://github.com/mine-cetinkaya-rundel/fm-speeches-covid19 - Learn tidyverse: https://www.tidyverse.org/learn - Learn tidymodels: https://www.tidymodels.org/learn - Much of this was inspired by Julia Silge and Emil Hvitfeldt's useR tutorial: https://emilhvitfeldt.github.io/useR2020-text-modeling-tutorial