class: center, middle, inverse, title-slide # Data tidying and reshaping
🧹 ### --- layout: true <div class="my-footer"> <span> <a href="http://bit.ly/bootcamp-nuigalway" target="_blank">bit.ly/bootcamp-nuigalway</a> </span> </div> --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ The goal of tidyr is to help you create tidy data ] --- class: middle # Instructional staff employment trends --- The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below. <img src="img/staff-employment.png" width="70%" style="display: block; margin: auto;" /> --- ## Data Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are percentage of hires of that type of faculty for each year. ```r staff <- read_csv("data/instructional-staff.csv") staff ``` ``` ## # A tibble: 5 x 12 ## faculty_type `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007` `2009` `2011` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Full-Time Tenured Fa… 29 27.6 25 24.8 21.8 20.3 19.3 17.8 17.2 16.8 16.7 ## 2 Full-Time Tenure-Tra… 16.1 11.4 10.2 9.6 8.9 9.2 8.8 8.2 8 7.6 7.4 ## 3 Full-Time Non-Tenure… 10.3 14.1 13.6 13.6 15.2 15.5 15 14.8 14.9 15.1 15.4 ## 4 Part-Time Faculty 24 30.4 33.1 33.2 35.5 36 37 39.3 40.5 41.1 41.3 ## 5 Graduate Student Emp… 20.5 16.5 18.1 18.8 18.7 19 20 19.9 19.5 19.4 19.3 ``` --- ## Recreate the visualization In order to recreate this visualization we need to first reshape the data to have one variable for faculty type and one variable for year. In other words, we will convert the data from the long format to wide format. But before we do so... .discussion[ If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have? ] --- class: middle <img src="img/pivot.gif" width="80%" style="display: block; margin: auto;" /> --- ## `pivot_*()` functions <img src="img/tidyr-longer-wider.gif" width="50%" style="display: block; margin: auto;" /> --- ## `pivot_longer()` ```r pivot_longer(data, cols, names_to = "name", values_to = "value") ``` - The first argument is `data` as usual. -- - The second argument, `cols`, is where you specify which columns to pivot into longer format -- in this case all columns except for the `faculty_type` -- - The third argument, `names_to`, is a string specifying the name of the column to create from the data stored in the column names of data -- in this case `year` -- - The fourth argument, `values_to`, is a string specifying the name of the column to create from the data stored in cell values, in this case `percentage` --- ## Pivot staff data ```r staff %>% pivot_longer( cols = -faculty_type, names_to = "year", values_to = "percentage" ) ``` ``` ## # A tibble: 55 x 3 ## faculty_type year percentage ## <chr> <chr> <dbl> ## 1 Full-Time Tenured Faculty 1975 29 ## 2 Full-Time Tenured Faculty 1989 27.6 ## 3 Full-Time Tenured Faculty 1993 25 ## 4 Full-Time Tenured Faculty 1995 24.8 ## 5 Full-Time Tenured Faculty 1999 21.8 ## 6 Full-Time Tenured Faculty 2001 20.3 ## # … with 49 more rows ``` --- ## Pivot staff data, and save result ```r staff_long <- staff %>% pivot_longer( cols = -faculty_type, names_to = "year", values_to = "percentage" ) staff_long ``` ``` ## # A tibble: 55 x 3 ## faculty_type year percentage ## <chr> <chr> <dbl> ## 1 Full-Time Tenured Faculty 1975 29 ## 2 Full-Time Tenured Faculty 1989 27.6 ## 3 Full-Time Tenured Faculty 1993 25 ## 4 Full-Time Tenured Faculty 1995 24.8 ## 5 Full-Time Tenured Faculty 1999 21.8 ## 6 Full-Time Tenured Faculty 2001 20.3 ## # … with 49 more rows ``` --- .discussion[ This doesn't look quite right, how would you fix it? ] ```r staff_long %>% ggplot(aes(x = percentage, y = year, color = faculty_type)) + geom_col(position = "dodge") ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-8-1.png" width="70%" style="display: block; margin: auto;" /> --- ```r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col(position = "dodge") ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Some improvement... ```r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /> --- ## More improvement ```r staff_long %>% ggplot(aes(x = year, y = percentage, group = faculty_type, color = faculty_type)) + geom_line() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- .discussion[ What is the difference between these two plots? ] <img src="05-tidy-data_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Make year numeric again! ```r staff_long <- staff_long %>% mutate(year = as.numeric(year)) staff_long %>% ggplot(aes(x = year, y = percentage, group = faculty_type, color = faculty_type)) + geom_line() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" /> --- .discussion[ How would you go about creating the following plot? ] <img src="05-tidy-data_files/figure-html/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" /> --- ```r staff_long %>% * mutate(part_time = if_else(faculty_type == "Part-Time Faculty", * "Part-Time Faculty", * "Other Faculty")) %>% ggplot(aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) + geom_line() + scale_color_manual(values = c("gray", "red")) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Proportion", color = "" ) + theme(legend.position = "bottom") ``` --- ```r staff_long %>% mutate(part_time = if_else(faculty_type == "Part-Time Faculty", "Part-Time Faculty", "Other Faculty")) %>% ggplot(aes(x = year, y = percentage/100, * group = faculty_type, * color = part_time)) + geom_line() + * scale_color_manual(values = c("gray", "red")) + * theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Proportion", color = "" ) + theme(legend.position = "bottom") ``` --- <img src="05-tidy-data_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" /> --- ```r *library(scales) staff_long %>% mutate(part_time = if_else(faculty_type == "Part-Time Faculty", "Part-Time Faculty", "Other Faculty")) %>% ggplot(aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) + geom_line() + scale_color_manual(values = c("gray", "red")) + * scale_y_continuous(labels = percent) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", * y = "Percentage", color = "" ) + theme(legend.position = "bottom") ``` --- <img src="05-tidy-data_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" /> --- <img src="05-tidy-data_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" /> --- ```r library(scales) staff_long %>% mutate(part_time = if_else(faculty_type == "Part-Time Faculty", "Part-Time Faculty", "Other Faculty")) %>% ggplot(aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) + geom_line() + scale_color_manual(values = c("gray", "red")) + * scale_y_continuous(labels = percent_format(accuracy = 1)) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Percentage", color = "" ) + theme(legend.position = "bottom") ``` --- class: middle # Other common tidying moves --- <img src="img/relig-income.png" width="85%" style="display: block; margin: auto;" /> .xsmall[ Source: [pewforum.org/religious-landscape-study/income-distribution](https://www.pewforum.org/religious-landscape-study/income-distribution/), Retrieved 14 April, 2020 ] --- ## Read data ```r library(readxl) rel_inc <- read_excel("data/relig-income.xlsx") # directly from Excel! ``` .small[ ``` ## # A tibble: 12 x 6 ## `Religious trad… `Less than $30,… `$30,000-$49,99… `$50,000-$99,99… `$100,000 or mo… `Sample Size` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Buddhist 0.36 0.18 0.32 0.13 233 ## 2 Catholic 0.36 0.19 0.26 0.19 6137 ## 3 Evangelical Pro… 0.35 0.22 0.28 0.14 7462 ## 4 Hindu 0.17 0.13 0.34 0.36 172 ## 5 Historically Bl… 0.53 0.22 0.17 0.08 1704 ## 6 Jehovah's Witne… 0.48 0.25 0.22 0.04 208 ## # … with 6 more rows ``` ] --- ## Rename columns .midi[ ```r rel_inc %>% rename( religion = `Religious tradition`, n = `Sample Size` ) ``` ``` ## # A tibble: 12 x 6 ## religion `Less than $30,00… `$30,000-$49,999` `$50,000-$99,99… `$100,000 or mor… n ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Buddhist 0.36 0.18 0.32 0.13 233 ## 2 Catholic 0.36 0.19 0.26 0.19 6137 ## 3 Evangelical Protest… 0.35 0.22 0.28 0.14 7462 ## 4 Hindu 0.17 0.13 0.34 0.36 172 ## 5 Historically Black … 0.53 0.22 0.17 0.08 1704 ## 6 Jehovah's Witness 0.48 0.25 0.22 0.04 208 ## # … with 6 more rows ``` ] -- .discussion[ If we want a new variable called `income` with levels such as "Less than $30,000", "$30,000-$49,999", ... etc. which function should we use? ] --- ## Pivot longer .midi[ ```r rel_inc %>% rename( religion = `Religious tradition`, n = `Sample Size` ) %>% pivot_longer( cols = -c(religion, n), # all but religion and n names_to = "income", values_to = "proportion" ) ``` ``` ## # A tibble: 48 x 4 ## religion n income proportion ## <chr> <dbl> <chr> <dbl> ## 1 Buddhist 233 Less than $30,000 0.36 ## 2 Buddhist 233 $30,000-$49,999 0.18 ## 3 Buddhist 233 $50,000-$99,999 0.32 ## 4 Buddhist 233 $100,000 or more 0.13 ## 5 Catholic 6137 Less than $30,000 0.36 ## 6 Catholic 6137 $30,000-$49,999 0.19 ## # … with 42 more rows ``` ] --- ## Calculate frequencies .midi[ ```r rel_inc %>% rename( religion = `Religious tradition`, n = `Sample Size` ) %>% pivot_longer( cols = -c(religion, n), names_to = "income", values_to = "proportion" ) %>% mutate(frequency = round(proportion * n)) ``` ``` ## # A tibble: 48 x 5 ## religion n income proportion frequency ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 Buddhist 233 Less than $30,000 0.36 84 ## 2 Buddhist 233 $30,000-$49,999 0.18 42 ## 3 Buddhist 233 $50,000-$99,999 0.32 75 ## 4 Buddhist 233 $100,000 or more 0.13 30 ## 5 Catholic 6137 Less than $30,000 0.36 2209 ## 6 Catholic 6137 $30,000-$49,999 0.19 1166 ## # … with 42 more rows ``` ] --- ## Save data ```r rel_inc_long <- rel_inc %>% rename( religion = `Religious tradition`, n = `Sample Size` ) %>% pivot_longer( cols = -c(religion, n), names_to = "income", values_to = "proportion" ) %>% mutate(frequency = round(proportion * n)) ``` --- ## Start plotting ```r ggplot(rel_inc_long, aes(y = religion, x = frequency)) + geom_col() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Reverse religion order ```r *ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency)) + geom_col() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-30-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Add income ```r *ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + geom_col() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Dodge bars .small[ ```r ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + * geom_col(position = "dodge") ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-32-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Change theme .small[ ```r ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + geom_col(position = "dodge") + * theme_minimal() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-33-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Move legend to the bottom .small[ ```r ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + geom_col(position = "dodge") + theme_minimal() + * theme(legend.position = "bottom") ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-34-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Prettier colors .small[ ```r ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + geom_col(position = "dodge") + theme_minimal() + theme(legend.position = "bottom") + * scale_fill_viridis_d() ``` <img src="05-tidy-data_files/figure-html/unnamed-chunk-35-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Fix labels ```r ggplot(rel_inc_long, aes(y = fct_rev(religion), x = frequency, fill = income)) + geom_col(position = "dodge") + theme_minimal() + theme(legend.position = "bottom") + scale_fill_viridis_d() + * labs( * x = "Frequency", y = "", * title = "Income distribution by religious group", * caption = "Source: pewforum.org/religious-landscape-study/income-distribution" * ) ``` --- <img src="05-tidy-data_files/figure-html/unnamed-chunk-37-1.png" width="80%" style="display: block; margin: auto;" />