library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.2 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.1
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
6 Data tidying
Prerequisites
6.2.1 Exercises
In each of
table1
,table2
, andtable3
, each observation represents a country. Intable1
,country
is the country name,year
is the year of data collection,cases
is the number of people with the disease in that year, andpopulation
is the number of people in each country in that year. Intable2
,country
andyear
are the same as intable1
,type
is the type of number, andcount
is the number of observations (eithercases
orpopulation
depending ontype
). Finally, intable3
,country
andyear
are again the same as intable1
, andrate
is the rate of disease (cases divided by population).-
For
table2
, we need to reshape the data have a column for cases and a column for population and then divide the two to calculate the rate. A possible approach is shown below.table2 |> pivot_wider( names_from = type, values_from = count ) |> mutate(rate = cases / population * 10000) #> # A tibble: 6 × 5 #> country year cases population rate #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 0.373 #> 2 Afghanistan 2000 2666 20595360 1.29 #> 3 Brazil 1999 37737 172006362 2.19 #> 4 Brazil 2000 80488 174504898 4.61 #> 5 China 1999 212258 1272915272 1.67 #> 6 China 2000 213766 1280428583 1.67
For
table3
, we need to separate cases and population into their own columns and then divide them. A possible approach is shown below.table3 |> separate_wider_delim( cols = rate, delim = "/", names = c("cases", "population"), ) |> mutate( cases = as.numeric(cases), population = as.numeric(population), rate = cases / population * 10000 ) #> # A tibble: 6 × 5 #> country year cases population rate #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 0.373 #> 2 Afghanistan 2000 2666 20595360 1.29 #> 3 Brazil 1999 37737 172006362 2.19 #> 4 Brazil 2000 80488 174504898 4.61 #> 5 China 1999 212258 1272915272 1.67 #> 6 China 2000 213766 1280428583 1.67