# Tidy up your data science workflow with the tidyverse
## bit.ly/tidy-up-usfca

---

# tidyverse

---

## tidyverse

.pull-left[
- The **tidyverse** is an opinionated collection of R packages designed for data science. 
- All packages share an underlying design philosophy, grammar, and data structures.
<br>
<img src="img/tidyverse.png" width="50%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/tidyverse-packages.png" width="100%" />
]

---

## Tidy data

```r
knitr::include_graphics("img/tidy-data.png")
```

1. Each variable must have its own column.
1. Each observation must have its own row.
1. Each value must have its own cell.
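
As a minimal sketch of what these rules mean in practice, the made-up table below stores one column per year, so the year variable is hidden in the column names. Assuming the tidyr package, `pivot_longer()` reshapes it so that each country-year observation gets its own row. The data and the column names (`country`, `year`, `harvest`) are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the "year" variable is spread across column names
untidy <- tibble(
  country = c("A", "B"),
  `2017`  = c(10, 20),
  `2018`  = c(12, 25)
)

# Gather the year columns into one "year" column and one "harvest" column
tidy <- pivot_longer(
  untidy,
  cols      = c(`2017`, `2018`),
  names_to  = "year",
  values_to = "harvest"
)

tidy
```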

---

## Pipe operator

> I want to find my keys, then start my car, then drive to work, then park my car.

- Nested

```r
park(drive(start_car(find("keys")), to = "work"))
```

- **Piped**

```r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```
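
The keys example is illustrative only, since `find()`, `start_car()`, `drive()`, and `park()` are not real functions. A runnable sketch of the same idea, assuming dplyr and a small made-up `harvest` table, might look like this:

```r
library(dplyr)

# Hypothetical data, made up for illustration
harvest <- tibble(
  country     = c("A", "B", "C"),
  capture     = c(5, 50, 500),
  aquaculture = c(1, 10, 100)
)

# Nested: read from the inside out
nested <- arrange(
  filter(mutate(harvest, total = capture + aquaculture), total > 10),
  desc(total)
)

# Piped: read from top to bottom
piped <- harvest %>%
  mutate(total = capture + aquaculture) %>%
  filter(total > 10) %>%
  arrange(desc(total))
```

Both versions return the same rows; the piped version lists the steps in the order they happen.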

---

# Fisheries of the world

---

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on countries' fisheries production. The (not-so-great) visualization below shows the distribution of each country's fishery harvest for 2018, by capture and aquaculture.

<br>

.pull-left[
<img src="img/fisheries-data.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- Countries whose total harvest was less than 100,000 tons are not 
included in the visualization.
- Source: [Fishing industry by country](https://en.wikipedia.org/wiki/Fishing_industry_by_country)
]

---

## Get the data

```r
names(fisheries)
```

```
## [1] "country"          "capture"          "aquaculture"      "total"            "continent"       
## [6] "aquaculture_perc"
```

---

## Inspect the data

```r
fisheries
```

```
## # A tibble: 68 x 6
##    country                           capture aquaculture   total continent aquaculture_perc
##    <chr>                               <dbl>       <dbl>   <dbl> <chr>                <dbl>
##  1 Algeria                            126259         368  126627 Africa             0.00291
##  2 Argentina                          931472        2430  933902 Americas           0.00260
##  3 Australia                          245935       47087  293022 Oceania            0.161  
##  4 Bangladesh                        1333866      882091 2215957 Asia               0.398  
##  5 Brazil                             750283      257783 1008066 Americas           0.256  
##  6 Cambodia                           384000       26000  410000 Asia               0.0634 
##  7 Canada                            1080982      154083 1235065 Americas           0.125  
##  8 Chile                             4330325      698214 5028539 Americas           0.139  
##  9 Colombia                           121000       60072  181072 Americas           0.332  
## 10 Congo, Democratic Republic of the  220000        2965  222965 Africa             0.0133 
## # … with 58 more rows
```

---

## Data prep

Keep only countries whose total harvest exceeded 100,000 tons, since smaller 
producers are not included in the visualization:

.pull-left[

```r
# Recompute the total, then keep countries above 100,000 tons
fisheries <- fisheries %>%
  mutate(total = capture + aquaculture) %>%
  filter(total > 100000)
```
]
.pull-right[

```r
fisheries
```
]

---

## Load continent data

```r
continents <- read_csv("data/continents.csv")
```

---

# Data joins

---

.pull-left[

```r
fisheries %>% select(country)
```

```
## # A tibble: 68 x 1
##    country                          
##    <chr>                            
##  1 Algeria                          
##  2 Argentina                        
##  3 Australia                        
##  4 Bangladesh                       
##  5 Brazil                           
##  6 Cambodia                         
##  7 Canada                           
##  8 Chile                            
##  9 Colombia                         
## 10 Congo, Democratic Republic of the
## # … with 58 more rows
```
]
.pull-right[

```r
continents
```

```
## # A tibble: 245 x 2
##    country           continent
##    <chr>             <chr>    
##  1 Afghanistan       Asia     
##  2 Åland Islands     Europe   
##  3 Albania           Europe   
##  4 Algeria           Africa   
##  5 American Samoa    Oceania  
##  6 Andorra           Europe   
##  7 Angola            Africa   
##  8 Anguilla          Americas 
##  9 Antigua & Barbuda Americas 
## 10 Argentina         Americas 
## # … with 235 more rows
```
]

---

## Joining data frames

```
something_join(x, y)
```

- `inner_join()`: all rows from x that have matching values in y; if a row has 
multiple matches, all combinations of the matches are returned
- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `anti_join()`: all rows from x that have no matching values in y; rows of x are never duplicated
- ...
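
To see how the row sets differ, here is a small sketch with two made-up tables, `people` and `visits`, that share a `name` column. The tables and their columns are hypothetical; the join functions and the `by` argument come from dplyr.

```r
library(dplyr)

# Hypothetical tables, made up for illustration
people <- tibble(
  name = c("Ana", "Ben", "Cy"),
  city = c("Lima", "Oslo", "Rome")
)
visits <- tibble(
  name = c("Ana", "Ana", "Dee"),
  year = c(2019, 2020, 2020)
)

inner <- inner_join(people, visits, by = "name")  # Ana twice: one row per match
left  <- left_join(people, visits, by = "name")   # also keeps Ben and Cy (year is NA)
right <- right_join(people, visits, by = "name")  # also keeps Dee (city is NA)
full  <- full_join(people, visits, by = "name")   # keeps everyone
anti  <- anti_join(people, visits, by = "name")   # Ben and Cy only
```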
 
---

## Setup

For the next few slides...

.pull-left[

```r
x
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     3
```
]
.pull-right[

```r
y
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     4
```
]
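
The printed output does not show how `x` and `y` were created. One way to recreate them, assuming each is a one-column tibble of doubles as printed above:

```r
library(dplyr)

# Recreate the two small tibbles shown on this slide
x <- tibble(value = c(1, 2, 3))
y <- tibble(value = c(1, 2, 4))
```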

---

## `inner_join()`

```r
inner_join(x, y)
```