class: center, middle, inverse, title-slide

# Visualize data
📊

---

layout: true

<div class="my-footer">
<span>
<a href="http://bit.ly/bootcamp-nuigalway" target="_blank">bit.ly/bootcamp-nuigalway</a>
</span>
</div>

---

class: middle

# Exploratory data analysis

---

## What is EDA?

- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics.
- Often, this is visual. That's what we're focusing on today.
- But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis. That's what we'll focus on next.

---

class: middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey*

- Data visualization is the creation and study of the visual representation of data.
- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's what we're going to use).

---

## ggplot2 `\(\in\)` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
- **ggplot2** is tidyverse's data visualization package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- It is inspired by the book **The Grammar of Graphics** by Leland Wilkinson
]

---

## Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

<img src="img/grammar-of-graphics.png" width="70%" style="display: block; margin: auto;" />

.footnote[
Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)
]

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-4-1.png" width="70%" style="display: block; margin: auto;" />

---

.discussion[
- What are the functions doing the plotting?
- What is the dataset being plotted?
- Which variable is on the x-axis and which variable is on the y-axis?
- What does the warning mean?
]

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(
    title = "Mass vs. height of Starwars characters",
    x = "Height (cm)", y = "Weight (kg)"
  )
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

---

.discussion[
What does `geom_smooth()` do?
]

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
* geom_smooth() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />

---

## Hello ggplot2!
- `ggplot()` is the main function in ggplot2
- Plots are constructed in layers
- Structure of the code for plots can be summarized as

```r
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```

- To use ggplot2 functions, first load tidyverse

```r
library(tidyverse)
```

- For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)

---

class: middle

# Visualizing Star Wars

---

## Dataset terminology

- Each row is an **observation**
- Each column is a **variable**

.small[

```r
starwars
```

```
## # A tibble: 87 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis>
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu… Tatooine  Human   <chr…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu… Tatooine  Droid   <chr…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu… Naboo     Droid   <chr…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu… Tatooine  Human   <chr…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin… Alderaan  Human   <chr…
## 6 Owen…    178   120 brown, gr… light      blue            52   male  mascu… Tatooine  Human   <chr…
## # … with 81 more rows, and 2 more variables: vehicles <list>, starships <list>
```

]

---

## Luke Skywalker



---

## What's in the Star Wars data?
Take a `glimpse` at the data:

```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", "Owen Lars",…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228, 180, 173, 175, …
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, 84.0, NA, 112.0,…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", NA, "black", "auburn…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "light", "white, red…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue", "red", "brown", "b…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, 41.9, 64.0, 200.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female", "none", "male", "m…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "feminine", "masculine", "…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "Tatooine", "Tatooine…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human", "Droid", "Human…
## $ films      <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Return of the Jedi", "A …
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imperial Speeder Bike…
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1", <>, <>, <>, <>, "…
```

---

## What's in the Star Wars data?

.discussion[
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
]

```r
?starwars
```

<img src="img/starwars-help.png" width="100%" style="display: block; margin: auto;" />

---

## Mass vs. height

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" />

---

## What's that warning?

- Not all characters have height and mass information (hence 28 of them not plotted)

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

- Going forward I'll suppress the warning to save room on slides, but it's important to note it

---

## Labels

.small[

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
* labs(title = "Mass vs. height of Starwars characters",
*      x = "Height (cm)", y = "Weight (kg)")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-14-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Mass vs. height

.discussion[
How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend? Who is the not so tall but really chubby character?
]

.small[
<img src="02-visualize-data_files/figure-html/unnamed-chunk-15-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Jabba!

<img src="img/jabbaplot.png" width="100%" style="display: block; margin: auto;" />

---

## Additional variables

We can map additional variables to various features of the plot:

- aesthetics
    - shape
    - colour
    - size
    - alpha (transparency)
- faceting: small multiples displaying different subsets

---

class: middle

# Aesthetics

---

## Aesthetics options

Visual characteristics of plotting characters that can be **mapped to a specific variable** in the data are

- `color`
- `size`
- `shape`
- `alpha` (transparency)

---

## Mass vs. height + gender

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-17-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mass vs. height + gender

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender,
*                                     size = birth_year)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mass vs. height + gender

Let's now increase the size of all points **not** based on the values of a variable in the data:

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
* geom_point(size = 2)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-19-1.png" width="70%" style="display: block; margin: auto;" />

---

## Aesthetics summary

- Continuous variables are measured on a continuous scale
- Discrete variables are measured (or often counted) on a discrete scale

aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | linear mapping between radius and value
shape         | different shape for each | shouldn't (and doesn't) work

- Use aesthetics for mapping features of a plot to a variable; define the features in the geom for customization **not** mapped to a variable

---

class: middle

# Faceting

---

## Faceting

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
* facet_grid(. ~ gender) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
*      subtitle = "Faceted by gender")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" />

---

## Dive further...

.discussion[
In the next few slides describe what each plot displays. Think about how the code relates to the output.
]

--

<br><br><br>

.alert[
The plots in the next few slides do not have proper titles, axis labels, etc. because we want you to figure out what's happening in the plots.
But you should always label your plots!
]

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(gender ~ .)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" />

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ gender)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-22-1.png" width="80%" style="display: block; margin: auto;" />

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_wrap(~ eye_color)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-23-1.png" width="80%" style="display: block; margin: auto;" />

---

## Facet summary

- `facet_grid()`:
    - 2d grid
    - `rows ~ cols`
    - use `.` for no split
- `facet_wrap()`: 1d ribbon wrapped into 2d

---

class: middle

# Identifying variables

---

## Number of variables involved

* Univariate data analysis - distribution of a single variable
* Bivariate data analysis - relationship between two variables
* Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.
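
---

## Types of variables: a sketch in R

As a rough illustration of the distinctions above, the sketch below inspects a few `starwars` columns. How R stores a column is only a hint about its type, not a definition. `height_class` is a hypothetical variable, with arbitrary cut points at 100 and 200 cm, created purely to show what an ordinal variable looks like.

```r
library(tidyverse)

# Numerical variables: check how R stores them.
class(starwars$height)  # stored as whole numbers (integer)
class(starwars$mass)    # stored as a double

# Categorical variable without a natural ordering: list its distinct values.
unique(starwars$gender)

# Hypothetical ordinal variable (not part of the original data):
# bin height into ordered categories using arbitrary cut points.
height_class <- cut(
  starwars$height,
  breaks = c(0, 100, 200, Inf),
  labels = c("short", "medium", "tall"),
  ordered_result = TRUE
)

# An ordered factor records that short < medium < tall.
is.ordered(height_class)
levels(height_class)
```

An ordered factor is one way to tell R that the levels have a natural ordering; an unordered factor or a character column carries no such information.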
---

class: middle

# Visualizing numerical data

---

## Describing shapes of numerical distributions

* shape:
    * skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    * modality: unimodal, bimodal, multimodal, uniform
* center: mean (`mean`), median (`median`), mode (not always useful)
* spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
* unusual observations

---

## Histograms

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />

---

## Density plots

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto;" />

---

## Side-by-side box plots

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-26-1.png" width="75%" style="display: block; margin: auto;" />

---

## Ridge plots

```r
library(ggridges)

ggplot(starwars, aes(x = height, y = gender, fill = gender, color = gender)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" />

---

class: middle

# Visualizing categorical data

---

## Bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-28-1.png" width="80%" style="display: block; margin: auto;" />

---

## Segmented bar plots, counts

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" />

---

## Recode hair color

```r
starwars <- starwars %>%
  mutate(hair_color2 = fct_other(hair_color,
                                 keep = c("black", "brown", "blond")))
```

---

## Segmented bar plots, counts

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color2)) +
  geom_bar() +
  coord_flip()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-31-1.png" width="70%" style="display: block; margin: auto;" />

---

## Segmented bar plots, proportions

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color2)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(y = "proportion")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-32-1.png" width="70%" style="display: block; margin: auto;" />

---

.discussion[
Which bar plot is a more useful representation for visualizing the relationship between gender and hair color?
]

.pull-left[
<img src="02-visualize-data_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="02-visualize-data_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Scatter plot...

This is not a great representation of these data.

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" />

---

## Violin plots

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_violin()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-36-1.png" width="75%" style="display: block; margin: auto;" />

---

## Jitter plot

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_jitter()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-38-1.png" width="75%" style="display: block; margin: auto;" />

---

## Beeswarm plots

```r
library(ggbeeswarm)

ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_beeswarm()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-39-1.png" width="75%" style="display: block; margin: auto;" />

---

.your-turn[
Time to actually play around with the Star Wars dataset!

- Go to RStudio Cloud ([rstd.io/dsbox-cloud](http://rstd.io/dsbox-cloud)) and start `AE 03 - StarWars + Data visualization`.
- Open the R Markdown document and complete the exercise (and if time allows, the stretch goal exercise).
]
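
---

## Layering geoms: a sketch

Since plots are constructed in layers, the geoms from the preceding slides can also be combined in a single plot. The following is a minimal sketch rather than part of the exercise: `starwars_height` is a hypothetical helper name, and the `width` and `alpha` values are arbitrary choices.

```r
library(tidyverse)

# Drop characters with unknown height up front, so the
# missing-value warning does not appear when plotting.
starwars_height <- starwars %>%
  filter(!is.na(height))

# Violins show the shape of each distribution;
# jittered points show the individual observations on top.
p <- ggplot(data = starwars_height, mapping = aes(x = gender, y = height)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.5)

p
```

Layers are drawn in the order they are added, so the points sit on top of the violins here.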
---

class: middle

# Why do we visualize?

---

## Data: `datasaurus_dozen`

Below is an excerpt from the `datasaurus_dozen` dataset:

```
## # A tibble: 142 x 8
##   away_x away_y bullseye_x bullseye_y circle_x circle_y dino_x dino_y
##    <dbl>  <dbl>      <dbl>      <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
## 1   32.3   61.4       51.2       83.3     56.0     79.3   55.4   97.2
## 2   53.4   26.2       59.0       85.5     50.0     79.0   51.5   96.0
## 3   63.9   30.8       51.9       85.8     51.3     82.4   46.2   94.5
## 4   70.3   82.5       48.2       85.0     51.2     79.2   42.8   91.4
## 5   34.1   45.7       41.7       84.0     44.4     78.2   40.8   88.3
## 6   67.7   37.1       37.9       82.6     45.0     77.9   38.7   84.9
## # … with 136 more rows
```

---

## Summary statistics

.small[

```r
library(datasauRus)  # provides datasaurus_dozen

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(r = cor(x, y))
```

```
## # A tibble: 13 x 2
##   dataset        r
##   <chr>      <dbl>
## 1 away     -0.0641
## 2 bullseye -0.0686
## 3 circle   -0.0683
## 4 dino     -0.0645
## 5 dots     -0.0603
## 6 h_lines  -0.0617
## # … with 7 more rows
```

]

---

##

.discussion[
How similar do the relationships between `x` and `y` in the thirteen datasets look? How similar are they based on summary stats?
]

<img src="02-visualize-data_files/figure-html/datasaurus-plot-1.png" width="100%" style="display: block; margin: auto;" />

---

## Anscombe's quartet

```r
library(Tmisc)
quartet
```

.pull-left[
```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
]

.pull-right[
```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]

---

## Summarising Anscombe's quartet

```r
quartet %>%
  group_by(set) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 x 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

---

## Visualizing Anscombe's quartet

```r
ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)
```

<img src="02-visualize-data_files/figure-html/quartet-plot-1.png" width="75%" style="display: block; margin: auto;" />

---

## Age at first kiss

.discussion[
Do you see anything out of the ordinary?
]

<img src="02-visualize-data_files/figure-html/unnamed-chunk-41-1.png" width="90%" style="display: block; margin: auto;" />

---

## Facebook visits

.discussion[
How are people reporting lower vs. higher values of FB visits?
]

<img src="02-visualize-data_files/figure-html/unnamed-chunk-42-1.png" width="90%" style="display: block; margin: auto;" />