class: center, middle, inverse, title-slide # Literate programming with
R Markdown ## Teaching Data Science, Reproducibly
@ ICOTS 2018 ### Mine Çetinkaya-Rundel & Colin Rundel ### July 7, 2018 --- class: center, middle # Reproducibility: who cares? --- ## Science retracts gay marriage paper without agreement of lead author - In May 2015 Science retracted a study of how canvassers can sway people's opinions about gay marriage published just 5 months earlier. - Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false. - Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked. - Methods we'll discuss today can't prevent this, but they can make it easier to discover issues. <font size="3">Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent</font> --- ## Bad spreadsheet merge kills depression paper, quick fix resurrects it - **Original conclusion:** Lower levels of CSF IL-6 were associated with current depression and with future depression [...]. - **Revised conclusion:** Higher levels of CSF IL-6 and IL-8 were associated with current depression [...]. <br><br><br><br><br> <font size="3">Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/</font> --- ## Divorce study felled by a coding error gets a second chance - **Original conclusion:** The risk of divorce in a heterosexual marriage increases when the wife falls ill, but not the husband. - **Corrected conclusion:** Based on the corrected analysis, we conclude that there are not gender differences in the relationship between gender, pooled illness onset, and divorce. <br><br><br><br><br><br> <font size="3">Source: http://retractionwatch.com/2015/09/10/divorce-study-felled-by-a-coding-error-gets-a-second-chance/#more-32151</font> --- ## Divorce study retraction: Editor's note - "The research environment is fast-paced given the ethos to “publish or perish"." - "[...] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods [...] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables." - "Given these issues, I would not be surprised if coding errors were fairly common [...]" <br> <font size="3">Source: http://retractionwatch.com/2015/09/10/divorce-study-felled-by-a-coding-error-gets-a-second-chance/#more-32151</font> --- class: center, middle # Reproducibility: why should you care? --- ## Think back to every time... - The results in Table 1 don't seem to correspond to those in Figure 2. - In what order do I run these scripts? - Where did we get this data file? - Why did I omit those samples? - How did I make that figure? - "Your script is now giving an error." - "The attached is similar to the code we used." <br><br><br><br><br> <font size="3">Source: Karl Broman</font> --- ## No collaborators? <br><br><br><br> >Your closest collaborator is you six months ago, <br> but you don’t reply to emails. <br><br> <font size="3">- Mark Holder</font> <br><br><br> --- ## Two pronged approach \#1 Convince researchers to adopt a reproducible research workflow <br><br> \#2 Train new researchers who don’t have any other workflow --- class: center, middle # Reproducibility: how? --- ## Reproducibility checklist - Are the tables and figures reproducible from the code and data? - Does the code actually do what you think it does? - In addition to what was done, is it clear *why* it was done? (e.g., how were parameter settings chosen?) - Can the code be used for other data? - Can you extend the code to do other things? --- ## Ambitious goal + many other concerns We need an environment where - data, analysis, and results are tightly connected, or better yet, inseparable - reproducibility is built in + the original data remains untouched + all data manipulations and analyses are inherently documented - documentation is human readable and syntax is minimal --- ## Roadmap 1. Scriptability `\(\rightarrow\)` R 2. Literate programming `\(\rightarrow\)` R Markdown 3. Version control `\(\rightarrow\)` Git / GitHub --- class: center, middle # Literate Programming --- ## Donald Knuth "Literate Programming (1983)" "Instead of imagining that our main task is to instruct a *computer* what to do, let us concentrate rather on explaining to *human beings* what we want a computer to do." "The practitioner of literate programming [...] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other." - These ideas have been around for years! - And tools for putting them to practice have also been around - But they have never been as accessible as the current tools: R Markdown --- ## What is Markdown? - Markdown is a lightweight markup language for creating HTML (or XHTML) documents. - Markup languages are designed to produce documents from human readable text (and annotations). - Some of you may be familiar with LaTeX. This is another (less human friendly) markup language for creating pdf documents. - Why I love Markdown: + Simple syntax means easy to learn and use. + Focus on **content**, rather than **coding** and debugging **errors**. + Allows for easy web authoring. + Once you have the basics down, you can get fancy and add HTML, JavaScript, and CSS. --- ## Sample Markdown document <img src="img/markdown.png" alt="markdown" style="width:900px"> --- ## What is R Markdown? Well, it's R + Markdown: - Ease of Markdown syntax - Rendering of R code to produce output and plots - Ability to include LaTeX: `\(\hat{y} = \beta_0 + \beta_1 \times x\)` --- ## Sample R Markdown document <img src="img/rmarkdown.png" alt="rmarkdown" style="width:1000px"> --- ## Another R Markdown document <br><br><br><br> <center> This presentation! </center> --- class: center, middle # R Markdown --- ## It's your lucky day! You got some data. - `WorldCupMatches-01.csv`: Match info for each game in pre-2000 World Cups - Codebook at https://github.com/mine-cetinkaya-rundel/teach-data-sci-icots2018/blob/master/01-03-rmarkdown/data/README.md - Ultimate goal: Visualize the total number of goals for each World Cup over time. .instructions[ Go to the project called **World Cup Goals** on RStudio Cloud and open `world-cup-goals.Rmd`. Knit the document. Then, update the **yaml** with your information, and knit again. ] --- ## The YAML YAML: Yet another Markdown language - Fields like `title`, `subtitle`, `author`, `date` - You can also change `output` formats: `html_document` for web authoring, `github_document` for markdown document easily viewable on GitHub, `pdf_document` for PDF (requires TeX), `word_document` for MS Word (requires Word) - Can use inline R code in values (see `date`) --- ## Chunk options - Turn off messages with `message = FALSE` - Turn off warnings with `warning = FALSE` - Hide code with `echo = FALSE` - Exclude chunk from doc with `include = FALSE` to prevent code and results from appearing in the finished file. Code in the chunk will still be ran, and the results can be used by other chunks. - Display error messages in document with `error = TRUE`, as opposed to stopping render when errors occur `error = FALSE`, which is the default - Set these per chunk or globally in a `setup` chunk on top of the document with `knitr::opts_chunk$set(...)` <img src="img/chunk-opts.png" alt="markdown" style="width:900px"> --- ## Not so lucky after all .instructions[ Turns out there is an error in the data you received: The number of `home_team_goals` in 1998 by Brazil (in the game vs. Denmark played on 03 Jul 1998) should be 3, not 0. Implement a fix and redo the analysis. ] --- ## More data! .instructions[ And now you received more data: World Cup matches post-2000. The data are in `data/WorldCupMatches-02.csv`. Redo the analysis combining data from both files. ] --- ## Tips - Make sure RStudio and the `rmarkdown` package (and its dependencies) are up-to-date. -- - Set a global option for `error = TRUE` (or for a given chunk) so that your document renders even when there are errors. -- - Don’t try to change working directory within an R Markdown document. (If you do still decide to use setwd in a code chunk, beware that the new working directory will only apply to that specific code chunk, and any following code chunks will revert back to use the original working directory.)