Data Science in a Box

2023 updates

What the box?

What is Data Science in a Box?

Curriculum for a semester-long introductory data science course, including lecture slides, in-class exercises, homework assignments, computing labs, projects, and interactive tutorials (distributed in a package, dsbox).

https://datasciencebox.org

Hex logo for data science in a box

Update cycle

  • Roughly annually, ideally over the summer, before the new academic year
    • This year, conf 2023 lines up nicely with this!
  • Goal: Summarize updates for a “Teaching the tidyverse in 2023” blog post

Two topics of discussion

  1. Where should the box live?
  2. Picking up a (data analysis) project ~1 year later
    • and the 🐇 holes it makes me go down every year

Where should the box live?

Current home

https://github.com/rstudio-education/datascience-box

RStudio Education organization on GitHub

Option 1: Leave as is

📚 github.com/rstudio-education/datascience-box

📦 github.com/rstudio-education/dsbox


Pros:

  • Less work
  • No worry about proper redirects

Cons:

  • rstudio-education is not an active organization

Option 2: Personal user account

📚 github.com/mine-cetinkaya-rundel/datascience-box

📦 github.com/mine-cetinkaya-rundel/dsbox


Pros:

  • Easy for me

Cons:

  • Lost in a sea of other repos
  • Doesn’t feel as inviting to collaborators (maybe?)

Option 3: New organization

📚 github.com/datascience-box/datascience-box

📦 github.com/datascience-box/dsbox


Pros:

  • Keep curricular materials and package repos together
  • Opportunity for a more future-proof name

Cons:

  • ?


This is what I’m most leaning towards. Any tips for moving a repo from one org to another and making sure redirects will work?

Picking up a (data analysis) project ~1 year later

Process

  • Phase 1 - Revive: Make sure all code runs with the latest versions of packages without errors (duh!) and without messages or warnings (or with messages or warnings addressed in the narrative).
  • Phase 2 - Clean up: Triage issues and PRs.
  • Phase 3 - Improve: Refresh content, datasets, topics, etc. as needed.

The rest of today’s discussion will be on Phase 1 but I’m always happy to talk about the other phases, particularly if you have ideas for Phase 3.

Phase 1 - Dealing with errors

  • Easy peasy to identify since rendering fails.
  • Not a big problem since volume is very low.

There was only 1 breaking change in 93 files, each with lots of code and with functions from (probably all) tidyverse packages and more.

Commit for fixing error due to using NULL instead of NA in if_else()
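
The commit itself is not reproduced here. As a hedged illustration of this kind of failure, and assuming dplyr 1.1.0 or later (the session above shows 1.1.2), the sketch below contrasts NULL and NA as the "no value" branch of if_else(). The vector `x` and the labels are made up for the example.

```r
library(dplyr)

x <- c(1, 5, 10)  # illustrative data, not from the course materials

# NULL is not a vector, so if_else() refuses it as a branch value.
# tryCatch() turns the resulting error into a plain string for inspection.
null_result <- tryCatch(
  if_else(x > 3, "big", NULL),
  error = function(e) "error"
)

# NA is a length-one vector; in dplyr >= 1.1.0 it is cast to the type of
# the other branch, so no typed NA_character_ is needed.
na_result <- if_else(x > 3, "big", NA)

null_result
na_result
```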

Phase 1 - Dealing with messages and warnings

TL;DR - much bigger headache…

  • If HTML files were checked in to Git, it would be more straightforward to inspect diffs to catch them, but (1) I don’t like checking in HTML files and (2) the diffs would have a lot of noise, and the signal would likely get lost in it.
  • Ideas for more principled ways of catching them: Function lifecycle reporter by Mara
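
One possible building block for a more principled approach, sketched here in base R only: evaluate code while recording every message and warning it signals, so the record can be inspected or compared later. This is not the lifecycle reporter mentioned above; `collect_conditions()` is a hypothetical helper written for this example.

```r
# Evaluate `expr`, recording each message and warning instead of printing it.
# `collect_conditions()` is a hypothetical helper, not part of any package.
collect_conditions <- function(expr) {
  found <- list()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      found[[length(found) + 1]] <<- list(
        type = "message",
        text = conditionMessage(m)
      )
      invokeRestart("muffleMessage")  # keep going, print nothing
    },
    warning = function(w) {
      found[[length(found) + 1]] <<- list(
        type = "warning",
        text = conditionMessage(w)
      )
      invokeRestart("muffleWarning")  # keep going, print nothing
    }
  )
  list(value = value, conditions = found)
}

# Usage with made-up conditions standing in for real package output
res <- collect_conditions({
  message("note: using a default method")
  warning("this argument is deprecated")
  42
})

res$value
vapply(res$conditions, function(cnd) cnd$type, character(1))
```

Because the handlers muffle each condition and resume, the code runs to completion and the full list is available afterwards, which is the property a reporter would need.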

which brings us to…

What’s in a message? What’s in a warning?

The obligatory…

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

Messages

I think some messages …

  • call to attention: informative, not expecting you to take action
  • call to action: expect you to read them and take action to make them go away

Messages: call to attention

penguins |>
  drop_na() |>
  ggplot(
    aes(x = body_mass_g, y = flipper_length_mm)
  ) +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Messages: call to action

penguins |>
  group_by(island, species) |>
  summarize(mean_bm = mean(body_mass_g, na.rm = TRUE)) |>
  ungroup()
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
  island    species   mean_bm
  <fct>     <fct>       <dbl>
1 Biscoe    Adelie      3710.
2 Biscoe    Gentoo      5076.
3 Dream     Adelie      3688.
4 Dream     Chinstrap   3733.
5 Torgersen Adelie      3706.
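
Taking the action the message asks for could look like the sketch below: state `.groups` explicitly. It uses a small made-up tibble (`toy`) rather than penguins so that it runs with dplyr alone.

```r
library(dplyr)

# Made-up data standing in for penguins; names are illustrative only
toy <- tibble(
  island  = c("A", "A", "B", "B"),
  species = c("x", "y", "x", "x"),
  mass    = c(10, 20, 30, 50)
)

# Setting `.groups` explicitly removes the message and documents the intent.
# "drop" returns an ungrouped tibble, so no ungroup() is needed afterwards.
out <- toy |>
  group_by(island, species) |>
  summarize(mean_mass = mean(mass), .groups = "drop")

out
```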

Warnings

I think some warnings…

  • call to attention: informative, not expecting you to take action
  • call to action: expect you to read them and take action to make them go away

Warnings: call to attention

ggplot(
  penguins,
  aes(x = body_mass_g, y = flipper_length_mm)
  ) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

Warnings: call to action

df2 <- tibble(
  x = c(1, 1, 2), 
  y = c("first", "second", "third")
)
df3 <- tibble(x = c(1, 1, 1, 3))
df3 |> left_join(df2)
Joining with `by = join_by(x)`
Warning in left_join(df3, df2): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 7 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     1 first 
4     1 second
5     1 first 
6     1 second
7     3 <NA>  
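
If the many-to-many match is intended, acting on both the message and the warning could look like this sketch (assuming dplyr 1.1.0 or later, where `join_by()` and `relationship` are available):

```r
library(dplyr)

df2 <- tibble(
  x = c(1, 1, 2),
  y = c("first", "second", "third")
)
df3 <- tibble(x = c(1, 1, 1, 3))

# `by = join_by(x)` silences the "Joining with" message;
# `relationship = "many-to-many"` declares the duplication is expected,
# which silences the warning.
joined <- df3 |>
  left_join(df2, by = join_by(x), relationship = "many-to-many")

joined
```

The result has the same 7 rows as above; only the message and the warning go away.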

… which brings up questions like

Why does using size instead of linewidth give a warning while not specifying .groups in summarize() give a message when they both seem to call to action?

# linewidth warning
penguins |>
  drop_na() |>
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_smooth(
    method = "lm", formula = "y ~ x", 
    size = 3
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

# .groups message
penguins |>
  group_by(island, species) |>
  summarize(
    mean_bm = mean(body_mass_g, na.rm = TRUE)
  )
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups:   island [3]
  island    species   mean_bm
  <fct>     <fct>       <dbl>
1 Biscoe    Adelie      3710.
2 Biscoe    Gentoo      5076.
3 Dream     Adelie      3688.
4 Dream     Chinstrap   3733.
5 Torgersen Adelie      3706.
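
Part of why the distinction matters in practice: R signals messages and warnings as different condition classes, so tools treat them differently even when both call for action. A minimal base R sketch, with hypothetical functions standing in for package code:

```r
# Hypothetical stand-ins for package functions that nudge the user
f_message <- function() {
  message("you can override using an argument")
  "done"
}
f_warning <- function() {
  warning("this usage is deprecated")
  "done"
}

# Report which kind of condition (if any) an expression signals first.
# `classify()` is a hypothetical helper for this example.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "silent"
    },
    message = function(m) "message",
    warning = function(w) "warning"
  )
}

classify(f_message())
classify(f_warning())
classify(1 + 1)
```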

TL;DR

  • Human review of changes is not always realistic or bulletproof. I would love to revisit the function lifecycle reporter idea!
  • Detecting all warnings in all documents in a project is possible with options(warn = 2) but then you see every warning whether they’re meaningful or not.
    • Workaround: Add error: true as a chunk option to chunks with “expected” warnings (e.g., NAs in ggplots) so rendering continues past them, but that doesn’t feel like a great solution.
  • Detecting messages, particularly new/changed ones, is even harder.
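
A small base R sketch of the asymmetry described in these bullets: options(warn = 2) promotes every warning to an error, meaningful or not, while messages pass through untouched. The functions `risky()` and `chatty()` are hypothetical stand-ins for code in a document.

```r
old <- options(warn = 2)  # promote every warning to an error

# Hypothetical stand-ins for code chunks in a document
risky <- function() {
  warning("Removed 2 rows containing missing values")
  "finished"
}
chatty <- function() {
  message("using method = 'loess'")
  "finished"
}

# The warning now stops execution, so the error handler fires
res_warning <- tryCatch(risky(), error = function(e) "stopped")

# The message is unaffected by `warn`; the function still completes
res_message <- suppressMessages(chatty())

options(old)  # restore the previous setting

res_warning
res_message
```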