Activity #2

Use ellmer to interact with an LLM (from Anthropic, in this case) to generate feedback.

Tooling

prompt.R:

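# Requires ellmer (chat_anthropic), glue (glue, glue_collapse), and readr (read_lines).
# Note: rubric_detailed is not an argument; it is looked up in the environment
# where prompt() is defined (typically the global environment).
prompt <- function(question, rubric, answer) 
{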
    chat <- chat_anthropic(system_prompt = paste("You are a helpful course instructor teaching a course on data science with the R programming language and the tidyverse and tidymodels suite of packages. You like to give succinct but precise feedback.", 
        glue_collapse(read_lines("system-prompt/pipes.qmd"))))
    chat$chat(glue("Carefully read the {question} and the {rubric_detailed}, then evaluate {answer} against the {rubric_detailed} to provide feedback. \n      Provide feedback in an output section named **Feedback:**. \n      Format the feedback as bullet points: Each bullet point should first state the rubric item text from {rubric}, and then provide one sentence explaining whether the {answer} meets the {rubric} item.\n      Do not give away the correct answer in the feedback."))
}

Note: The system prompt also includes the contents of system-prompt/pipes.qmd, a file that holds the pipes chapter of the Tidyverse style guide.
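Because the function above builds the user prompt and sends it in one step, it can be hard to see exactly what text the model receives. A minimal sketch of separating the two, assuming a hypothetical helper named build_feedback_prompt() (not part of prompt.R) that takes rubric_detailed as an explicit argument and uses a shortened version of the template:

```r
library(glue)

# Hypothetical helper: builds the user prompt only, without contacting a model.
# The template is abbreviated from the one in prompt.R.
build_feedback_prompt <- function(question, rubric, rubric_detailed, answer) {
  glue(
    "Carefully read the {question} and the {rubric_detailed}, ",
    "then evaluate {answer} against the {rubric_detailed} to provide feedback. ",
    "Each bullet point should first state the rubric item text from {rubric}."
  )
}

# Placeholder inputs so the interpolation is easy to spot in the result.
p <- build_feedback_prompt(
  question = "QUESTION TEXT",
  rubric = "RUBRIC TEXT",
  rubric_detailed = "DETAILED RUBRIC TEXT",
  answer = "ANSWER TEXT"
)
p
```

Printing p before passing it to chat$chat() is one way to confirm that each placeholder was filled with the intended text.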

Example: Tidying

Below is a question on tidying.

For this question you will work with inflation data from various countries in the world over the last 30 years. The dataset is called country-inflation.csv and it’s in your data folder.

Reshape (pivot) country_inflation such that each row represents a country/year combination, with columns country, year, and annual_inflation. Then, display the resulting data frame and state how many rows and columns it has.

Code for constructing function inputs

question <- read_lines("questions/tidy-country-inflation-pivot-longer/tidy-country-inflation-pivot-longer-Q.qmd") |>
  glue_collapse() |>
  str_remove("---.*?---")

rubric <- read_lines("questions/tidy-country-inflation-pivot-longer/tidy-country-inflation-pivot-longer-R.qmd") |>
  glue_collapse() |>
  str_remove("---.*?---")

rubric_detailed <- read_lines("questions/tidy-country-inflation-pivot-longer/tidy-country-inflation-pivot-longer-RD.qmd") |>
  glue_collapse() |>
  str_remove("---.*?---")

bad_answer <- read_lines("questions/tidy-country-inflation-pivot-longer/tidy-country-inflation-pivot-longer-sample-answer-bad.qmd") |>
  glue_collapse() |>
  str_remove("---.*?---")

good_answer <- read_lines("questions/tidy-country-inflation-pivot-longer/tidy-country-inflation-pivot-longer-sample-answer-good.qmd") |>
  glue_collapse() |>
  str_remove("---.*?---")
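The five pipelines above repeat the same three steps, so they could be wrapped in one function. A minimal sketch, assuming a hypothetical helper named read_qmd_body() and a temporary file standing in for a real question file. Note that glue_collapse() with no separator joins all lines into a single string, which is what lets the pattern "---.*?---" span the whole YAML header:

```r
library(readr)
library(glue)
library(stringr)

# Hypothetical helper: read a .qmd file, collapse it to one string,
# and drop the first block delimited by "---" (the YAML front matter).
read_qmd_body <- function(path) {
  read_lines(path) |>
    glue_collapse() |>
    str_remove("---.*?---")
}

# Temporary stand-in for a question file.
tmp <- tempfile(fileext = ".qmd")
write_lines(c("---", "title: Demo", "---", "", "Reshape the data."), tmp)

body <- read_qmd_body(tmp)
body
```

One consequence of collapsing without a separator is that line breaks in the body are lost as well, so multi-line code in an answer file reaches the model as one run of text.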

Below is what we might consider a bad answer to this question.

library(tidyverse)

country_inflation <- read_csv("data/country-inflation.csv")

There are 3 columns and 1178 rows in this new data frame.

df <- country_inflation |> 
    pivot_longer(
    cols = -country,names_to = "year", 
    values_to = "inflationrate"
               )
print(df)

# A tibble: 1,178 × 3
   country   year  inflationrate
   <chr>     <chr>         <dbl>
 1 Australia 1993          1.75 
 2 Australia 1994          1.97 
 3 Australia 1995          4.63 
 4 Australia 1996          2.62 
 5 Australia 1997          0.225
 6 Australia 1998          0.860
 7 Australia 1999          1.48 
 8 Australia 2000          4.46 
 9 Australia 2001          4.41 
10 Australia 2002          2.98 
# ℹ 1,168 more rows

Provide feedback on the answer above using the following rubric.

Code uses pivot_longer().
Code names the data frame something short and informative.
Code names the year variable year and the inflation variable annual_inflation.
Code transforms the year variable to numeric inside pivot_longer().
Output displays country_inflation_longer.
Output has 3 columns: country, year, and annual_inflation.
Narrative states the correct numbers of rows and columns.
Code style and readability: Line breaks after each |>, proper indentation, spaces around = signs if they are present, and spaces after commas if they are present.

prompt(
  question = question, 
  rubric = rubric,
  answer = bad_answer
)

Using model = "claude-sonnet-4-20250514".

**Feedback:**

- **Code uses `pivot_longer()`**: ✓ The code correctly uses `pivot_longer()` to
reshape the data from wide to long format.

- **Code names the data frame something short and informative**: ✗ The data 
frame is named `df` which is too generic and not informative about what the 
data contains.

- **Code names the year variable `year` and the inflation variable 
`annual_inflation`**: ✗ While the year variable is correctly named `year`, the 
inflation variable is named `inflationrate` instead of the required 
`annual_inflation`.

- **Code transforms the year variable to numeric inside `pivot_longer()`**: ✗ 
The code is missing the `names_transform = as.numeric` argument that was 
specifically required to convert the year variable to numeric type.

- **Output displays `country_inflation_longer`**: ✗ The output shows `df` 
instead of the expected `country_inflation_longer` data frame name.

- **Output has 3 columns: `country`, `year`, and `annual_inflation`**: ✗ While 
there are 3 columns, the inflation column is named `inflationrate` rather than 
`annual_inflation`.

- **Narrative states the correct numbers of rows and columns**: ✗ The narrative
states "1178 rows and 3" but should state "1,178 rows and 3 columns" for 
complete accuracy.

- **Code style and readability**: ✗ The code has inconsistent spacing (missing 
spaces around `=` in `names_to = "year"` and `values_to = "inflationrate"`), 
and the closing parenthesis alignment could be improved.

Provide feedback on the same bad answer a second time.

prompt(
  question = question, 
  rubric = rubric,
  answer = bad_answer
)

Using model = "claude-sonnet-4-20250514".

**Feedback:**

• **Code uses `pivot_longer()`** - ✓ The code correctly uses `pivot_longer()` 
to reshape the data from wide to long format.

• **Code names the data frame something short and informative** - ✗ The data 
frame is named `df` which is not informative about what the data contains.

• **Code names the year variable `year` and the inflation variable 
`annual_inflation`** - ✗ While the year variable is correctly named `year`, the
inflation variable is named `inflationrate` instead of the required 
`annual_inflation`.

• **Code transforms the year variable to numeric inside `pivot_longer()`** - ✗ 
The code is missing the `names_transform = as.numeric` argument that was 
specifically required to convert the year variable to numeric during pivoting.

• **Output displays `country_inflation_longer`** - ✗ The output shows a data 
frame named `df` rather than the expected `country_inflation_longer`.

• **Output has 3 columns: `country`, `year`, and `annual_inflation`** - ✗ While
there are 3 columns, the third column is named `inflationrate` rather than 
`annual_inflation`.

• **Narrative states the correct numbers of rows and columns** - ✗ The 
narrative states 1178 rows and 3 columns but places this information before the
code chunk rather than after displaying the results.

• **Code style and readability** - ✗ The code has inconsistent spacing (missing
spaces after commas in some places) and could benefit from better formatting 
alignment within the `pivot_longer()` function.
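The two runs above use different bullet characters and give different reasons for some items, so comparing them by eye is error-prone. A minimal sketch of comparing only the verdicts, assuming a hypothetical helper named extract_verdicts() and abbreviated stand-ins for the first two bullets of each response:

```r
library(stringr)

# Abbreviated stand-ins for the first two bullets of each run above.
run_1 <- c(
  "- **Code uses `pivot_longer()`**: ✓ The code correctly uses `pivot_longer()`.",
  "- **Code names the data frame something short and informative**: ✗ The data frame is named `df`."
)
run_2 <- c(
  "• **Code uses `pivot_longer()`** - ✓ The code correctly uses `pivot_longer()`.",
  "• **Code names the data frame something short and informative** - ✗ The data frame is named `df`."
)

# Hypothetical helper: pull the first check or cross mark out of each bullet.
extract_verdicts <- function(bullets) {
  str_extract(bullets, "[✓✗]")
}

verdicts_1 <- extract_verdicts(run_1)
verdicts_2 <- extract_verdicts(run_2)

# TRUE when both runs reach the same verdict on every rubric item.
identical(verdicts_1, verdicts_2)
```

A check like this only compares the marks; it says nothing about whether the one-sentence explanations agree, which is where the two runs above differ.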

Below is what we might consider a good answer to this question.

library(tidyverse)

country_inflation <- read_csv("data/country-inflation.csv")

The data frame, “yearly_inflation,” has 1,178 rows and three columns.

yearly_inflation <- country_inflation |>
  pivot_longer(
    cols = `1993`:`2023`,
    names_to = "year",
    names_transform = as.numeric,
    values_to = "annual_inflation"
  )

yearly_inflation

# A tibble: 1,178 × 3
   country    year annual_inflation
   <chr>     <dbl>            <dbl>
 1 Australia  1993            1.75 
 2 Australia  1994            1.97 
 3 Australia  1995            4.63 
 4 Australia  1996            2.62 
 5 Australia  1997            0.225
 6 Australia  1998            0.860
 7 Australia  1999            1.48 
 8 Australia  2000            4.46 
 9 Australia  2001            4.41 
10 Australia  2002            2.98 
# ℹ 1,168 more rows

Provide feedback on the answer above.

prompt(
  question = question, 
  rubric = rubric,
  answer = good_answer
)

Using model = "claude-sonnet-4-20250514".

**Feedback:**

• **Code uses `pivot_longer()`:** ✓ The code correctly uses `pivot_longer()` to
reshape the data from wide to long format.

• **Code names the data frame something short and informative:** ✓ The data 
frame is named `yearly_inflation`, which is both concise and descriptive of its
contents.

• **Code names the year variable `year` and the inflation variable 
`annual_inflation`:** ✓ The code properly uses `names_to = "year"` and 
`values_to = "annual_inflation"` to create the required column names.

• **Code transforms the year variable to numeric inside `pivot_longer()`:** ✓ 
The code includes `names_transform = as.numeric` to convert the year variable 
to numeric type during the pivot operation.

• **Output displays `country_inflation_longer`:** ✗ The rubric expects the 
output to display a data frame named `country_inflation_longer`, but the code 
creates and displays `yearly_inflation` instead.

• **Output has 3 columns: `country`, `year`, and `annual_inflation`:** ✓ The 
resulting data frame contains the three required columns with the correct 
names.

• **Narrative states the correct numbers of rows and columns:** ✓ The narrative
correctly states that the data frame has 1,178 rows and 3 columns.

• **Code style and readability:** ✓ The code follows proper tidyverse style 
with line breaks after the pipe operator, appropriate indentation, and proper 
spacing around operators and after commas.
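The rubric item on transforming year inside pivot_longer() is where the two sample answers differ in column type: the bad answer prints year as <chr>, the good answer as <dbl>. A minimal sketch of that difference on a made-up two-country tibble (the values are illustrative, not taken from country-inflation.csv):

```r
library(tidyr)
library(tibble)

# Toy wide data; the numbers are made up for illustration.
toy <- tibble(
  country = c("A", "B"),
  `1993` = c(1.5, 2.0),
  `1994` = c(1.7, 2.2)
)

# Without names_transform, the former column names stay character.
year_chr <- toy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "annual_inflation"
  )

# With names_transform, year is converted during the pivot.
year_num <- toy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    names_transform = as.numeric,
    values_to = "annual_inflation"
  )

class(year_chr$year)
class(year_num$year)
```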

Example: Modeling

Below is a question on modeling:

For this part of the analysis, the dataset that will be used is called births14, and it is included in the openintro package you loaded at the beginning of the assignment. Note: this question relies on an earlier part of the assignment, where the births14 dataset is filtered to remove NA values and assigned the name births14_habitgiven.

Fit a linear model that predicts weight from habit and save the model object. Then, provide the tidy summary output.
Write the estimated least squares regression line below using proper notation.
Interpret the intercept in the context of the data and the research question. Is the intercept meaningful in this context? Why or why not?
Interpret the slope in the context of the data and the research question.