Grammar of data transformation

Lecture 3

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2024

September 5, 2024

Warm-up

Announcements

  • AE 01 and AE 02 suggested answers posted on the course website.

  • Ed Discussion posts:

    • Keep them coming!
    • Keep them public!
    • Feel free to make them anonymous if you prefer.
  • Monday’s lab:

    • Show up on time
    • Show up to learn and get support
    • Leave with all questions answered and a plan to work on the lab throughout the week

From last time

ae-02-bechdel-dataviz

Go to RStudio, confirm that you’re in the ae project, and open the document ae-02-bechdel-dataviz.qmd.

Recap: Code cells (aka code chunks)

. . .

  • Cell labels are helpful for describing what the code is doing, for jumping between code cells in the editor, and for troubleshooting

  • message: false hides any messages emitted by the code in your rendered document

Application exercise

ae-03-bechdel-data-viz-transform

  • Go to your ae project in RStudio.

  • Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-03-bechdel-data-viz-transform.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits by the end of class.

Data transformation

A quick reminder

1bechdel |>
2  filter(roi > 400) |>
3  select(title, roi, budget_2013, gross_2013, year, clean_test)
1
Start with the bechdel data frame
2
Filter for movies with roi greater than 400 (gross is more than 400 times budget)
3
Select the columns title, roi, budget_2013, gross_2013, year, and clean_test
# A tibble: 3 × 6
  title                    roi budget_2013 gross_2013  year clean_test
  <chr>                  <dbl>       <dbl>      <dbl> <dbl> <chr>     
1 Paranormal Activity     671.      505595  339424558  2007 dubious   
2 The Blair Witch Proje…  648.      839077  543776715  1999 ok        
3 El Mariachi             583.       11622    6778946  1992 nowomen   

The pipe |>

The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.

sum(1, 2)
[1] 3
1 |> 
  sum(2)
[1] 3


select(filter(bechdel, roi > 400), title)
# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi            
bechdel |>
  filter(roi > 400) |>
  select(title)
# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi            

Code style tip

  • In data transformation pipelines, always use a
    • space before |>
    • line break after |>
    • indent the next line of code
  • In data visualization layers, always use a
    • space before +
    • line break after +
    • indent the next line of code

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Start with the bechdel data frame:

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Filter for rows where binary is equal to "PASS":

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Arrange the rows in descending order of roi:

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Select columns title and roi:

Time permitting

Ask another question of the data that can be answered with a data transformation pipeline.