Midterm review

Lecture 11

Author: Dr. Mine Çetinkaya-Rundel
Affiliation: Duke University, STA 199 - Fall 2024
Published: October 3, 2024

Warm-up

While you wait…

  • Go to your ae project in RStudio

  • Make sure you have extracted each piece of information we need from The Chronicle opinion page, up to the “create a data frame” step in chronicle-scrape.R.

Announcements

Midterm things:

  • Exam room: Bio Sci 111 or Gross Hall 107

  • Cheat sheet: 8.5x11, both sides, hand written or typed, any content you want, must be prepared by you

  • Also bring a pencil and eraser (you’re allowed to use a pen, but you might not want to)

  • Reminder: Academic dishonesty / Duke Community Standard

From last time: ae-10

Opinion articles in The Chronicle

  • Scrape data and organize it in a tidy format in R
  • Perform light text parsing to clean data
  • Summarize and visualize the data

ae-10-chronicle-scrape

  • Go to your ae project in RStudio.

  • Open chronicle-scrape.R and ae-10-chronicle-scrape.qmd.

Recap

  • Use the SelectorGadget to identify tags for elements you want to grab
  • Use rvest to first read the whole page (into R) and then parse the object you’ve read in to the elements you’re interested in
  • Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
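
The three recap steps can be sketched on a small inline page. This is a minimal sketch, not the code from chronicle-scrape.R: the HTML snippet, the CSS selectors (.headline, .byline), and the column names are made up for illustration, and minimal_html() stands in for read_html() on a live URL so that nothing is actually scraped.

```r
library(rvest)
library(tibble)

# Hypothetical page content; in practice this would be read_html() on a URL
page <- minimal_html('
  <div class="article">
    <a class="headline" href="/a1">First column</a>
    <span class="byline">By A. Writer</span>
  </div>
  <div class="article">
    <a class="headline" href="/a2">Second column</a>
    <span class="byline">By B. Author</span>
  </div>
')

# Parse the page object into the elements of interest
titles  <- page |> html_elements(".headline") |> html_text2()
urls    <- page |> html_elements(".headline") |> html_attr("href")
authors <- page |> html_elements(".byline") |> html_text2()

# Put the components together in a tibble
articles <- tibble(title = titles, author = authors, url = urls)
articles
```

On a real page, the selectors would come from the SelectorGadget rather than from class names chosen here.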

A new R workflow

  • When working in a Quarto document, your analysis is re-run each time you render

  • If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice)!

  • An alternative workflow:

    • Use an R script to save your scraping code
    • Save the interim data scraped by the script as CSV or RDS files
    • Use the saved data in your analysis in your Quarto document
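
A minimal sketch of this workflow, assuming the readr package and a made-up data frame in place of real scraped data. The file name and column names are hypothetical, and a temporary directory is used here only so the example is self-contained; in a project you would likely write to a data/ folder instead.

```r
library(readr)
library(tibble)

# Stand-in for the data frame produced by a scraping script (hypothetical values)
scraped <- tibble(
  title  = c("First column", "Second column"),
  author = c("A. Writer", "B. Author")
)

# In the R script: save the interim data once
csv_path <- file.path(tempdir(), "chronicle.csv") # hypothetical file name
write_csv(scraped, csv_path)

# In the Quarto document: read the saved file instead of re-scraping
chronicle <- read_csv(csv_path, show_col_types = FALSE)
chronicle
```

write_rds() and read_rds() follow the same pattern and preserve column types such as factors, which a CSV file does not.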

From a previous time: ae-09-age-gaps-sales-import - Part 2

ae-09-age-gaps-sales-import

  • Go to your ae project in RStudio.

  • Open ae-09-age-gaps-sales-import.qmd and go to Part 2.

Review: Quarto workflow

Your document environment vs. your global environment

  • Objects you define in your Quarto document are available in your Quarto document, and if you run that code cell individually, they will also be available in your global environment

  • Objects you define in your global environment are not, by default, available in your Quarto document

Recipe for success for reproducible documents

  • Render – after each “win” or at each stopping point

  • Commit – all files, with a commit message that describes the substance of what you did

  • Push – to make sure files are updated on GitHub as well

Questions… or…

Time permitting: Type coercion

Explicit vs. implicit type coercion

  • Explicit type coercion: You ask R to change the type of a variable

  • Implicit type coercion: R changes / makes assumptions for you about the type of a variable without you asking for it

    • This happens because in a vector, you can’t have multiple types of values

Vectors

  • A vector is a collection of values

    • Atomic vectors can only contain values of the same type

    • Lists can contain values of different types

  • Why do we care? Because each column of a data frame is a vector.
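
A small sketch of the difference between atomic vectors and lists, using typeof() to inspect the result. The object names are made up for illustration.

```r
# Atomic vectors hold one type, so mixed inputs are coerced to a common type
num_vec   <- c(1, 2, 3)
mixed_chr <- c(1, "a")   # the number becomes the string "1"
mixed_num <- c(TRUE, 2)  # TRUE becomes 1

typeof(num_vec)   # "double"
typeof(mixed_chr) # "character"
typeof(mixed_num) # "double"

# Lists keep each element's own type
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, typeof) # "double" "character" "logical"
```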

library(tidyverse)

df <- tibble(
  x = c(1, 2, 3),          # numeric (double)
  y = c("a", "b", "c"),    # character
  z = c(TRUE, FALSE, TRUE) # logical
)
df
# A tibble: 3 × 3
      x y     z    
  <dbl> <chr> <lgl>
1     1 a     TRUE 
2     2 b     FALSE
3     3 c     TRUE 

Explicit coercion

✅ From numeric to character

df |>
  mutate(x_new = as.character(x))
# A tibble: 3 × 4
      x y     z     x_new
  <dbl> <chr> <lgl> <chr>
1     1 a     TRUE  1    
2     2 b     FALSE 2    
3     3 c     TRUE  3    

Explicit coercion

❌ From character to numeric

df |>
  mutate(y_new = as.numeric(y))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `y_new = as.numeric(y)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 3 × 4
      x y     z     y_new
  <dbl> <chr> <lgl> <dbl>
1     1 a     TRUE     NA
2     2 b     FALSE    NA
3     3 c     TRUE     NA
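
The conversion above fails because letters have no numeric interpretation. As a small sketch with made-up values, as.numeric() works element by element: strings that look like numbers convert, and anything else becomes NA.

```r
digits <- c("1", "2.5", "three")

# as.numeric() normally warns "NAs introduced by coercion" here;
# suppressWarnings() is used only to keep this example quiet
digits_num <- suppressWarnings(as.numeric(digits))
digits_num # 1.0 2.5 NA
```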

Implicit coercion

Which of the column types were implicitly coerced?

df <- tibble(
  w = c(1, 2, 3),
  x = c("a", "b", 4),
  y = c("c", "d", NA),
  z = c(5, 6, NA),
)
df
# A tibble: 3 × 4
      w x     y         z
  <dbl> <chr> <chr> <dbl>
1     1 a     c         5
2     2 b     d         6
3     3 4     <NA>     NA
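
One way to explore the question is to check types outside the tibble. This sketch rebuilds the same vectors under made-up names; note that a bare NA is logical, so it is also coerced to match the other values in its vector.

```r
typeof(NA) # "logical"

x_vec <- c("a", "b", 4)   # the number 4 becomes the string "4"
y_vec <- c("c", "d", NA)  # NA becomes a character NA
z_vec <- c(5, 6, NA)      # NA becomes a numeric (double) NA

typeof(x_vec) # "character"
typeof(y_vec) # "character"
typeof(z_vec) # "double"

# The missing value is still missing after coercion
is.na(y_vec) # FALSE FALSE  TRUE
```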

Collecting data

Suppose you conduct a survey and ask students their student ID number and number of credits they’re taking this semester. What is the type of each variable?

survey_raw <- tibble(
  student_id = c(273674, 298765, 287129, "I don't remember"),
  n_credits = c(4, 4.5, "I'm not sure yet", "2 - underloading")
)
survey_raw
# A tibble: 4 × 2
  student_id       n_credits       
  <chr>            <chr>           
1 273674           4               
2 298765           4.5             
3 287129           I'm not sure yet
4 I don't remember 2 - underloading

Cleaning data

survey <- survey_raw |>
  mutate(
    student_id = if_else(student_id == "I don't remember", NA, student_id),
    n_credits = case_when(
      n_credits == "I'm not sure yet" ~ NA,
      n_credits == "2 - underloading" ~ "2",
      .default = n_credits
    ),
    n_credits = as.numeric(n_credits)
  )
survey
# A tibble: 4 × 2
  student_id n_credits
  <chr>          <dbl>
1 273674           4  
2 298765           4.5
3 287129          NA  
4 <NA>             2  

Cleaning data – alternative

survey <- survey_raw |>
  mutate(
    student_id = parse_number(student_id),
    n_credits = parse_number(n_credits)
  )
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `student_id = parse_number(student_id)`.
Caused by warning:
! 1 parsing failure.
row col expected           actual
  4  -- a number I don't remember
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
survey
# A tibble: 4 × 2
  student_id n_credits
       <dbl>     <dbl>
1     273674       4  
2     298765       4.5
3     287129      NA  
4         NA       2  

Recap: Type coercion

  • If variables in a data frame have multiple types of values, R will coerce them into a single type, which may or may not be what you want.

  • If what R does by default is not what you want, use explicit coercion functions like as.numeric(), as.character(), etc. to convert variables to the types you want. This generally also involves cleaning up the features of the data that caused the unwanted implicit coercion in the first place.

Time permitting: Aesthetic mappings

openintro::loan50

Aesthetic mappings

What will the following code result in?

Global mappings

What will the following code result in?

Local mappings

What will the following code result in?

Mapping vs. setting

What will the following code result in?

Recap: Aesthetic mappings

  • Aesthetic mapping defined at the global level will be used by all geoms for which the aesthetic is defined.

  • Aesthetic mapping defined at the local level will be used only by the geoms they’re defined for.
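
A sketch of the distinctions in this section, using loan50 and the same variables that appear elsewhere in these slides. These plots are illustrations written for this recap, not necessarily the ones shown in lecture; the object names and the choice of geom_smooth() are assumptions.

```r
library(ggplot2)
library(openintro) # provides the loan50 data

# Global mapping: color is set in ggplot(), so points and lines both use it
p_global <- ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate, color = homeownership)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: color applies to the points only, so there is a single line
p_local <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(aes(color = homeownership)) +
  geom_smooth(method = "lm", se = FALSE)

# Setting: color outside aes() is a fixed value, not tied to a variable
p_set <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(color = "red")
```

Printing p_global, p_local, and p_set in turn shows the difference: a line per homeownership group, one overall line, and all-red points with no legend.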

Aside: Legends

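library(tidyverse)
library(openintro) # loan50 data
library(ggthemes)  # scale_color_colorblind()

# Only the color legend is retitled, so color and shape get separate legends
ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate, color = homeownership, shape = homeownership)
) +
  geom_point() +
  scale_color_colorblind() +
  labs(color = "Home ownership")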

Aside: Legends

ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate, color = homeownership, shape = homeownership)
) +
  geom_point() +
  scale_color_colorblind() +
  # Matching titles for color and shape merge the two legends into one
  labs(
    color = "Home ownership",
    shape = "Home ownership"
  )

Time permitting: Factors

Factors

  • Factors are used for categorical variables – variables that have a fixed and known set of possible values.

  • They are also useful when you want to display character vectors in a non-alphabetical order.
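
A small sketch of non-alphabetical ordering, assuming the forcats package and a made-up vector of sizes.

```r
library(forcats)

sizes <- c("medium", "small", "large", "small")

# By default, factor levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default) # "large" "medium" "small"

# fct_relevel() puts the levels in the order you specify
f_ordered <- fct_relevel(f_default, "small", "medium", "large")
levels(f_ordered) # "small" "medium" "large"

# fct_infreq() orders levels by how often they occur, most frequent first
f_freq <- fct_infreq(f_default)
levels(f_freq)[1] # "small"
```

Plots and tables follow the level order, so releveling the factor is what reorders the bars in a bar plot.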

Bar plot

Bar plot - reordered

Frequency table

Bar plot - reordered

Under the hood

Recap: Factors

  • The forcats package has a bunch of functions (that start with fct_*()) for dealing with factors and their levels: https://forcats.tidyverse.org/reference/index.html

  • Factors and the order of their levels are relevant for displays (tables, plots) and they’ll be relevant for modeling (later in the course)

  • factor is a data class
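
To make the last point concrete, here is a small sketch with a made-up vector: a factor has class "factor", but it is stored as integer codes that point to its levels.

```r
grades <- factor(c("b", "a", "b"))

class(grades)      # "factor"
typeof(grades)     # "integer"
levels(grades)     # "a" "b"
as.integer(grades) # 2 1 2  (position of each value among the levels)
```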

Aside: ==

Aside: |

Other questions?