Midterm review
Lecture 11
Warm-up
While you wait…
Go to your ae project in RStudio
Make sure you have each piece of information we need extracted from The Chronicle opinion page (up to the "create a data frame" step) in chronicle-scrape.R.
Announcements
Midterm things:
Exam room: Bio Sci 111 or Gross Hall 107
Cheat sheet: 8.5x11, both sides, handwritten or typed, any content you want, must be prepared by you
Also bring a pencil and eraser (you’re allowed to use a pen, but you might not want to)
Reminder: Academic dishonesty / Duke Community Standard
From last time: ae-10
Opinion articles in The Chronicle
- Scrape data and organize it in a tidy format in R
- Perform light text parsing to clean data
- Summarize and visualize the data
ae-10-chronicle-scrape
Go to your ae project in RStudio.
Open chronicle-scrape.R and ae-10-chronicle-scrape.qmd.
Recap
- Use the SelectorGadget to identify tags for elements you want to grab
- Use rvest to first read the whole page (into R) and then parse the object you’ve read in to the elements you’re interested in
- Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
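The three recap steps can be sketched as follows. This is a minimal illustration, not the code from chronicle-scrape.R: the HTML snippet and the CSS classes (`.headline`, `.author`) are made up so the example runs offline, and `minimal_html()` stands in for calling `read_html()` on the real page.

```r
library(rvest)
library(tibble)

# A tiny stand-in page (hypothetical markup) so the example runs offline;
# for a real page you would use read_html() with the page's URL instead.
page <- minimal_html('
  <div class="article">
    <a class="headline" href="/a1">First column</a>
    <span class="author">A. Writer</span>
  </div>
  <div class="article">
    <a class="headline" href="/a2">Second column</a>
    <span class="author">B. Author</span>
  </div>
')

# Parse the page object into the elements of interest,
# using selectors like the ones SelectorGadget helps you find
titles  <- page |> html_elements(".headline") |> html_text2()
urls    <- page |> html_elements(".headline") |> html_attr("href")
authors <- page |> html_elements(".author") |> html_text2()

# Put the components together in a tibble
articles <- tibble(title = titles, author = authors, url = urls)
articles
```

Each selector returns one value per article, so the three vectors line up row by row in the tibble.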
A new R workflow
When working in a Quarto document, your analysis is re-run each time you render
If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice)!
An alternative workflow:
- Use an R script to save your code
- Save interim data scraped using the code in the script as CSV or RDS files
- Use the saved data in your analysis in your Quarto document
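A minimal sketch of this workflow, assuming the readr package for reading and writing CSV files. The `scraped` tibble is a stand-in for real scraping output, and the file path is hypothetical (a temporary directory is used here so the example is safe to run anywhere).

```r
library(readr)
library(tibble)

# --- In the R script (run by hand, only when you want fresh data) ---
# Stand-in for the scraped result; in practice this comes from the scraping code
scraped <- tibble(
  title  = c("First column", "Second column"),
  author = c("A. Writer", "B. Author")
)

# Hypothetical location; in a project this would be something like a data/ folder
data_path <- file.path(tempdir(), "chronicle.csv")
write_csv(scraped, data_path)

# --- In the Quarto document (runs on every render, no scraping involved) ---
chronicle <- read_csv(data_path, show_col_types = FALSE)
chronicle
```

Rendering the document now only reads a local file, so the website is not hit again each time.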
From a previous time: ae-09-age-gaps-sales-import - Part 2
ae-09-age-gaps-sales-import
Go to your ae project in RStudio.
Open ae-09-age-gaps-sales-import.qmd - Part 2.
Review: Quarto workflow
Your document environment vs. your global environment
Objects you define in your Quarto document are available in your Quarto document, and if you run that code cell individually, they will also be available in your global environment
Objects you define in your global environment are not, by default, available in your Quarto document
Recipe for success for reproducible documents
Render – after each “win” or at each stopping point
Commit – all files, with a commit message that describes the substance of what you did
Push – to make sure files are updated on GitHub as well
Questions… or…
Time permitting: Type coercion
Explicit vs. implicit type coercion
Explicit type coercion: You ask R to change the type of a variable
Implicit type coercion: R changes / makes assumptions for you about the type of a variable without you asking for it
- This happens because in a vector, you can’t have multiple types of values
Vectors
A vector is a collection of values
Atomic vectors can only contain values of the same type
Lists can contain values of different types
Why do we care? Because each column of a data frame is a vector.
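A quick base R illustration of the difference between atomic vectors and lists, and of data frame columns being vectors. The values are made up for the example.

```r
# An atomic vector holds values of a single type
nums <- c(1.5, 2, 3)
typeof(nums)      # "double"
is.atomic(nums)   # TRUE

# A list can hold values of different types side by side
mixed_list <- list(1.5, "a", TRUE)
list_types <- sapply(mixed_list, typeof)
list_types        # "double" "character" "logical"

# Each column of a data frame is itself a vector
df <- data.frame(x = c(1.5, 2), y = c("a", "b"), stringsAsFactors = FALSE)
col_types <- sapply(df, typeof)
col_types         # x: "double", y: "character"
```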
Explicit coercion
✅ From numeric to character
Explicit coercion
❌ From character to numeric
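The code from these two slides is not shown here, so the following is only a sketch of the kind of example they describe, with made-up values. Going from numeric to character always works; going from character to numeric only works for strings that look like numbers.

```r
# Numeric to character: every number has a text representation
x <- c(1, 2, 3)
x_chr <- as.character(x)
x_chr   # "1" "2" "3"

# Character to numeric: strings that are not numbers become NA,
# and R warns that NAs were introduced by coercion
y <- c("1", "2", "three")
y_num <- as.numeric(y)
y_num   # 1 2 NA
```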
Implicit coercion
Which of the column types were implicitly coerced?
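The data from this slide is not shown here. As a stand-in, this sketch shows how `c()` implicitly coerces mixed values to the most flexible type present (logical, then integer, then double, then character).

```r
# Integer + double -> double
t1 <- typeof(c(1L, 2.5))    # "double"

# Logical + integer -> integer (TRUE becomes 1L)
t2 <- typeof(c(TRUE, 2L))   # "integer"

# Anything + character -> character
t3 <- typeof(c(1, "a"))     # "character"
t4 <- typeof(c(TRUE, "a"))  # "character"

# One stray text value is enough to turn a whole column into character
credits <- c(4, 4.5, "I'm not sure yet")
typeof(credits)             # "character"
```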
Collecting data
Suppose you conduct a survey and ask students their student ID number and number of credits they’re taking this semester. What is the type of each variable?
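The definition of `survey_raw` did not survive in these notes. The following is a reconstruction inferred from the cleaned output shown later, so treat the exact values as an assumption rather than the original data.

```r
library(tibble)

# Reconstructed (assumed) raw survey responses: each column mixes
# numbers with free-text answers, so c() coerces both to character
survey_raw <- tibble(
  student_id = c(273674, 298765, 287129, "I don't remember"),
  n_credits  = c(4, 4.5, "I'm not sure yet", "2 - underloading")
)
survey_raw
```

Both columns come out as `<chr>`, even though you would want `n_credits` to be numeric.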
Cleaning data
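library(dplyr)

survey <- survey_raw |>
  mutate(
    # replace the free-text non-answer with a missing value
    student_id = if_else(student_id == "I don't remember", NA, student_id),
    # clean up free-text answers while the column is still character
    n_credits = case_when(
      n_credits == "I'm not sure yet" ~ NA,
      n_credits == "2 - underloading" ~ "2",
      .default = n_credits
    ),
    # then explicitly coerce to numeric
    n_credits = as.numeric(n_credits)
  )
survey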
# A tibble: 4 × 2
  student_id n_credits
  <chr>          <dbl>
1 273674           4
2 298765           4.5
3 287129          NA
4 <NA>             2
Cleaning data – alternative
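library(dplyr)
library(readr)

survey <- survey_raw |>
  mutate(
    # parse_number() keeps the first number it finds and returns NA (with a warning) if there is none
    student_id = parse_number(student_id),
    n_credits = parse_number(n_credits)
  )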
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `student_id = parse_number(student_id)`.
Caused by warning:
! 1 parsing failure.
  row col expected actual
    4  -- a number I don't remember
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# A tibble: 4 × 2
  student_id n_credits
       <dbl>     <dbl>
1     273674       4
2     298765       4.5
3     287129      NA
4         NA       2
Recap: Type coercion
If variables in a data frame have multiple types of values, R will coerce them into a single type, which may or may not be what you want.
If what R does by default is not what you want, you can use explicit coercion functions like as.numeric(), as.character(), etc. to turn them into the types you want them to be, which will generally also involve cleaning up the features of the data that caused the unwanted implicit coercion in the first place.
Time permitting: Aesthetic mappings
openintro::loan50
Aesthetic mappings
What will the following code result in?
Global mappings
What will the following code result in?
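The slide's code is not shown here. As a sketch of a global mapping, the example below assumes the `annual_income`, `interest_rate`, and `homeownership` columns of `loan50`; the choice of variables and geoms is illustrative.

```r
library(ggplot2)
library(openintro)

# color is mapped in ggplot(), so it applies to every geom that understands color
p_global <- ggplot(
  loan50,
  aes(x = annual_income, y = interest_rate, color = homeownership)
) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_global
```

Because the color mapping is global, both the points and the fitted lines are colored (and the lines are fit separately) by homeownership.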
Local mappings
What will the following code result in?
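The slide's code is not shown here. As a sketch of a local mapping, the example below again assumes the `annual_income`, `interest_rate`, and `homeownership` columns of `loan50`.

```r
library(ggplot2)
library(openintro)

# color is mapped inside geom_point() only, so geom_smooth() does not see it
p_local <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(aes(color = homeownership)) +
  geom_smooth(method = "lm", se = FALSE)

p_local
```

Here the points are colored by homeownership, but there is a single fitted line for all of the data.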
Mapping vs. setting
What will the following code result in?
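The slide's code is not shown here. A sketch of the distinction, with illustrative variable choices from `loan50`: mapping ties an aesthetic to a variable inside `aes()`, while setting gives the aesthetic a fixed value outside `aes()`.

```r
library(ggplot2)
library(openintro)

# Mapping: color depends on the value of a variable, and a legend is drawn
p_mapped <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(aes(color = homeownership))

# Setting: every point gets the same fixed color, and no legend is needed
p_set <- ggplot(loan50, aes(x = annual_income, y = interest_rate)) +
  geom_point(color = "red")
```

A common mistake is writing `aes(color = "red")`, which maps color to a constant string labelled "red" instead of setting the color to red.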
Recap: Aesthetic mappings
Aesthetic mappings defined at the global level will be used by all geoms for which the aesthetic is defined.
Aesthetic mappings defined at the local level will be used only by the geoms they’re defined for.
Aside: Legends
Time permitting: Factors
Factors
Factors are used for categorical variables – variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
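A small sketch of controlling display order with factors, using made-up values and two forcats functions (`fct_infreq()` and `fct_relevel()`).

```r
library(forcats)

homeownership <- factor(c("rent", "mortgage", "rent", "own", "mortgage", "rent"))

# By default, levels are in alphabetical order
default_levels <- levels(homeownership)
default_levels   # "mortgage" "own" "rent"

# Order levels by how often they occur (most frequent first)
by_freq <- fct_infreq(homeownership)
levels(by_freq)  # "rent" "mortgage" "own"

# Or move chosen levels to the front by hand
custom <- fct_relevel(homeownership, "own", "rent")
levels(custom)   # "own" "rent" "mortgage"
```

Tables and bar plots follow the level order, so reordering the levels reorders the display.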
Bar plot
Bar plot - reordered
Frequency table
Bar plot - reordered
Under the hood
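The slide's code is not shown here. A base R sketch of what a factor looks like under the hood, with made-up values: integer codes plus a set of levels.

```r
f <- factor(c("rent", "own", "rent", "mortgage"))

f_class  <- class(f)       # "factor"
f_type   <- typeof(f)      # "integer"
f_levels <- levels(f)      # "mortgage" "own" "rent"
f_codes  <- as.integer(f)  # 3 2 3 1: each value's position in the levels
```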
Recap: Factors
The forcats package has a bunch of functions (that start with fct_*()) for dealing with factors and their levels: https://forcats.tidyverse.org/reference/index.html
Factors and the order of their levels are relevant for displays (tables, plots) and they’ll be relevant for modeling (later in the course)
factor is a data class