Lecture 11
Duke University
STA 199 - Fall 2024
October 3, 2024
Go to your ae project in RStudio
Make sure you have each piece of information we need extracted from The Chronicle opinion page, up to the step that creates a data frame, in chronicle-scrape.R.
Midterm things:
Exam room: Bio Sci 111 or Gross Hall 107
Cheat sheet: 8.5x11, both sides, hand written or typed, any content you want, must be prepared by you
Also bring a pencil and eraser (you’re allowed to use a pen, but you might not want to)
Reminder: Academic dishonesty / Duke Community Standard
ae-10
ae-10-chronicle-scrape
Go to your ae project in RStudio.
Open chronicle-scrape.R and ae-10-chronicle-scrape.qmd.
When working in a Quarto document, your analysis is re-run each time you render.
If you scrape the web in a Quarto document, you re-scrape the data on every render, which is undesirable (and not nice to the website)!
An alternative workflow:
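A minimal sketch of one such workflow, assuming the scraping lives in an R script that saves its result to a file and the Quarto document only reads that file. The data frame contents, column names, and file path below are hypothetical stand-ins for what chronicle-scrape.R would produce; no actual scraping happens here.

```r
library(readr)

# --- In the R script (run once, by hand) ---
# Placeholder for the scraped result; the real script would build this
# data frame from the web page instead.
chronicle <- data.frame(
  title = c("Example column A", "Example column B"),
  author = c("Author One", "Author Two")
)

# Save the data so it does not need to be scraped again
data_path <- file.path(tempdir(), "chronicle.csv")  # hypothetical path
write_csv(chronicle, data_path)

# --- In the Quarto document (runs on every render) ---
# Read the saved file instead of requesting the web page again
chronicle_saved <- read_csv(data_path, show_col_types = FALSE)
```

With this split, rendering the document only reads a local file, so the website is contacted only when the script is run on purpose.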
ae-09-age-gaps-sales-import - Part 2
Go to your ae project in RStudio.
Open ae-09-age-gaps-sales-import.qmd - Part 2.
Objects you define in your Quarto document are available in your Quarto document, and if you run a code cell individually, they are also available in your global environment
Objects you define in your global environment are not, by default, available in your Quarto document
Render – after each “win” or at each stopping point
Commit – all files, with a commit message that describes the substance of what you did
Push – to make sure files are updated on GitHub as well
Explicit type coercion: You ask R to change the type of a variable
Implicit type coercion: R changes / makes assumptions for you about the type of a variable without you asking for it
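A small base R sketch of the two kinds of coercion; the vectors are made up for illustration.

```r
# Explicit coercion: we ask R to change the type
x <- c("1", "2", "3")
typeof(x)          # "character"
x_num <- as.numeric(x)
typeof(x_num)      # "double"

# Implicit coercion: R changes the type without being asked
y <- c(TRUE, FALSE, TRUE)
total <- sum(y)    # logical values are treated as 1 and 0
z <- c(1L, 2.5)    # the integer is promoted to double
typeof(z)          # "double"
```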
A vector is a collection of values
Atomic vectors can only contain values of the same type
Lists can contain values of different types
Why do we care? Because each column of a data frame is a vector.
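To make the distinction concrete, here is a short base R sketch with made-up values; it also shows that each column of a data frame has a single type.

```r
# An atomic vector holds values of one type
v <- c(1.5, 2, 3)
typeof(v)                    # "double"

# A list can hold values of different types
l <- list(1.5, "two", TRUE)
element_types <- sapply(l, typeof)
element_types                # "double" "character" "logical"

# Each column of a data frame is a vector with its own single type
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
typeof(df$id)                # "integer"
typeof(df$name)              # "character"
```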
✅ From numeric to character
❌ From character to numeric
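One way to read the two lines above, offered as an interpretation rather than the slide's own example: when types are mixed, R implicitly turns numbers into text, but it does not implicitly turn text into numbers.

```r
# Numeric to character happens implicitly when types are mixed
mixed <- c(1, 2, "three")
typeof(mixed)   # "character"

# Character to numeric does not happen implicitly: arithmetic on text is an error
result <- tryCatch(
  "1" + 1,
  error = function(e) conditionMessage(e)
)
result          # holds the error message instead of a number
```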
Which of the column types were implicitly coerced?
Suppose you conduct a survey and ask students their student ID number and number of credits they’re taking this semester. What is the type of each variable?
survey <- survey_raw |>
  mutate(
    student_id = if_else(student_id == "I don't remember", NA, student_id),
    n_credits = case_when(
      n_credits == "I'm not sure yet" ~ NA,
      n_credits == "2 - underloading" ~ "2",
      .default = n_credits
    ),
    n_credits = as.numeric(n_credits)
  )
survey
# A tibble: 4 × 2
  student_id n_credits
  <chr>          <dbl>
1 273674           4
2 298765           4.5
3 287129          NA
4 <NA>             2
survey <- survey_raw |>
  mutate(
    student_id = parse_number(student_id),
    n_credits = parse_number(n_credits)
  )
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `student_id = parse_number(student_id)`.
Caused by warning:
! 1 parsing failure.
row col expected actual
4 -- a number I don't remember
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# A tibble: 4 × 2
  student_id n_credits
       <dbl>     <dbl>
1     273674       4
2     298765       4.5
3     287129      NA
4         NA       2
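The code above uses survey_raw without showing how it was built. A minimal sketch, assuming the raw responses looked like the values implied by the printed results (this definition is a reconstruction, not taken from the source):

```r
library(dplyr)

# Hypothetical raw data: one text answer in each column
survey_raw <- tibble(
  student_id = c(273674, 298765, 287129, "I don't remember"),
  n_credits = c(4, 4.5, "I'm not sure yet", "2 - underloading")
)

# The single text answer makes c() store every value in the column as character
column_types <- sapply(survey_raw, typeof)
column_types

# Converting without cleaning first loses the "2 - underloading" response
naive_credits <- suppressWarnings(as.numeric(survey_raw$n_credits))
naive_credits
```

This is why the first approach recodes "2 - underloading" to "2" before calling as.numeric(), while parse_number() extracts the leading number on its own.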
If variables in a data frame have multiple types of values, R will coerce them into a single type, which may or may not be what you want.
If what R does by default is not what you want, you can use explicit coercion functions like as.numeric(), as.character(), etc. to turn them into the types you want them to be, which will generally also involve cleaning up the features of the data that caused the unwanted implicit coercion in the first place.
openintro::loan50
What will the following code result in?
Aesthetic mapping defined at the global level will be used by all geoms for which the aesthetic is defined.
Aesthetic mapping defined at the local level will be used only by the geoms they’re defined for.
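A short sketch of the difference, using ggplot2's built-in mpg data as a stand-in (the dataset and variable choices are assumptions, not the examples from the lecture):

```r
library(ggplot2)

# Global mapping: color is set in ggplot(), so both geoms use it
# and geom_smooth() draws one line per drv group
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: color is set only in geom_point(),
# so geom_smooth() ignores it and draws a single line
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE)
```

Printing p_global and p_local side by side shows the same colored points in both, with the number of fitted lines being the only difference.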
Factors are used for categorical variables – variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
The forcats package has a bunch of functions (that start with fct_*()) for dealing with factors and their levels: https://forcats.tidyverse.org/reference/index.html
Factors and the order of their levels are relevant for displays (tables, plots) and they’ll be relevant for modeling (later in the course)
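As an illustration with made-up data, the sketch below shows the default alphabetical level order and one way to change it with fct_relevel() from forcats.

```r
library(forcats)

shirt_sizes <- c("medium", "small", "large", "small")

# By default, factor levels are sorted alphabetically
shirt_fct <- factor(shirt_sizes)
levels(shirt_fct)       # "large" "medium" "small"

# fct_relevel() puts the levels in the order we specify
shirt_ordered <- fct_relevel(shirt_fct, "small", "medium", "large")
levels(shirt_ordered)   # "small" "medium" "large"
```

Tables and plots built from shirt_ordered would list the sizes from small to large instead of alphabetically.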
factor is a data class
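A brief base R sketch of what that means, with made-up values: a factor has class "factor", but underneath it is stored as integer codes that point to its levels.

```r
answers <- factor(c("yes", "no", "yes"))

class(answers)       # "factor"
typeof(answers)      # "integer"
levels(answers)      # "no" "yes"
as.integer(answers)  # 2 1 2
```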