Lab 8
Everything so far II
Project next steps
Before diving into the lab, let’s first check in on your project’s next steps.
- Before Monday, Nov 25 - Between Milestones 3 and 4:
  - Read the instructions for Milestone 4 and Milestone 5.
  - Bare minimum: Make substantial progress on your project write-up: finish your introduction and exploratory data analysis (plots/summary statistics + their interpretations) and write up the methods you plan to use.
  - Ideal: Start implementing the methods and get closer to answering your research question.
- On Monday, Nov 25 - Milestone 4:
  - Peer review of others’ projects in the lab.
  - You must be in the lab to participate in the peer review.
  - If you cannot be there physically due to Thanksgiving travel, make arrangements to Zoom in with the rest of your team who are in lab.
- After Monday, Nov 25: Work on your write-up and presentation, incorporating feedback from peers and meeting with TAs/instructor as needed during office hours for further feedback.
Introduction
In this lab, you’ll continue building and evaluating predictive models and performing statistical inferences.
Additionally, you’ll revisit two questions of your choice from previous assignments/assessments.
Learning objectives
By the end of the lab, you will…
- Train models on training data and test them on testing data to evaluate and compare their performances.
- Perform statistical inference with bootstrapping and hypothesis testing.
- Revisit two questions from previous assignments/assessments.
And, as usual, you will also…
- Get more experience with data science workflow using R, RStudio, Git, and GitHub
- Further your reproducible authoring skills with Quarto
- Improve your familiarity with version control using Git and GitHub
This lab is due on Tue, November 26 at 10:30 pm, i.e., right before Thanksgiving break begins.
Getting started
Log in to RStudio, clone your lab-8 repo from GitHub, open your lab-8.qmd document, and get started!
Packages
In this lab, we will work with the
- tidyverse package for doing data analysis in a “tidy” way and
- tidymodels package for modeling in a “tidy” way.
- Run the code cell by clicking on the green triangle (play) button for the code cell labeled load-packages. This loads the packages so that their features (the functions and datasets in them) are accessible from your Console.
- Then, render the document that loads these packages to make their features (the functions and datasets in them) available for other code cells in your Quarto document.
Guidelines
As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.
Additionally, code should follow the tidyverse style. Particularly,
- there should be spaces before and line breaks after each + when building a ggplot,
- there should also be spaces before and line breaks after each |> in a data transformation pipeline,
- code should be properly indented,
- there should be spaces around = signs and spaces after commas.
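As a minimal sketch of these style rules, the example below summarizes and plots the mpg dataset that ships with ggplot2. The dataset, the object names (class_summary, mileage_plot), and the labels are illustrative choices only and are not part of the lab.

```r
library(tidyverse)

# Pipeline: a space before each |> and a line break after it,
# spaces around = and after commas, and consistent indentation
class_summary <- mpg |>
  group_by(class) |>
  summarize(mean_hwy = mean(hwy), n = n()) |>
  arrange(desc(mean_hwy))

# Plot: a space before each + and a line break after it,
# with an informative title and human-readable axis labels
mileage_plot <- ggplot(
  class_summary,
  aes(x = mean_hwy, y = fct_reorder(class, mean_hwy))
) +
  geom_col() +
  labs(
    title = "Average highway mileage by vehicle class",
    x = "Mean highway mileage (miles per gallon)",
    y = "Vehicle class"
  )

mileage_plot
```

Breaking the long ggplot() call across lines also keeps the code from running off the page in the PDF.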
Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
You are also expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), using inconsistent syntax (e.g., ! to say “not” in one place and - in another place), etc.
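To make the idea of a redundant step concrete, here is a small sketch on the built-in mtcars data. The data and the object names (smelly, clean) are illustrative assumptions; the point is only that both pipelines give the same summary, and the second does so with a single grouping step.

```r
library(dplyr)

# Smelly: groups by cyl, ungroups, then groups by cyl again
smelly <- mtcars |>
  group_by(cyl) |>
  mutate(mean_mpg = mean(mpg)) |>
  ungroup() |>
  group_by(cyl) |>
  summarize(mean_mpg = first(mean_mpg), n = n())

# Cleaner: one grouping step produces the same summary
clean <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())
```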
Part 1: Hotel cancellations (again)
For this exercise, we will work with hotel cancellations data from the previous lab.
The dataset, called hotels.csv, can be found in the data folder. Recall that the data describes the demand for two different types of hotels. Each observation represents a hotel booking between July 1, 2015, and August 31, 2017. Some bookings were canceled (is_canceled = 1), and others were kept, i.e., the guests checked into the hotel (is_canceled = 0). You can learn more about the data at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11 and a data dictionary has been included in the README of the data folder of your repository.
Question 1
- Read the data in with read_csv().
- Transform the is_canceled variable to be a factor with “not canceled” (0) as the first level and “canceled” (1) as the second level. Confirm that you were able to successfully transform the variable by running levels(hotel$is_canceled).
- Split the data into training (75%) and testing (25%) sets. Report the number of rows in the testing and training sets.
- Fit a model to the training data predicting whether a reservation is canceled from hotel type (Resort or City Hotel) and at least two other predictors. Provide a tidy output of the model fit.
- Make predictions for the testing data. Then, in a single pipeline, calculate the false positive and negative rates. Make sure that your answer makes it clear which rates are which.
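The split, fit, predict, and error-rate workflow can be sketched on simulated data as follows. Everything here is an assumption made for illustration: the toy data, the variable names (x1, x2, outcome), the seed, and the choice to compute the rates by counting rather than with a dedicated metric function. It is one possible shape for such a pipeline, not the expected answer for the hotels data.

```r
library(tidymodels)

set.seed(123) # arbitrary seed, used only so the sketch is reproducible

# Simulated stand-in data: "no" is the first level, "yes" the second
toy <- tibble(
  x1 = rnorm(400),
  x2 = rnorm(400),
  outcome = factor(
    if_else(x1 + 0.5 * x2 + rnorm(400) > 0, "yes", "no"),
    levels = c("no", "yes")
  )
)

# 75% training, 25% testing
toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Logistic regression fit on the training data
toy_fit <- logistic_reg() |>
  fit(outcome ~ x1 + x2, data = toy_train)

tidy(toy_fit)

# augment() adds .pred_class and class probabilities to the testing data
toy_aug <- augment(toy_fit, new_data = toy_test)

# Single pipeline: count truth/prediction pairs, convert counts to
# proportions within each true class, and label the two error rates
error_rates <- toy_aug |>
  count(outcome, .pred_class) |>
  group_by(outcome) |>
  mutate(rate = n / sum(n)) |>
  ungroup() |>
  filter(outcome != .pred_class) |>
  mutate(
    type = if_else(outcome == "no", "false positive", "false negative")
  )

error_rates
```

Labeling each row with its type is one way to make it clear which rate is which.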
Question 2
Fit another model to the training data predicting whether a reservation is canceled from the variables you used in Question 1 plus one other predictor. State the new predictor and why you chose it.
Make predictions for the testing data. Then, in a single pipeline, calculate the false positive and negative rates. Make sure that your answer makes it clear which rates are which.
Question 3
- What was the name you used for the model fit object for the model you created in Question 1? What about in Question 2?
- Plot the ROC curves of the models from Question 1 and Question 2 on the same plot, using different colors for each model and a legend that describes which model is represented with which color.
  Tip: Since the level of is_canceled we’re predicting is the second level, we need to set event_level = "second" when calculating the ROC curve.
- Calculate the AUC (area under the curve) for each model using the roc_auc() function. You can review its documentation to figure out how this function works.
- Based on your findings so far, articulate which of the two models is preferable for predicting hotel cancellations and why.
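A rough sketch of comparing two models by ROC curve and AUC is shown below, again on simulated data. The data, the object names (fit_small, fit_large, roc_both), and the labels are hypothetical; only roc_curve(), roc_auc(), and the event_level = "second" setting come from the text above.

```r
library(tidymodels)

set.seed(123) # arbitrary seed for reproducibility

# Simulated stand-in data: "yes" is the second level of the outcome
toy <- tibble(
  x1 = rnorm(400),
  x2 = rnorm(400),
  outcome = factor(
    if_else(x1 + 0.5 * x2 + rnorm(400) > 0, "yes", "no"),
    levels = c("no", "yes")
  )
)

toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Two nested models: the second adds one predictor to the first
fit_small <- logistic_reg() |>
  fit(outcome ~ x1, data = toy_train)
fit_large <- logistic_reg() |>
  fit(outcome ~ x1 + x2, data = toy_train)

aug_small <- augment(fit_small, new_data = toy_test)
aug_large <- augment(fit_large, new_data = toy_test)

# ROC curves, with the second level ("yes") treated as the event
roc_both <- bind_rows(
  roc_curve(aug_small, truth = outcome, .pred_yes, event_level = "second") |>
    mutate(model = "x1 only"),
  roc_curve(aug_large, truth = outcome, .pred_yes, event_level = "second") |>
    mutate(model = "x1 + x2")
)

roc_plot <- ggplot(
  roc_both,
  aes(x = 1 - specificity, y = sensitivity, color = model)
) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  labs(
    title = "ROC curves for two toy models",
    x = "False positive rate (1 - specificity)",
    y = "True positive rate (sensitivity)",
    color = "Model"
  )

# Area under each curve, with the same event level
auc_small <- roc_auc(
  aug_small,
  truth = outcome, .pred_yes, event_level = "second"
)
auc_large <- roc_auc(
  aug_large,
  truth = outcome, .pred_yes, event_level = "second"
)
```

Using the same event_level in both roc_curve() and roc_auc() keeps the plot and the AUC values consistent with each other.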
Part 2: Lemurs
In this part, you’ll work with data from the Duke Lemur Center, which houses over 200 lemurs across 14 species – the most diverse population of lemurs on Earth, outside their native Madagascar.
Lemurs are the most threatened group of mammals on the planet, and 95% of lemur species are at risk of extinction. Our mission is to learn everything we can about lemurs – because the more we learn, the better we can work to save them from extinction. They are endemic only to Madagascar, so it’s essentially a one-shot deal: once lemurs are gone from Madagascar, they are gone from the wild.
By studying the variables that most affect their health, reproduction, and social dynamics, the Duke Lemur Center learns how to most effectively focus their conservation efforts. And the more we learn about lemurs, the better we can educate the public around the world about just how amazing these animals are, why they need to be protected, and how each and every one of us can make a difference in their survival.
Source: TidyTuesday
While the TidyTuesday project used the full dataset, you’ll work with a subset. The dataset, called lemurs.csv, can be found in the data folder. You can learn more about the data at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-24. A data dictionary has been included in the README of the data folder of your repository.
Question 4
Load the lemurs data from your data folder and save it as lemurs. Then, report which “types” of lemurs are represented in the sample and how many of each. Note that this information is in the taxon variable. You should refer back to the linked data dictionary to understand what the different values of taxon mean. Your response should be a tibble with at least three columns, taxon, taxon_name (a new variable you create that contains the description of the taxon, e.g., EMON is Mongoose lemur), and n (number of lemurs with that taxon).
Question 5
What is the slope of the regression line for predicting weights of lemurs (weight_g) from the ages of lemurs (in years) when their weight was measured (age_at_wt_y)? Calculate and interpret a 95% bootstrap confidence interval. Also report your point estimate. Don’t forget to set a seed and use 1,000 bootstrap samples (reps = 1000) when simulating your bootstrap distribution.
Below is a step-by-step recipe for constructing and visualizing a confidence interval. The code snippets shown are not “complete.” They include some blanks you need to fill in, and they are just intended to guide you in the right direction.
- Step 1: Calculate the point estimate for the slope and the intercept of the regression line.

  obs_fit <- lemurs |>
    specify(weight_g ~ age_at_wt_y) |>
    fit()

- Step 2: Simulate a bootstrap distribution.

  set.seed(___)
  boot_dist <- lemurs |>
    specify(weight_g ~ age_at_wt_y) |>
    generate(reps = ____, type = "bootstrap") |>
    fit()

- Step 3: Calculate the bounds of the confidence interval.

  conf_ints <- get_confidence_interval(
    boot_dist,
    level = ___,
    point_estimate = obs_fit
  )
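For reference, the same recipe can be run end to end on a different dataset. The sketch below uses the built-in mtcars data (mpg predicted from wt) as a stand-in, with an arbitrary seed; the final visualization step is an assumption about how you might display the interval, based on the visualize() and shade_confidence_interval() functions in the infer package.

```r
library(tidymodels)

# Step 1: observed intercept and slope
obs_fit <- mtcars |>
  specify(mpg ~ wt) |>
  fit()

# Step 2: bootstrap distribution of the fitted coefficients
set.seed(2024) # arbitrary seed
boot_dist <- mtcars |>
  specify(mpg ~ wt) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# Step 3: 95% interval for each term
conf_ints <- get_confidence_interval(
  boot_dist,
  level = 0.95,
  point_estimate = obs_fit
)

conf_ints

# Visualize the bootstrap distributions with the intervals shaded
ci_plot <- visualize(boot_dist) +
  shade_confidence_interval(endpoints = conf_ints)
```

The result has one row per model term, so the slope interval is the row whose term matches the predictor.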
Question 6
What are the slopes of the regression line for predicting weights of lemurs (weight_g) from the ages of lemurs (in years) when their weight was measured (age_at_wt_y) and their types (taxon)? Calculate and interpret 95% bootstrap confidence intervals. Also report your point estimates. Don’t forget to set a seed and use 1,000 bootstrap samples (reps = 1000) when simulating your bootstrap distribution.
Question 7
What is the median weight of red-bellied lemurs? What is the median weight of ring-tailed lemurs? What is the median weight of mongoose lemurs? Calculate and interpret 95% bootstrap confidence intervals. Also report your point estimates. Don’t forget to set a seed and use 1,000 bootstrap samples (reps = 1000) when simulating your bootstrap distribution.
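A bootstrap interval for a single group's median might look like the sketch below. It uses the built-in iris data (sepal length of one species) as a stand-in, with an arbitrary seed and a percentile interval; the dataset, object names, and interval type are assumptions for illustration. The same steps would be repeated for each group of interest.

```r
library(tidymodels)

# Keep only one group, analogous to keeping one type of lemur
setosa <- iris |>
  filter(Species == "setosa")

# Point estimate: observed sample median
obs_median <- setosa |>
  specify(response = Sepal.Length) |>
  calculate(stat = "median")

# Bootstrap distribution of the sample median
set.seed(2024) # arbitrary seed
boot_medians <- setosa |>
  specify(response = Sepal.Length) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "median")

# 95% percentile interval
median_ci <- get_confidence_interval(
  boot_medians,
  level = 0.95,
  type = "percentile"
)

median_ci
```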
Question 8
Do female lemurs differ in weight from male lemurs, on average?
a. Conduct a hypothesis test to answer this question at 5% significance level. Clearly state your hypotheses in the context of the data and the research question, simulate a randomization distribution, find the p-value, and make a decision on your hypotheses based on this p-value. Provide a one-sentence conclusion for your hypothesis test in the context of the data and the research question. Don’t forget to set a seed and use 1,000 randomization samples (reps = 1000) when simulating your randomization distribution.
b. Based on your answer to Part (a), would you expect a 95% confidence interval for the difference in means of female and male lemurs to include 0? Explain your reasoning.
c. Construct and interpret a 95% bootstrap confidence interval for the difference in means of female and male lemurs. Does it include 0? Does this align with your answer to Part (b)? Don’t forget to set a seed and use 1,000 bootstrap samples (reps = 1000) when simulating your bootstrap distribution.
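The pairing of a randomization test with a bootstrap interval for a difference in means can be sketched as follows. The example compares fuel efficiency of manual and automatic cars in the built-in mtcars data; the dataset, the object names, the seed, and the order of subtraction are all illustrative assumptions.

```r
library(tidymodels)

# Stand-in data: a numeric response and a two-level grouping variable
toy_cars <- mtcars |>
  mutate(
    transmission = factor(if_else(am == 1, "manual", "automatic"))
  )

# Observed difference in means (manual minus automatic)
obs_diff <- toy_cars |>
  specify(mpg ~ transmission) |>
  calculate(stat = "diff in means", order = c("manual", "automatic"))

# Randomization distribution under the null of no difference
set.seed(2024) # arbitrary seed
null_dist <- toy_cars |>
  specify(mpg ~ transmission) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("manual", "automatic"))

# Two-sided p-value, since the question asks about any difference
p_value <- get_p_value(
  null_dist,
  obs_stat = obs_diff,
  direction = "two-sided"
)

# Bootstrap distribution and 95% percentile interval for the difference
set.seed(2024)
boot_dist <- toy_cars |>
  specify(mpg ~ transmission) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("manual", "automatic"))

diff_ci <- get_confidence_interval(
  boot_dist,
  level = 0.95,
  type = "percentile"
)
```

When a two-sided test at the 5% level rejects the null hypothesis, a 95% interval for the same difference usually excludes 0. The randomization test and the bootstrap interval are different procedures, though, so the two can occasionally disagree in borderline cases.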
Part 3: Second chances
In this part, you get to revisit work you’ve turned in before that can be improved. For Question 9 you’ll pick a question from a previous lab and for Question 10 a question from your take-home midterm, and improve them.
Some of these improvements might be “fixes” for things that were pointed out as missing or incorrect. Some of them might be things you hoped to do before the deadline but didn’t get a chance to. Some notes for completing this exercise:
- You might need to add your data from your lab/midterm to the data/ folder in this assignment. You do not need to also add a data dictionary.
- You will need to copy over any code needed to clean/prepare your data for this specific question. You can reuse code from your previous assignment, but note that we will re-evaluate your code as part of the grading for this exercise. This means we might catch something wrong with it that we didn’t catch before, so if you spot any errors, make sure to fix them.
- Don’t worry about being critical of your own work. Even if you lost no points on the question, if you think it can be improved, articulate how and why. We will not apply additional penalties for any mistakes you might point out that we didn’t catch when grading your lab/midterm. There’s no risk of being critical!
Question 9
Choose one question from a previous lab that you received lots of feedback on and/or lost many points on and improve it. First, write one sentence reminding us of the specific question and a few sentences on why you chose this specific question to improve and how you plan to improve it. Then, improve it.
Question 10
Choose one question from the take-home midterm that you received lots of feedback on and/or lost many points on and improve it. First, write one sentence reminding us of the specific question and a few sentences on why you chose this specific question to improve and how you plan to improve it. Then, improve it.