Lab 7
Explore and classify
Introduction
In this lab you’ll start your practice of statistical modeling. You’ll fit models for classification, interpret model output, and make decisions about your data and research question based on the model results. You’ll wrap up the lab by sharing a statistics and/or data science experience.
This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.
Learning objectives
By the end of the lab, you will…
- Fit and interpret logistic regression models
- Understand the difference between odds, log odds, and probability of an event
- Make classifications based on the predicted probability of an event
- See statistics and/or data science in action in the real world
And, as usual, you will also…
- Get more experience with data science workflow using R, RStudio, Git, and GitHub
- Further your reproducible authoring skills with Quarto
- Improve your familiarity with version control using Git and GitHub
Getting started
Log in to RStudio, clone your `lab-7` repo from GitHub, open your `lab-7.qmd` document, and get started!
Step 1: Log in to RStudio
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
- Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.
Step 2: Clone the repo & start a new RStudio project
Go to the course organization on GitHub at github.com/sta199-f24. Click on the repo with the prefix lab-7. It contains the starter documents you need to complete the lab.
Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-7.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.
Step 3: Update the YAML
In `lab-7.qmd`, update the `author` field to your name, render your document, and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Finally, push the changes to your GitHub repository, and confirm in your browser that these changes have indeed propagated to your repository.
If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.
Packages
In this lab, we will work with the following packages:

- tidyverse package for doing data analysis in a “tidy” way,
- tidymodels package for modeling in a “tidy” way, and
- openintro package for the dataset used in Part 1.

Run the code cell labeled `load-packages` by clicking on its green triangle (play) button. This loads the packages so that their features (the functions and datasets in them) are accessible from your Console. Then, render the document; rendering also loads these packages, making their features available to the other code cells in your Quarto document.
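For reference, the `load-packages` cell in the starter document should look something like the following, based on the package list above:

```r
# Load the packages used throughout this lab
library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # tidy modeling workflow
library(openintro)   # the email dataset for Part 1
```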
Guidelines
As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.
Additionally, code should follow the tidyverse style. In particular:

- there should be spaces before and line breaks after each `+` when building a ggplot,
- there should be spaces before and line breaks after each `|>` in a data transformation pipeline,
- code should be properly indented, and
- there should be spaces around `=` signs and spaces after commas.
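For example, a pipeline and a plot following these conventions might look like the sketch below (using the `email` data from Part 1 purely for illustration):

```r
# Line breaks after each |>, spaces around = and after commas
email |>
  count(spam) |>
  mutate(prop = n / sum(n))

# Line breaks after each +, human-readable labels
ggplot(email, aes(x = dollar)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Dollar signs per email",
    x = "Number of times $ or \"dollar\" appears",
    y = "Number of emails"
  )
```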
Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
Continuing to develop a sound workflow for reproducible data analysis is important as you complete this lab and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
You are also expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), inconsistent syntax (e.g., using `!` to say “not” in one place and `-` in another), etc.
Part 1 - Building a spam filter
The data come from incoming emails in David Diez’s (one of the authors of the OpenIntro textbooks) Gmail account for the first three months of 2012. All personally identifiable information has been removed. The dataset is called `email` and it’s in the openintro package.

The outcome variable is `spam`, which takes the value `1` if the email is spam and `0` otherwise.
Question 1
1. What type of variable is `spam`? What percent of the emails are spam?

2. What type of variable is `dollar`, the number of times a dollar sign or the word “dollar” appeared in the email? Visualize and describe its distribution, supporting your description with the appropriate summary statistics.

3. Fit a logistic regression model predicting `spam` from `dollar`. Then, display the tidy output of the model.

4. Using this model and the `predict()` function, predict the probability the email is spam if it contains 5 dollar signs. Based on this probability, how does the model classify this email?

Note: To obtain the predicted probability, you can set the `type` argument in `predict()` to `"prob"`.
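A minimal sketch of the model fitting and prediction steps in Question 1 (not a full solution; depending on your version of openintro, you may first need to convert `spam` to a factor for tidymodels):

```r
# Convert the outcome to a factor if it isn't one already
email <- email |>
  mutate(spam = as.factor(spam))

# Fit the logistic regression model spam ~ dollar
spam_fit <- logistic_reg() |>
  fit(spam ~ dollar, data = email)

tidy(spam_fit)

# Predicted probabilities for a hypothetical email with 5 dollar signs
predict(spam_fit, new_data = tibble(dollar = 5), type = "prob")
```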
Question 2
1. Fit another logistic regression model predicting `spam` from `dollar`, `winner` (indicating whether the word “winner” appeared in the email), and `urgent_subj` (indicating whether the word “urgent” is in the subject of the email). Then, display the tidy output of the model.

2. Using this model and the `augment()` function, classify each email in the `email` dataset as spam or not spam. Store the resulting data frame with an appropriate name and display the data frame as well.

3. Using your data frame from the previous part, determine, in a single pipeline and using `count()`, the numbers of emails:

   - that are labelled as spam and are actually spam,
   - that are not labelled as spam but are actually spam,
   - that are labelled as spam but are actually not spam, and
   - that are not labelled as spam and are actually not spam.

   Store the resulting data frame with an appropriate name and display the data frame as well.

4. In a single pipeline, and using `mutate()`, calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identifies the two rates.
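One way to sketch the classification and rate calculations; `spam_fit2` is a placeholder name for the model you fit in the first part:

```r
# Classify every email with the fitted model
spam_aug <- augment(spam_fit2, new_data = email)

# Tally all four combinations of truth and classification
spam_counts <- spam_aug |>
  count(spam, .pred_class)
spam_counts

# Rates conditional on the true class:
# false positive rate = actual non-spam classified as spam,
# false negative rate = actual spam classified as not spam
spam_counts |>
  group_by(spam) |>
  mutate(rate = n / sum(n))
```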
Question 3
1. Fit another logistic regression model predicting `spam` from `dollar` and another variable you think would be a good predictor. Provide a 1-sentence justification for why you chose this variable. Display the tidy output of the model.

2. Using this model and the `augment()` function, classify each email in the `email` dataset as spam or not spam. Store the resulting data frame with an appropriate name and display the data frame as well.

3. Using your data frame from the previous part, determine, in a single pipeline and using `count()`, the numbers of emails:

   - that are labelled as spam and are actually spam,
   - that are not labelled as spam but are actually spam,
   - that are labelled as spam but are actually not spam, and
   - that are not labelled as spam and are actually not spam.

   Store the resulting data frame with an appropriate name and display the data frame as well.

4. In a single pipeline, and using `mutate()`, calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identifies the two rates.

5. Based on the false positive and false negative rates of this model, comment, in 1-2 sentences, on which model (the one from Question 2 or Question 3) is preferable and why.
Part 2 - Hotel cancellations
For this exercise, we will work with hotel cancellations. The data describe the demand for two different types of hotels. Each observation represents a hotel booking between July 1, 2015 and August 31, 2017. Some bookings were cancelled (`is_canceled = 1`) and others were kept, i.e., the guests checked into the hotel (`is_canceled = 0`). You can view the code book for all variables here.
Question 4
The dataset, called `hotels.csv`, can be found in the `data` folder.

1. Read it in with `read_csv()`.

2. Transform the `is_canceled` variable to be a factor with “not canceled” (0) as the first level and “canceled” (1) as the second level.

3. Explore attributes of the bookings and summarize your findings in 5 bullet points. You must provide a visualization or summary supporting each finding.
This is not meant to be an exhaustive exploration. We anticipate a wide variety of answers to this question.
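A sketch of the first two steps, assuming the file lives at `data/hotels.csv` as described above:

```r
# Read the bookings data and relevel the outcome:
# "not canceled" (0) first, "canceled" (1) second
hotels <- read_csv("data/hotels.csv") |>
  mutate(
    is_canceled = factor(
      is_canceled,
      levels = c(0, 1),
      labels = c("not canceled", "canceled")
    )
  )
```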
Question 5
Using these data, one of our goals is to explore the following question:
Are reservations earlier in the month or later in the month more likely to be cancelled?
1. In a single pipeline, calculate the mean arrival date (`arrival_date_day_of_month`) for reservations that were cancelled and for reservations that were not cancelled.

2. In your own words, explain why we cannot use a linear model to model the relationship between whether a hotel reservation was cancelled and the day of the month of the booking.

3. Fit the appropriate model to predict whether a reservation was cancelled from `arrival_date_day_of_month` and display a tidy summary of the model output. Then, interpret the slope coefficient in the context of the data and the research question.

4. Calculate the probability that a hotel reservation is cancelled if the arrival date is on the 26th of the month. Based on this probability, would you predict this booking would be cancelled or not cancelled? Explain the reasoning behind your classification.
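A sketch of the model-fitting and prediction steps, assuming `hotels` is the data frame with the releveled `is_canceled` factor from Question 4:

```r
# Logistic regression: cancellation vs. day of month of arrival
cancel_fit <- logistic_reg() |>
  fit(is_canceled ~ arrival_date_day_of_month, data = hotels)

tidy(cancel_fit)

# Predicted probability of cancellation for an arrival on the 26th
predict(
  cancel_fit,
  new_data = tibble(arrival_date_day_of_month = 26),
  type = "prob"
)
```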
Question 6
1. Fit another model to predict whether a reservation was cancelled from `arrival_date_day_of_month` and `hotel` type (Resort Hotel or City Hotel), allowing the relationship between `arrival_date_day_of_month` and `is_canceled` to vary based on `hotel` type. Display a tidy output of the model.

2. Interpret the intercept in the context of the data.
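An interaction term is what allows the day-of-month slope to differ by hotel type; a sketch under the same assumptions as Question 5:

```r
# The * operator includes both main effects and their interaction
cancel_int_fit <- logistic_reg() |>
  fit(is_canceled ~ arrival_date_day_of_month * hotel, data = hotels)

tidy(cancel_int_fit)
```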
Part 3 - Statistics and/or data science experience
Question 7
Start this question early; it’s not one you want to try to cram into the last night before the deadline!
You have two options for this exercise. Clearly indicate which option you choose. Then, summarize your experience in no more than 10 bullet points.
Include the following in your summary:
- Name and brief description of what you did.
- Something you found new, interesting, or unexpected.
- How the talk/podcast/interview/etc. connects to something we’ve done in class.
- Citation or link to the web page for what you watched, or the name of whom you interviewed.
Option 1: Listen to a podcast or watch a video about statistics and data science. The podcast or video must be at least 30 minutes long to count towards the statistics experience. A few suggestions are below:
- posit::conf 2024 talks
- useR 2024 talks or useR 2024 keynotes
- posit::conf 2023 talks
- rstudio::conf 2022 talks
- Harvard Data Science Review Podcast
- Stats + Stories Podcast
- Casual Inference Podcast
- Not So Standard Deviations
This list is not exhaustive. You may listen to other podcasts or watch other statistics/data science videos not included on this list. Ask your professor if you are unsure whether a particular podcast or video will count towards the statistics experience.
Option 2: Talk with someone who uses statistics and/or data science in their daily work. This could include a professor, professional in industry, graduate student, etc.