Lab 7

Explore and classify

Lab
Mon, Nov 18 at 8:30 am

Introduction

In this lab you’ll start your practice of statistical modeling. You’ll fit models for classification, interpret model output, and make decisions about your data and research question based on the model results. And you’ll wrap up the lab sharing a statistics and/or data science experience.

Note

This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.

Learning objectives

By the end of the lab, you will…

  • Fit and interpret logistic regression models

  • Understand the difference between odds, log odds, and probability of an event

  • Make classification based on predicted probability of an event

  • See statistics and/or data science in action in the real world

And, as usual, you will also…

  • Get more experience with data science workflow using R, RStudio, Git, and GitHub
  • Further your reproducible authoring skills with Quarto
  • Improve your familiarity with version control using Git and GitHub

Getting started

Log in to RStudio, clone your lab-7 repo from GitHub, open your lab-7.qmd document, and get started!

Step 1: Log in to RStudio

  • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
  • Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.

Step 2: Clone the repo & start a new RStudio project

  • Go to the course organization at github.com/sta199-f24 organization on GitHub. Click on the repo with the prefix lab-7. It contains the starter documents you need to complete the lab.

  • Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.

  • In RStudio, go to FileNew ProjectVersion ControlGit.

  • Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.

  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

  • Click lab-7.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

Step 3: Update the YAML

In lab-7.qmd, update the author field to your name, render your document and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.

Important

If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.

Packages

In this lab, we will work with the

  • tidyverse package for doing data analysis in a “tidy” way,
  • tidymodels package for modeling in a “tidy” way, and
  • openintro package for the dataset for Part 1. TO DO: Check.
  • Run the code cell by clicking on the green triangle (play) button for the code cell labeled load-packages. This loads the package so that its features (the functions and datasets in it) are accessible from your Console.
  • Then, render the document that loads this package to make its features (the functions and datasets in it) available for other code cells in your Quarto document.

Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.

Additionally, code should follow the tidyverse style. Particularly,

  • there should be spaces before and line breaks after each + when building a ggplot,

  • there should also be spaces before and line breaks after each |> in a data transformation pipeline,

  • code should be properly indented,

  • there should be spaces around = signs and spaces after commas.

Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.

Important

Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Important

You are also expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), using inconsistent syntax (e.g., ! to say “not” in one place and - in another place), etc.

Part 1 - Building a spam filter

The data come from incoming emails in David Diez’s (one of the authors of OpenIntro textbooks) Gmail account for the first three months of 2012. All personally identifiable information has been removed. The dataset is called email and it’s in the openintro package.

The outcome variable is spam, which takes the value 1 if the email is spam, 0 otherwise.

Question 1

  1. What type of variable is spam? What percent of the emails are spam?

  2. What type of variable is dollar - number of times a dollar sign or the word “dollar” appeared in the email? Visualize and describe its distribution, supporting your description with the appropriate summary statistics.

  3. Fit a logistic regression model predicting spam from dollar. Then, display the tidy output of the model.

  4. Using this model and the predict() function, predict the probability the email is spam if it contains 5 dollar signs. Based on this probability, how does the model classify this email?

    Note

    To obtain the predicted probability, you can set the type argument in predict() to "prob".

Question 2

  1. Fit another logistic regression model predicting spam from dollar, winner (indicating whether “winner” appeared in the email), and urgent_subj (whether the word “urgent” is in the subject of the email). Then, display the tidy output of the model.

  2. Using this model and the augment() function, classify each email in the email dataset as spam or not spam. Store the resulting data frame with an appropriate name and display the data frame as well.

  3. Using your data frame from the previous part, determine, in a single pipeline, and using count(), the numbers of emails:

    • that are labelled as spam that are actually spam
    • that are not labelled as spam that are actually spam
    • that are labelled as spam that are actually not spam
    • that are not labelled as spam that are actually not spam

    Store the resulting data frame with an appropriate name and display the data frame as well.

  4. In a single pipeline, and using mutate(), calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identified the two rates.

Question 3

  1. Fit another logistic regression model predicting spam from dollar and another variable you think would be a good predictor. Provide a 1-sentence justification for why you chose this variable. Display the tidy output of the model.

  2. Using this model and the augment() function, classify each email in the email dataset as spam or not spam. Store the resulting data frame with an appropriate name and display the data frame as well.

  3. Using your data frame from the previous part, determine, in a single pipeline, and using count(), the numbers of emails:

    • that are labelled as spam that are actually spam
    • that are not labelled as spam that are actually spam
    • that are labelled as spam that are actually not spam
    • that are not labelled as spam that are actually not spam

    Store the resulting data frame with an appropriate name and display the data frame as well.

  4. In a single pipeline, and using mutate(), calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identified the two rates.

  5. Based on the false positive and false negatives rates of this model, comment, in 1-2 sentences, on which model (one from Question 2 or Question 3) is preferable and why.

Part 2 - Hotel cancellations

For this exercise, we will work with hotel cancellations. The data describe the demand of two different types of hotels. Each observation represents a hotel booking between July 1, 2015 and August 31, 2017. Some bookings were cancelled (is_canceled = 1) and others were kept, i.e., the guests checked into the hotel (is_canceled = 0). You can view the code book for all variables here.

Question 4

The dataset, called hotels.csv, can be found in the data folder.

  1. Read it in with read_csv().

  2. Transform the is_canceled variable to be a factor with “not canceled” (0) as the first level and “canceled” (1) as the second level.

  3. Explore attributes of bookings and summarize your findings in 5 bullet points. You must provide a visualization or summary supporting each finding.

Note

This is not meant to be an exhaustive exploration. We anticipate a wide variety of answers to this question.

Question 5

Using these data, one of our goals is to explore the following question:

Are reservations earlier in the month or later in the month more likely to be cancelled?

  1. In a single pipeline, calculate the mean arrival dates (arrival_date_day_of_month) for reservations that were cancelled and reservations that were not cancelled.

  2. In your own words, explain why we can not use a linear model to model the relationship between if a hotel reservation was cancelled and the day of month for the booking.

  3. Fit the appropriate model to predict whether a reservation was cancelled from arrival_date_day_of_month and display a tidy summary of the model output. Then, interpret the slope coefficient in context of the data and the research question.

  4. Calculate the probability that the hotel reservation is cancelled if it the arrival date date is on the 26th of the month. Based on this probability, would you predict this booking would be cancelled or not cancelled. Explain your reasoning for your classification.

Question 6

  1. Fit another model to predict whether a reservation was cancelled from arrival_date_day_of_month and hotel type (Resort or City Hotel), allowing the relationship between arrival_date_day_of_month and is_canceled to vary based on hotel type. Display a tidy output of the model.

  2. Interpret the intercept in context of the data.

Part 3 - General Social Survey

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviours, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioural, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.1

In this part you will work with three variables from the 2022 General Social Survey:

  • advfront: Do you strongly agree, agree, disagree, or strongly disagree? with the following statement: “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.”

  • educ: Number of years of education.

  • polviews: We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

Question 7

  1. Read in the gss22.csv file that is in your data folder and save it as gss22. Report its number of rows and columns.

  2. Create a new data frame called gss22_advfront that only contains the variables advfront, educ, and polviews. Then, use the drop_na() function to remove rows that contain NAs from this new data frame. Report the number of rows and columns of gss22_advfront. Additionally, report what percent of the observations were discarded at this step.

  3. Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called Agree and the remaining levels combined into”Not agree". Then, re-order the levels in the following order: "Agree" and "Not agree". Finally, count() how many times each new level appears in the advfront variable.

Tip

You can do this in various ways. One option is to use the str_detect() function to detect the existence of words. Note that these sometimes show up with lowercase first letters and sometimes with upper case first letters. To detect either in the str_detect() function, you can use “[Aa]gree”. However, solve the problem however you like, this is just one option!

  1. Combine the levels of the polviews variable such that levels that have the word “liberal” in them are lumped into a level called "Liberal" and those that have the word conservative in them are lumped into a level called "Conservative". Then, re-order the levels in the following order: "Conservative" , "Moderate", and "Liberal". Finally, count() how many times each new level appears in the polviews variable.

Question 8

  1. Fit a logistic regression model that predicts advfront from educ. Report the tidy output of the model.

  2. Write out the estimated model in proper notation.

  3. Using your estimated model, predict the probability of agreeing with the statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.” (Agree in advfront) for someone with 7 years of education.

Question 9

  1. Fit a new model that adds the additional explanatory variable of polviews to your model from Question 8. Report the tidy output of the model.

  2. Now, predict the probability of agreeing with the following statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.” (Agree in advfront) for a Conservative person with 7 years of education.

Part 4 - Statistics and/or data science experience

Question 10

Warning

Start this question early, it’s not one you want to try to cram into the last night before the deadline!

You have two options for this exercise. Clearly indicate which option you choose. Then, summarize your experience in no more than 10 bullet points.

Include the following on your summary:

  • Name and brief description of what you did.
  • Something you found new, interesting, or unexpected
  • How the talk/podcast/interview/etc. connects to something we’ve done in class.
  • Citation or link to web page for what you watched or who you interviewed.

Option 1: Listen to a podcast or watch a video about statistics and data science. The podcast or video must be at least 30 minutes to count towards the statistics experience. A few suggestions are below:

This list is not exhaustive. You may listen to other podcasts or watch other statistics/data science videos not included on this list. Ask your professor if you are unsure whether a particular podcast or video will count towards the statistics experience.

Option 2: Talk with someone who uses statistics and/or data science in their daily work. This could include a professor, professional in industry, graduate student, etc.

Footnotes

  1. Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2016 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. -NORC ed.- Chicago: NORC at the University of Chicago [producer and distributor]. Data come from the gssr R Package, http://kjhealy.github.io/gssr.↩︎