Lab 5

Visualize, model, interpret

Lab

Mon, Nov 4 at 8:30 am

Introduction

In this lab you’ll start your practice of statistical modeling. You’ll fit models, interpret model output, and make decisions about your data and research question based on the model results. And you’ll wrap up the lab revisiting the topic of data science ethics.

Note

This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.

Learning objectives

By the end of the lab, you will…

Fit linear regression models and interpret model coefficients in context of the data and research question.
Transform data using a log-transformation for a better model fit.
Evaluate the validity of a data visualization and correct any misleading mistakes.

And, as usual, you will also…

Get more experience with data science workflow using R, RStudio, Git, and GitHub
Further your reproducible authoring skills with Quarto
Improve your familiarity with version control using Git and GitHub

Getting started

Click here if you prefer to see step-by-step instructions

Step 1: Log in to RStudio

Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.

Step 2: Clone the repo & start a new RStudio project

Go to the course organization at github.com/sta199-f24 organization on GitHub. Click on the repo with the prefix lab-5. It contains the starter documents you need to complete the lab.
Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-5.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

Step 3: Update the YAML

In lab-5.qmd, update the author field to your name, render your document and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.

Important

If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.

Packages

In this lab, we will work with the

tidyverse package for doing data analysis in a “tidy” way,
tidymodels package for modeling in a “tidy” way, and
openintro package for the dataset for Part 1.

library(tidyverse)
library(tidymodels)
library(openintro)

Run the code cell by clicking on the green triangle (play) button for the code cell labeled load-packages. This loads the package so that its features (the functions and datasets in it) are accessible from your Console.
Then, render the document that loads this package to make its features (the functions and datasets in it) available for other code cells in your Quarto document.

Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.

Additionally, code should follow the tidyverse style. Particularly,

there should be spaces before and line breaks after each + when building a ggplot,
there should also be spaces before and line breaks after each |> in a data transformation pipeline,
code should be properly indented,
there should be spaces around = signs and spaces after commas.

Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.

Important

Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Important

You are also expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), using inconsistent syntax (e.g., ! to say “not” in one place and - in another place), etc.

Part 1 - Smoking during pregnancy

Question 1

We are interested in the impact of smoking during pregnancy. Since it is not possible to run a randomized controlled experiment to investigate this impact, we will instead use a data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from a data set released in 2014 by the state of North Carolina. The data set is called births14, and it is included in the openintro package you loaded at the beginning of the assignment.

Create a version of the births14 data set dropping observations where there are NAs for habit. You can call this version births14_habitgiven.
Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions. Create an appropriate plot displaying the relationship between weight and habit. In 2-3 sentences, discuss the relationship observed.

Question 2

Fit a linear model that predicts weight from habit and save the model object. Then, provide the tidy summary output.
Write the estimated least squares regression line below using proper notation.

Tip

If you need to type an equation using proper notation, type your answers in-between $$ and $$. You may use \hat{example} to put a hat on a character.

Interpret the intercept in the context of the data and the research question. Is the intercept meaningful in this context? Why or why not?
Interpret the slope in the context of the data and the research question.

Question 3

Another researcher is interested in assessing the relationship between babies’ weights and mothers’ ages. Fit another linear model predicting weight to investigate this relationship and save the model object. Then, provide the tidy summary output.
In 2-3 sentences, explain how the regression line to model these data is fit, i.e., based on what criteria R determines the regression line.
Interpret the intercept in the context of the data and the research question. Is the intercept meaningful in this context? Why or why not?
Interpret the slope in the context of the data and the research question.

Part 2 - Parasites

Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.

In this part of the lab, we will see how much evolutionary history predicts parasite similarity.

The dataset comes from an Ecology Letters paper by Cooper at al. (2012) entitled “Phylogenetic host specificity and understanding parasite sharing in primates” located here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.

Each row of the data contains two species, species1 and species2.

Subsequent columns describe metrics that compare the species:

divergence_time: how many (millions) of years ago the two species diverged. i.e. how many million years ago they were the same species.
distance: geodesic distance between species geographic range centroids (in kilometers)
BMdiff: difference in body mass between the two species (in grams)
precdiff: difference in mean annual precipitation across the two species geographic ranges (mm)
parsim: a measure of parasite similarity (proportion of parasites shared between species, ranges from 0 to 1.)

The data are available in parasites.csv in your data folder.

Question 4

Let’s start by reading in the parasites data and examining the relationship between divergence_time and parsim.

Load the data and save the data frame as parasites.
Based on the goals of the analysis, what is the response variable?
Visualize the relationship between the two variables.
Use the visualization to describe the relationship between the two variables.

Question 5

Next, model this relationship.

Fit the model and write the estimated regression equation.
Interpret the slope and the intercept in the context of the data.
Recreate the visualization from Question 4, this time adding a regression line to the visualization.
What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?

Question 6

Since parsim takes values between 0 and 1, we want to transform this variable so that it can range between (−∞,+∞). This will be better suited for fitting a regression model (and interpreting predicted values!)

Using mutate, create a new variable transformed_parsim that is calculated as log(parsim/(1-parsim)). Add this variable to your data frame.

Note

log() in R represents the nautral log.
Then, visualize the relationship between divergence_time and transformed_parsim. Add a regression line to your visualization.
Write a 1-2 sentence description of what you observe in the visualization.

Question 7

Which variable is the strongest individual predictor of parasite similarity between species?

To answer this question, begin by fitting a linear regression model to each pair of variables. Do not report the model outputs in a tidy format but save each one as dt_model, dist_model, BM_model, and prec_model, respectively.

divergence_time and transformed_parsim
distance and transformed_parsim
BMdiff and transformed_parsim
precdiff and transformed_parsim

Report the slopes for each of these models. Use proper notation.
To answer the question of interest, would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?

Question 8

Now, what if we calculated $R^2$ to help answer our question? To compare the explanatory power of each individual predictor, we will look at $R^2$ between the models. $R^2$ is a measure of how much of the variability in the response variable is explained by the model.

As you may have guessed from the name $R^2$ can be calculated by squaring the correlation when we have a simple linear regression model. The correlation r takes values -1 to 1, therefore, $R^2$ takes values 0 to 1. Intuitively, if r=1 or −1, then $R^2$=1, indicating the model is a perfect fit for the data. If r≈0 then $R^2$≈0, indicating the model is a very bad fit for the data.

You can calculate $R^2$ using the glance function. For example, you can calculate $R^2$ for dt_model using the code glance(dt_model)$r.squared.

Calculate and report $R^2$ for each model fit in the previous exercise.
To answer our question of interest, would it be useful to compare the $R^2$ in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?

Part 3 - Misrepresentation

Question 9

The following chart was shared by @GraphCrimes on X/Twitter on September 3, 2022.

What is misleading about this graph?
Suppose you wanted to recreate this plot, with improvements to avoid its misleading pitfalls from part (a). You would obviously need the data from the survey in order to be able to do that. How many observations would this data have? How many variables (at least) should it have, and what should those variables be?
Load the data for this survey from data/survation.csv. Confirm that the data match the percentages from the visualization. That is, calculate the percentages of public sector, private sector, don’t know for each of the services and check that they match the percentages from the plot.

Question 10

Create an improved version of the visualization. Your improved visualization:

should also be a stacked bar chart with services on the y-axis and presented in the same order as the original plot; sectors should create segments of the plot using the same colors (colors do not have to be exact) and appear in the same order as the original plot
should have the same legend location
should have the same title and caption
does not need to have a bolded title or a gray background

How does the improved visualization look different than the original? Does it send a different message at a first glance?

Tip

Use \n to add a line break to your title. And note that since the title is very long, it might run off the page in your code. That’s ok!

Additionally, the colors used in the plot are gray, #FF3205, and #006697.

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 4 - Coded Bias

Watch the documentary “Coded Bias”. To do so, you either need to be on the Duke network or connected to the Duke VPN. Then go to https://find.library.duke.edu/catalog/DUKE009834953 and click on “View Online”. There is nothing to turn in for this part. Once you watch the video, reflect on points highlighting at least one thing that you already knew about (from the course prep materials) and at least one thing you learned from the movie as well as any other aspects of the documentary that you found interesting / enlightening. There is nothing to turn in for this part.