Lab 4

Everything so far I

Lab

Mon, Oct 7 at 8:30 am

Introduction

In this lab, you’ll review topics you’ve worked with in previous labs, practice importing data, and dive into the concepts of data types and classes.

Note

This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.

Learning objectives

By the end of the lab, you will…

Learn to read data in from Excel spreadsheets
Gain more experience with joining and pivoting data frames
Review Quarto cell options

And, as usual, you will also…

Get more experience with data science workflow using R, RStudio, Git, and GitHub
Further your reproducible authoring skills with Quarto
Improve your familiarity with version control using Git and GitHub

Getting started

Step 1: Log in to RStudio

Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
Click STA198-199 under My reservations to log into your container. You should now see the RStudio environment.

Step 2: Clone the repo & start a new RStudio project

Go to the course organization at github.com/sta199-f24 on GitHub. Click on the repo with the prefix lab-4. It contains the starter documents you need to complete the lab.
Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛ Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-4.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.

Step 3: Update the YAML

In lab-4.qmd, update the author field to your name, render your document and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.

Important

If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.

Packages

In this lab, we will work with the

tidyverse package for doing data analysis in a “tidy” way,
readxl package for reading in Excel files,
janitor package for cleaning up variable names, and
palmerpenguins and datasauRus packages for some datasets

library(tidyverse)
library(readxl)
library(janitor)
library(palmerpenguins)
library(datasauRus)

Run the code cell labeled load-packages by clicking on its green triangle (play) button. This loads the packages so that their features (the functions and datasets in them) are accessible from your Console.
Then, render the document, which loads these packages to make their features available to the other code cells in your Quarto document.

Guidelines

As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.

Additionally, code should follow the tidyverse style. Particularly,

there should be spaces before and line breaks after each + when building a ggplot,
there should also be spaces before and line breaks after each |> in a data transformation pipeline,
code should be properly indented,
there should be spaces around = signs and spaces after commas.

Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.

Important

Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Important

Starting with this lab, you are expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), inconsistent syntax (e.g., using ! to say “not” in one place and - in another), etc.

Questions

Part 1 - Team USA at the Olympics

For this part, you’ll work with data from the rosters of Team USA from the 2020 and 2024 Olympics. The data come from https://www.teamusa.com and the rosters for the two games are in a single Excel file (team-usa.xlsx in your data folder), across two separate spreadsheets within that file. Figure 1 shows screenshots of these spreadsheets.

Your goal is to answer questions about athletes who competed in both games and athletes who competed in only one of the games.

Question 1

a. Read data from the two sheets of team-usa.xlsx as two separate data frames called team_usa_2020 and team_usa_2024.

Tip

The names of the sheets are shown in the screenshots in Figure 1, or you can use the excel_sheets() function to discover them. Additionally, note that the first row of each sheet contains a logo and a title describing the contents of the data, not the header row containing variable names.

b. Read the documentation for the clean_names() function from the janitor package at https://sfirke.github.io/janitor/reference/clean_names.html. Use this function to “clean” the variable names of team_usa_2020 and team_usa_2024 and save the data frames with the new variable names.
c. Create a new variable in both of the datasets called name that:
- paste()s together the first_name and last_name variables with a space in between and
- is the first variable in the resulting data frame.
d. Using the appropriate *_join() function, determine how many athletes participated in both Olympic Games.

Important

Your answer to this question, based on the data frames you created, should be 0, even if it doesn’t make sense in context of actual Olympic athletes.
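
The steps in this question can be tried out on a different workbook first. The sketch below is only an illustration, not the lab solution: it assumes the example file deaths.xlsx that ships with the readxl package (the header of its arts sheet sits in row 5), and label is a made-up column name used only to show paste() and how to place a new column first.

```r
library(dplyr)
library(readxl)
library(janitor)

# Example workbook bundled with readxl (an assumption; not team-usa.xlsx)
path <- readxl_example("deaths.xlsx")

# Discover the sheet names in the workbook
excel_sheets(path)

arts <- read_excel(path, sheet = "arts", skip = 4) |> # skip rows above the header
  clean_names() |>                                     # e.g., "Has kids" becomes has_kids
  mutate(
    label = paste(name, profession), # paste() separates with a space by default
    .before = 1                      # make the new column the first one
  )

names(arts)
```

Adjust the sheet name and the number of skipped rows to match what you see in your own file.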

Question 2

If you have even a passing knowledge of the Olympic Games, you might know that there are some athletes that participated in both the 2020 and 2024 games, e.g., Simone Biles, Katie Ledecky, etc.

a. The reason why athlete names didn’t match across the two data frames is that in one data frame, names are in UPPER CASE, and in the other, they’re in Title Case. Update the 2020 data frame to make name all upper case. Display the first 10 rows of team_usa_2020 with upper case names.

Important

Your answer must use the str_to_upper() function.

b. Let’s try that question again: How many athletes participated in both Olympic Games?
c. How many athletes participated in the 2020 Olympic Games but not the 2024 Olympic Games? How many athletes participated in the 2024 Olympic Games but not the 2020 Olympic Games?
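
To see why case matters when joining, here is a minimal sketch on two tiny made-up rosters. The data frames roster_a and roster_b and the names in them are hypothetical and are not the lab data; the sketch only shows how inner_join() and anti_join() behave before and after str_to_upper().

```r
library(dplyr)
library(stringr)

# Hypothetical toy rosters, not the Team USA data
roster_a <- tibble(name = c("Ada Lovelace", "Grace Hopper", "Alan Turing"))
roster_b <- tibble(name = c("ADA LOVELACE", "GRACE HOPPER", "EDSGER DIJKSTRA"))

# Joins match strings exactly, so differing case means no matches
no_match <- inner_join(roster_a, roster_b, by = "name")

# Put both rosters in the same case before joining
roster_a_upper <- roster_a |>
  mutate(name = str_to_upper(name))

# Rows present in both rosters
both <- inner_join(roster_a_upper, roster_b, by = "name")

# Rows present in one roster but not the other
only_a <- anti_join(roster_a_upper, roster_b, by = "name")
only_b <- anti_join(roster_b, roster_a_upper, by = "name")

nrow(both)
nrow(only_a)
nrow(only_b)
```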

Part 2 - DatasauRus

The data frame you will be working with in this part is called datasaurus_dozen and it’s in the datasauRus package. This single data frame contains 13 datasets, designed to show us why data visualization is important and how summary statistics alone can be misleading. The different datasets are marked by the dataset variable, as shown in Figure 2.

Figure 2: The `datasaurus_dozen` data frame stacks 13 datasets on top of each other. This figure shows the first three datasets.

Note

If it’s confusing that the data frame is called datasaurus_dozen when it contains 13 datasets, you’re not alone! Have you heard of a baker’s dozen?

Here is a peek at the top 10 rows of the dataset:

datasaurus_dozen

# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows

Question 3

In a single pipeline, calculate the mean of x, mean of y, standard deviation of x, standard deviation of y, and the correlation between x and y for each level of the dataset variable. Then, in 1-2 sentences, comment on how these summary statistics compare across groups (datasets).

Tip

There are 13 groups but tibbles only print out 10 rows by default. To display all rows, add print(n = 13) as the last step of your pipeline.
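
As a reminder of the pattern, here is a minimal sketch of grouped summaries in a single pipeline. It uses the built-in iris data frame rather than datasaurus_dozen, and the summary column names are placeholders chosen for this illustration.

```r
library(dplyr)

iris_summary <- iris |>
  group_by(Species) |>
  summarize(
    mean_length = mean(Sepal.Length),
    sd_length = sd(Sepal.Length),
    r = cor(Sepal.Length, Sepal.Width) # correlation within each group
  )

# print(n = ...) controls how many rows of a tibble are displayed
iris_summary |>
  print(n = 3)
```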

Question 4

Create a scatterplot of y versus x and color and facet it by dataset. Then, in 1-2 sentences, comment on how these plots compare across groups (datasets). How does your response in this question compare to your response to the previous question and what does this say about using visualizations and summary statistics when getting to know a dataset?

Tip

When you both color and facet by the same variable, you’ll end up with a redundant legend. Turn off the legend by adding show.legend = FALSE to the geom creating the legend.
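
A minimal sketch of the tip above, again using the built-in iris data frame as a stand-in (an assumption for illustration, not the lab data): the same variable is mapped to color and used for faceting, and the legend is switched off in the geom.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(show.legend = FALSE) + # the facet labels already identify groups
  facet_wrap(~Species) +
  labs(
    title = "Sepal width vs. length",
    x = "Sepal length (cm)",
    y = "Sepal width (cm)"
  )

p
```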

Render, commit, and push your changes. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.

Part 3 - Medical marijuana in NC

SurveyUSA polled 900 NC adults from September 4 to 7, 2024. Of the 900 NC adults, 771 were identified by SurveyUSA as being registered to vote.¹ The following question was asked of these 771 adults:

Should the use of marijuana for medical use remain against the law in North Carolina? Or be legalized?

Responses were broken down into the following categories:

Variable	Levels
Age	18-49; 50+
Opinion	Remain Against The Law; Be Made Legal; Not Sure

Of the 771 responses, 391 were between the ages of 18-49. Of the individuals that are between 18-49, 59 individuals responded that they think medical marijuana should remain against the law, 292 said it should be made legal, and the remainder were not sure. Of the individuals that are 50+, 67 individuals responded that they think medical marijuana should remain against the law, 245 said it should be made legal, and the remainder were not sure.

Question 5

a. Fill in the code below to create a two-way table that summarizes these data.

survey_counts <- tibble(
  age = c(),
  opinion = c(),
  n = c()
)

survey_counts |>
  pivot_wider(
    names_from = ___,
    values_from = ___
  )

For parts b-d below, use a single pipeline starting with survey_counts, calculate the desired proportions, and make sure the result is an ungrouped data frame with a column for relevant counts, a column for relevant proportions, and a column for the groups you’re interested in.

b. Calculate the proportions of 18-49 year olds and 50+ year-olds in this sample.
c. Calculate the proportions of those who think medical marijuana should remain against the law, should be made legal, and who are not sure.
d. Calculate the proportions of individuals in this sample who think medical marijuana should remain against the law, should be made legal, and who are not sure
- among those who are 18-49 years old and
- among those who are 50+ years old.
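
The difference between overall and within-group proportions can be sketched on a small made-up table of counts. Everything in toy_counts (the group and answer labels and the numbers) is hypothetical and unrelated to the survey; the point is only the shape of each pipeline.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts, not the survey data
toy_counts <- tibble(
  group = c("A", "A", "B", "B"),
  answer = c("Yes", "No", "Yes", "No"),
  n = c(30, 10, 20, 40)
)

# Two-way table: one row per group, one column per answer
toy_wide <- toy_counts |>
  pivot_wider(names_from = answer, values_from = n)

# Proportion of each group in the whole sample
toy_marginal <- toy_counts |>
  group_by(group) |>
  summarize(n = sum(n)) |>
  mutate(prop = n / sum(n))

# Proportion of each answer within each group
toy_conditional <- toy_counts |>
  group_by(group) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

toy_wide
toy_marginal
toy_conditional
```

In toy_conditional the proportions sum to 1 within each group, while in toy_marginal they sum to 1 across groups.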

Question 6

a. Create a visualization that can be used to evaluate the relationship between age and opinion on legalizing medical marijuana in North Carolina based on this survey’s results.

Tip

Your visualization should display the proportions you calculated in Question 5d.

b. Based on your calculations so far, as well as your visualization, write 1-2 sentences that describe the relationship, in this sample, between age and opinion on legalizing medical marijuana in North Carolina.
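
One possible way to display within-group proportions is a filled bar chart. The sketch below uses a small hypothetical table of counts (toy_counts, with made-up labels and numbers), and position = "fill" is one option among several, not the required approach.

```r
library(ggplot2)

# Hypothetical counts, not the survey data
toy_counts <- data.frame(
  group = c("A", "A", "B", "B"),
  answer = c("Yes", "No", "Yes", "No"),
  n = c(30, 10, 20, 40)
)

p <- ggplot(toy_counts, aes(x = group, y = n, fill = answer)) +
  geom_col(position = "fill") + # rescales each bar to total 1
  labs(
    title = "Answers by group",
    x = "Group",
    y = "Proportion",
    fill = "Answer"
  )

p
```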

Render, commit, and push your changes. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.

Part 4 - `mtcars`

In this part, you’ll work with one of the most basic and overused datasets in R: mtcars. The data in this dataset come from the 1974 Motor Trend US magazine (so, yes, they’re old!) and provide information on fuel efficiency and other car characteristics.

Question 7

Since the dataset is used in many code examples, it’s not unexpected that some analyses of the data are good and some not so much.

Tip

For both parts of this question, you should review the data dictionary that is in the documentation for the dataset which you can find at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html or by typing ?mtcars in your Console.

a. You come across the following visualization of these data. First, determine what is wrong with this visualization and describe it in one sentence. Then, fix and improve the visualization. As part of your improvement, make sure your legend

is on top of the plot,
is informative, and
lists levels in the order they appear in the plot.

ggplot(mtcars, aes(x = wt, y = mpg, color = am)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles / gallon"
  )

b. Update your plot from part (a) further, this time using different shaped points for cars with V-shaped and straight engines. Once again, some requirements for your legend – it should be informative and on the right of the plot.
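
For reference, here is a minimal sketch of how a legend's title and position can be controlled. It deliberately uses a different variable (cyl, the number of cylinders) and a different x-axis than the plot above, so treat it as an illustration of the mechanics rather than a solution; the label text is a placeholder.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    x = "Gross horsepower",
    y = "Miles / gallon",
    color = "Number of cylinders" # legend title
  ) +
  theme(legend.position = "top") # other options include "right" and "bottom"

p
```

Wrapping a numeric variable in factor() makes ggplot treat it as discrete, so each level gets its own color.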

Question 8

Your task is to make your plot from Question 7b as ugly and as ineffective as possible. Change colors, axes, fonts, themes, or anything else you can think of in the code chunk below. You can also search online for other themes, fonts, etc. that you want to tweak. Try to make it as ugly as possible; the sky is the limit! You must make at least 5 updates to the plot.

Your answer must include

a list of at least 5 updates you’ve made to your plot from Question 7b, and
a 1-2 sentence explanation of why the plot you created is ugly (to you, at least) and ineffective.

Render, commit, and push your work so far. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.

Part 5 - All about Quarto

Question 9

You have the following code chunk:

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point()

Add the following code cell options, one at a time, and set each to false and then to true. After each value, render your document and observe its effect. Ultimately, choose the values that are the most appropriate for this code cell. Based on the behaviors you observe, describe what each code cell option does.

echo
warning
eval
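
As a syntax reminder, code cell options go at the top of the cell, one per line, each starting with #|. The sketch below shows only where they go; the label penguin-bills and the true/false values are placeholders, not recommended settings.

```r
#| label: penguin-bills
#| echo: true
#| warning: false
#| eval: true

library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

p
```

In lab-4.qmd these lines sit inside the code cell itself, directly below the line that opens the cell.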

Question 10

You have the following code cell again.

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point()

a. Add fig-width and fig-asp as code chunk options. Try setting fig-width to values between 1 and 10. Try setting fig-asp to values between 0.1 and 1. Re-render the document after each value and observe its effect. Ultimately, choose values that make the plot look visually pleasing in the rendered document. Based on the behavior you observe, describe what each chunk option does.

Tip

Now that you’ve had more practice with figure sizing in Quarto documents, review all of the plots you made in this lab and adjust their widths and aspect ratios to improve how they look in your rendered document.

b. You have the following code cell, but look carefully, it’s not exactly the same!

gplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point()

Add error as a code chunk option and set it to false and then set it to true. After each value, render your document and observe its effect. Ultimately, choose the value that allows you to render your document without altering the code. Based on the behavior you observe, describe what this code chunk option does.

Tip

Reading the documentation might also be helpful.

Render, commit, and push your work. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.

Footnotes

¹ Full survey results can be found at https://www.surveyusa.com/client/PollReport.aspx?g=c6995e17-3837-413e-ac98-3684e1c74dc1.
