Lab 4
Everything so far I
Introduction
In this lab, you’ll review topics you’ve worked with in previous labs, practice importing data, and dive into the concepts of data types and classes.
This lab assumes you’ve completed the labs so far and doesn’t repeat setup and overview content from those labs. If you haven’t done those yet, you should review the previous labs before starting on this one.
Learning objectives
By the end of the lab, you will…
- Learn to read data in from Excel spreadsheets
- Gain more experience with joining and pivoting data frames
- Review Quarto cell options
And, as usual, you will also…
- Get more experience with data science workflow using R, RStudio, Git, and GitHub
- Further your reproducible authoring skills with Quarto
- Improve your familiarity with version control using Git and GitHub
Getting started
Log in to RStudio, clone your lab-4
repo from GitHub, open your lab-4.qmd
document, and get started!
Step 1: Log in to RStudio
- Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.
- Click
STA198-199
under My reservations to log into your container. You should now see the RStudio environment.
Step 2: Clone the repo & start a new RStudio project
Go to the course organization at github.com/sta199-f24 organization on GitHub. Click on the repo with the prefix lab-4. It contains the starter documents you need to complete the lab.
Click on the green CODE button and select Use SSH. This might already be selected by default; if it is, you’ll see the text Clone with SSH. Click on the clipboard icon to copy the repo URL.
In RStudio, go to File ➛ New Project ➛Version Control ➛ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click lab-4.qmd to open the template Quarto file. This is where you will write up your code and narrative for the lab.
Step 3: Update the YAML
In lab-4.qmd
, update the author
field to your name, render your document and examine the changes. Then, in the Git pane, click on Diff to view your changes, add a commit message (e.g., “Added author name”), and click Commit. Then, push the changes to your GitHub repository, and in your browser confirm that these changes have indeed propagated to your repository.
If you run into any issues with the first steps outlined above, flag a TA for help before proceeding.
Packages
In this lab, we will work with the
- tidyverse package for doing data analysis in a “tidy” way,
- readxl package for reading in Excel files,
- janitor package for cleaning up variable names, and
- palmerpenguins and datasauRus packages for some datasets
-
Run the code cell by clicking on the green triangle (play) button for the code cell labeled
load-packages
. This loads the package so that its features (the functions and datasets in it) are accessible from your Console. - Then, render the document that loads this package to make its features (the functions and datasets in it) available for other code cells in your Quarto document.
Guidelines
As we’ve discussed in lecture, your plots should include an informative title, axes and legends should have human-readable labels, and careful consideration should be given to aesthetic choices.
Additionally, code should follow the tidyverse style. Particularly,
there should be spaces before and line breaks after each
+
when building aggplot
,there should also be spaces before and line breaks after each
|>
in a data transformation pipeline,code should be properly indented,
there should be spaces around
=
signs and spaces after commas.
Furthermore, all code should be visible in the PDF output, i.e., should not run off the page on the PDF. Long lines that run off the page should be split across multiple lines with line breaks.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to render, commit, and push your changes to GitHub.
Continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. There will be periodic reminders in this assignment to remind you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Starting with this lab, you are expected to pay attention to code smell in addition to code style and readability. You should review and improve your code to avoid redundant steps (e.g., grouping, ungrouping, and grouping again by the same variable in a pipeline), using inconsistent syntax (e.g., !
to say “not” in one place and -
in another place), etc.
Questions
Part 1 - Team USA at the Olympics
For this part, you’ll work with data from the rosters of Team USA from the 2020 and 2024 Olympics. The data come from https://www.teamusa.com and the rosters for the two games are in a single Excel file (team-usa.xlsx
in your data
folder), accross two separate spreadsheets within that file. Figure 1 shows screenshots of these spreadsheets.
Your goal is to answer questions about athletes who competed in both games and only one of the games.
Question 1
- Read data from the two sheets of
team-usa.xlsx
as two separate data frames calledteam_usa_2020
andteam_usa_2024
.
The names of the sheets are shown in the screenshots in Figure 1, or you can use the excel_sheets()
function to discover them. Additionally, note that the first row of the sheets contain a logo and a title describing the contents of the data, and not the header row containing variable names.
Read the documentation for the
clean_names()
function from the janitor package at https://sfirke.github.io/janitor/reference/clean_names.html. Use this function to “clean” the variable names ofteam_usa_2020
andteam_usa_2024
and save the data frames with the new variable names.-
Create a new variable in both of the datasets called
name
that:-
paste()
s together thefirst_name
andlast_name
variables with a space in between and - is the first variable in the resulting data frame.
-
Using the appropriate
*_join()
function, determine how many athletes participated in both Olympic Games?
Your answer to this question, based on the data frames you created, should be 0, even if it doesn’t make sense in context of actual Olympic athletes.
Question 2
If you have even a passing knowledge of the Olympic Games, you might know that there are some athletes that participated in both the 2020 and 2024 games, e.g., Simone Biles, Katie Ledecky, etc.
- The reason why athlete names didn’t match across the two data frames is that in one data frame, names are in UPPER CASE, and in the other, they’re in Title Case. Update the 2020 data frame to make
name
all upper case. Display the first 10 rows ofteam_usa_2020
with upper case names.
Your answer must use the str_to_upper()
function.
Let’s try that question again: How many athletes participated in both Olympic Games?
How many athletes participated in the 2020 Olympic Games but not the 2024 Olympic Games? How many athletes participated in the 2024 Olympic Games but not the 2020 Olympic Games?
Part 2 - DatasauRus
The data frame you will be working with in this part is called datasaurus_dozen
and it’s in the datasauRus package. This single data frame contains 13 datasets, designed to show us why data visualization is important and how summary statistics alone can be misleading. The different datasets are marked by the dataset
variable, as shown in Figure 2.
If it’s confusing that the data frame is called datasaurus_dozen
when it contains 13 datasets, you’re not alone! Have you heard of a baker’s dozen?
Here is a peek at the top 10 rows of the dataset:
datasaurus_dozen
# A tibble: 1,846 × 3
dataset x y
<chr> <dbl> <dbl>
1 dino 55.4 97.2
2 dino 51.5 96.0
3 dino 46.2 94.5
4 dino 42.8 91.4
5 dino 40.8 88.3
6 dino 38.7 84.9
7 dino 35.6 79.9
8 dino 33.1 77.6
9 dino 29.0 74.5
10 dino 26.2 71.4
# ℹ 1,836 more rows
Question 3
In a single pipeline, calculate the mean of x
, mean of y
, standard deviation of x
, standard deviation of y
, and the correlation between x
and y
for each level of the dataset
variable. Then, in 1-2 sentences, comment on how these summary statistics compare across groups (datasets).
There are 13 groups but tibble
s only print out 10 rows by default. To display all rows, add print(n = 13)
as the last step of your pipeline.
Question 4
Create a scatterplot of y
versus x
and color and facet it by dataset
. Then, in 1-2 sentences, how these plots compare across groups (datasets). How does your response in this question compare to your response to the previous question and what does this say about using visualizations and summary statistics when getting to know a dataset?
When you both color and facet by the same variable, you’ll end up with a redundant legend. Turn off the legend by adding show.legend = FALSE
to the geom creating the legend.
Render, commit, and push your changes. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.
Part 3 - Medical marijuana in NC
SurveyUSA polled 900 NC adults between September 4-7, 2024. Of the 900 NC adults, 771 were identified by SurveyUSA as being registered to vote.1 The following question was asked to these 771 adults:
Should the use of marijuana for medical use remain against the law in North Carolina? Or be legalized?
Responses were broken down into the following categories:
Variable | Levels |
---|---|
Age | 18-49; 50+ |
Opinion | Remain Against The Law; Be Made Legal; Not Sure |
Of the 771 responses, 391 were between the ages of 18-49. Of the individuals that are between 18-49, 59 individuals responded that they think medical marijuana should remain against the law, 292 said it should be made legal, and the remainder were not sure. Of the individuals that are 50+, 67 individuals responded that they think medical marijuana should remain against the law, 245 said it should be made legal, and the remainder were not sure.
Question 5
-
Fill in the code below to create a two-way table that summarizes these data.
<- tibble( survey_counts age = c(), opinion = c(), n = c() ) |> survey_counts pivot_wider( names_from = ___, values_from = ___ )
For parts b-d below, use a single pipeline starting with survey_counts
, calculate the desired proportions, and make sure the result is an ungrouped data frame with a column for relevant counts, a column for relevant proportions, and a column for the groups you’re interested in.
Calculate the proportions of 18-49 year olds and 50+ year-olds in this sample.
Calculate the proportions of those who think medical marijuana should remain against the law, should be made legal, and who are not sure.
-
Calculate the proportions of individuals in this sample who think medical marijuana should remain against the law, should be made legal, and who are not sure
- among those who are 18-49 years old and
- among those who are 50+ years old.
Question 6
-
Create a visualization that can be used to evaluate the relationship between
age
andopinion
on legalizing medical marijuana in North Carolina based on this survey’s results.TipYour visualization should display the proportions you calculated in Question 5d.
Based on your calculations so far, as well as your visualization, write 1-2 sentences that describe the relationship, in this sample, between age and opinion on legalizing medical marijuana in North Carolina.
Render, commit, and push one last time. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.
Part 4 - mtcars
In this part, you’ll work with one of the most basic and overused datasets in R: mtcars
. The data in this dataset come from the 1974 Motor Trend US magazine (so, yes, they’re old!) and provide information on fuel efficiency and other car characteristics.
Question 7
Since the dataset is used in many code examples, it’s not unexpected that some analyses of the data are good and some not so much.
For both parts of this question, you should review the data dictionary that is in the documentation for the dataset which you can find at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html or by typing ?mtcars
in your Console.
a. You come across the following visualization of these data. First, determine what is wrong with this visualization and describe it in one sentence. Then, fix and improve the visualization. As part of your improvement, make sure your legend
- is on top of the plot,
- is informative, and
- lists levels in the order they appear in the plot.
ggplot(mtcars, aes(x = wt, y = mpg, color = am)) +
geom_point() +
labs(
x = "Weight (1000 lbs)",
y = "Miles / gallon"
)
b. Update your plot from part (a) further, this time using different shaped points for cars with V-shaped and straight engines. Once again, some requirements for your legend – it should be informative and on the right of the plot.
Question 8
Your task is to make your plot from Question 7b as ugly and as ineffective as possible. Change colors, axes, fonts, themes, or anything else you can think of in the code chunk below. You can also search online for other themes, fonts, etc. that you want to tweak. Try to make it as ugly as possible, the sky is the limit! You must make at least 5 updates to the plot.
Your answer must include
a list of the at least 5 updates you’ve made to your plot from Question 7b, and
1-2 sentence explanation of why the plot you created is ugly (to you, at least) and ineffective.
Render, commit, and push your work so far. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.
Part 5 - All about Quarto
Question 9
You have the following code chunk:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point()
Add the following code cell options, one at a time, and set each to false
and then to true
. After each value, render your document and observe its effect. Ultimately, choose the values that are the most appropriate for this code cell. Based on the behaviors you observe, describe what each code cell option does.
echo
warning
eval
Question 10
- You have the following code cell again.
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point()
Add fig-width
and fig-asp
as code chunk options. Try setting fig-width
to values between 1 and 10. Try setting fig-asp
to values between 0.1 and 1. Re-render the document after each value and observe its effect. Ultimately, choose values that make the plot look visually pleasing in the rendered document. Based on the behavior you observe, describe what each chunk option does.
Now that you’ve had more practice with figure sizing in Quarto documents, review all of the plots you made in this lab and adjust their widths and aspect rations to improve how they look in your rendered document.
b. You have the following code cell, but look carefully, it’s not exactly the same!
gplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point()
Add error
as a code chunk option and set it to false
and then set it to true
. After each value, render your document and observe its effect. Ultimately, choose the value that allows you to render your document without altering the code. Based on the behavior you observe, describe what this code chunk option does.
Reading the documentation might also be helpful.
Render, commit, and push your work. Make sure that you commit and push all changed documents and that your Git pane is completely empty before proceeding.
Footnotes
Full survey results can be found at https://www.surveyusa.com/client/PollReport.aspx?g=c6995e17-3837-413e-ac98-3684e1c74dc1.↩︎