Inference overview

Lecture 23

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Fall 2024

November 21, 2024

Warm-up

While you wait…

Go to your ae project in RStudio.
Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
Click Pull to get today’s application exercise file: ae-19-equality-randomization.qmd.
Wait until you’re prompted to work on the application exercise during class before editing the file.

From last time: Randomization

Packages

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for Duke Forest dataset
library(scales)      # for pretty axis labels
library(glue)        # for constructing character strings
library(knitr)       # for neatly formatted tables

Data: Houses in Duke Forest

Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020
Scraped from Zillow
Source: openintro::duke_forest

Home in Duke Forest

Setting hypotheses

Null hypothesis, $H_0$: “There is nothing going on.” The slope of the model for predicting the prices of houses in Duke Forest from their areas is 0, $\beta_1 = 0$.
Alternative hypothesis, $H_A$: “There is something going on.” The slope of the model for predicting the prices of houses in Duke Forest from their areas is different from 0, $\beta_1 \ne 0$.

Calculate observed slope

… which we have already done:

observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept  116652.
2 area          159.

Simulate null distribution

set.seed(20241118)
null_dist <- duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute") |>
  fit()

View null distribution

null_dist

# A tibble: 200 × 3
# Groups:   replicate [100]
   replicate term        estimate
       <int> <chr>          <dbl>
 1         1 intercept 547294.   
 2         1 area           4.54 
 3         2 intercept 568599.   
 4         2 area          -3.13 
 5         3 intercept 561547.   
 6         3 area          -0.593
 7         4 intercept 526286.   
 8         4 area          12.1  
 9         5 intercept 651476.   
10         5 area         -33.0  
# ℹ 190 more rows
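
To make the permutation step less of a black box, here is a minimal base R sketch of the idea: shuffle the response so that any link with the predictor is broken, refit the line, and keep the slope. It uses simulated data as a stand-in for duke_forest, and the names `slope`, `null_slopes`, and `observed_slope` are illustrative assumptions, not part of infer.

```r
# Hypothetical simulated data standing in for duke_forest (an assumption)
set.seed(1)
n <- 100
area <- runif(n, min = 1000, max = 6000)
price <- 100000 + 150 * area + rnorm(n, sd = 150000)

# Helper: slope of the least-squares line of y on x
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

# Observed slope from the unshuffled data
observed_slope <- slope(area, price)

# Permute: shuffle price, refit, and record the slope, 100 times
reps <- 100
null_slopes <- replicate(reps, slope(area, sample(price)))

# Under the null the slopes should scatter around 0
summary(null_slopes)
```

Each permuted slope plays the role of one `area` row in `null_dist`; the intercept rows are simply not kept in this sketch.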

Visualize null distribution

null_dist |>
  filter(term == "area") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 15)

Visualize null distribution (alternative)

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Get p-value

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the
`generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.

# A tibble: 2 × 2
  term      p_value
  <chr>       <dbl>
1 area            0
2 intercept       0
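
The warning is about resolution: with 100 permutations, the smallest nonzero p-value that can be reported is 1/100. The sketch below shows this with simulated data (not duke_forest) and one common way of counting "at least as extreme" slopes; infer's internal calculation may differ in detail, and all names here are illustrative.

```r
# Hypothetical simulated data (an assumption, not duke_forest)
set.seed(2)
n <- 100
area <- runif(n, min = 1000, max = 6000)
price <- 100000 + 150 * area + rnorm(n, sd = 150000)

slope <- function(x, y) unname(coef(lm(y ~ x))[2])
observed_slope <- slope(area, price)

# Null distribution of the slope from 100 permutations
reps <- 100
null_slopes <- replicate(reps, slope(area, sample(price)))

# Two-sided p-value: share of permuted slopes at least as far from 0
n_extreme <- sum(abs(null_slopes) >= abs(observed_slope))
p_value <- n_extreme / reps

# Smallest nonzero p-value this simulation can produce
resolution <- 1 / reps

c(p_value = p_value, resolution = resolution)
```

When no permuted slope is as extreme as the observed one, it is more accurate to report the p-value as "less than 1/reps" than as exactly 0.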

Make a decision

Based on the p-value calculated, what is the conclusion of the hypothesis test?

Inference for a mean

Estimating the average price of houses in Duke Forest

Estimate the average price of houses in Duke Forest with a 95% confidence interval.

Computing the CI for the mean I

Calculate the observed mean:

observed_mean <- duke_forest |>
  specify(response = price) |>
  calculate(stat = "mean")

observed_mean

Response: price (numeric)
# A tibble: 1 × 1
     stat
    <dbl>
1 559899.

Computing the CI for the mean II

Take 100 bootstrap samples and calculate the mean of each one:

set.seed(1121)

boot_means <- duke_forest |>
  specify(response = price) |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "mean")

boot_means

Response: price (numeric)
# A tibble: 100 × 2
   replicate    stat
       <int>   <dbl>
 1         1 591471.
 2         2 545975.
 3         3 588256.
 4         4 569751.
 5         5 566394.
 6         6 583654.
 7         7 533031.
 8         8 575321.
 9         9 559893.
10        10 588826.
# ℹ 90 more rows

Computing the CI for the mean III

Compute the 95% CI as the middle 95% of the bootstrap distribution:

get_confidence_interval(
  boot_means, 
  point_estimate = observed_mean, 
  level = 0.95,
  type = "percentile"
)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1  521923.  605514.
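
The percentile method amounts to taking the 2.5th and 97.5th percentiles of the bootstrap means. A minimal base R sketch, assuming simulated prices rather than the duke_forest data (the names `prices`, `boot_means`, and `ci` are illustrative):

```r
# Hypothetical right-skewed prices standing in for duke_forest$price
set.seed(3)
prices <- rlnorm(100, meanlog = 13.1, sdlog = 0.4)

# Resample with replacement, same size as the data, and take the mean
reps <- 1000
boot_means <- replicate(reps, mean(sample(prices, replace = TRUE)))

# Middle 95% of the bootstrap distribution
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

This sketch uses more repetitions than the slides do (1000 vs. 100); more repetitions make the percentile endpoints more stable.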

Making a decision about the average price of houses in Duke Forest

An article in the Durham Herald Sun states that the average price of a house in Duke Forest is $600,000. Do these data provide convincing evidence to refute this claim?

Setting the hypotheses

Define $\mu$ as the true average price of all houses in Duke Forest:

$H_0: \mu = 600000$ - The true average price of all houses in Duke Forest is $600,000 (as claimed by the Durham Herald Sun, i.e., there’s nothing going on)
$H_A: \mu \ne 600000$ - The true average price of all houses in Duke Forest is different from $600,000 (refuting the claim by the Durham Herald Sun, i.e., there is something going on)

Calculate the observed mean

Well, we already did this!

observed_mean

Response: price (numeric)
# A tibble: 1 × 1
     stat
    <dbl>
1 559899.

Simulate the null distribution

set.seed(1121)

null_means <- duke_forest |>
  specify(response = price) |>
  hypothesize(null = "point", mu = 600000) |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "mean")

null_means

Response: price (numeric)
Null Hypothesis: point
# A tibble: 100 × 2
   replicate    stat
       <int>   <dbl>
 1         1 631572.
 2         2 586077.
 3         3 628357.
 4         4 609853.
 5         5 606495.
 6         6 623755.
 7         7 573132.
 8         8 615423.
 9         9 599994.
10        10 628927.
# ℹ 90 more rows
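
Comparing this output with the earlier bootstrap means (same seed) suggests what the point null does: each value here is the corresponding bootstrap mean shifted up by about 40,101, which is 600000 - 559899. In other words, the data appear to be recentered so that their mean equals the null value before resampling. A minimal base R sketch of that idea, using simulated prices and illustrative names (`shifted`, `mu0`):

```r
# Hypothetical prices standing in for duke_forest$price (an assumption)
set.seed(4)
prices <- rlnorm(100, meanlog = 13.1, sdlog = 0.4)

# Null value from the hypothesis
mu0 <- 600000

# Shift the sample so its mean is exactly mu0, keeping its spread and shape
shifted <- prices - mean(prices) + mu0

# Bootstrap the shifted sample to simulate means under the null
reps <- 100
null_means <- replicate(reps, mean(sample(shifted, replace = TRUE)))

# The null distribution should be centered near mu0
mean(null_means)
```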

Visualize the null distribution

visualize(null_means)

Calculate the p-value

Probability of observed or lower outcome, given the null hypothesis is true: $P(\bar{x} < 559899 ~ | ~ \mu = 600000)$

Probability of observed or more extreme outcome, given the null hypothesis is true:

\[ 2 \times P(\bar{x} < 559899 ~ | ~ \mu = 600000) \]

null_means |>
  get_p_value(
    obs_stat = observed_mean, 
    direction = "two-sided"
  )

# A tibble: 1 × 1
  p_value
    <dbl>
1    0.02
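
A p-value of 0.02 from 100 simulated means would be consistent with one of them falling at or below the observed mean, doubled for the two-sided alternative. The sketch below spells out that doubling in base R. The null distribution here is simulated from a normal distribution purely for illustration, and the doubling-and-capping rule is an assumption about how a two-sided simulation p-value is typically computed, not a statement of infer's exact internals.

```r
# Hypothetical null distribution of sample means centered at 600000
set.seed(5)
null_means <- rnorm(100, mean = 600000, sd = 20000)

# Observed sample mean from the slides
observed_mean <- 559899

# One-sided tail proportions
p_left <- mean(null_means <= observed_mean)
p_right <- mean(null_means >= observed_mean)

# Two-sided p-value: double the smaller tail, capped at 1
p_two_sided <- min(1, 2 * min(p_left, p_right))
p_two_sided
```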

Visualize the p-value

visualize(null_means) +
  shade_p_value(
    obs_stat = observed_mean, 
    direction = "two-sided"
  )

Application exercise

ae-19-equality-randomization

Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-19-equality-randomization.qmd.
Work through the application exercise in class, and render, commit, and push your edits.

Recap of AE

A hypothesis test is a statistical technique used to evaluate competing claims (null and alternative hypotheses) using data.
We simulate a null distribution using our original data.
We use our sample statistic and direction of the alternative hypothesis to calculate the p-value.
We use the p-value to decide whether the data provide convincing evidence against the null hypothesis and in favor of the alternative.