Lecture 19
Duke University
STA 199 - Fall 2024
November 7, 2024
We have been studying regression. What combinations of data types have we seen, and what did the pictures look like?
- Numerical response and one numerical predictor;
- Numerical response and one categorical predictor (two levels);
- Numerical response with both numerical and categorical predictors.
What's new today? A binary, categorical response with two levels:
\[ y = \begin{cases} 1 & \text{e.g. Yes, Win, True, Heads, Success}\\ 0 & \text{e.g. No, Lose, False, Tails, Failure}. \end{cases} \]
If we can model the relationship between predictors (\(x\)) and a binary response (\(y\)), we can use the model to do a special kind of prediction called classification.
\[ \mathbf{x}: \text{word and character counts in an e-mail.} \]
\[ y = \begin{cases} 1 & \text{it's spam}\\ 0 & \text{it's legit} \end{cases} \]
Ethical concerns?
\[ \mathbf{x}: \text{features in a medical image.} \]
\[ y = \begin{cases} 1 & \text{it's cancer}\\ 0 & \text{it's healthy} \end{cases} \]
Ethical concerns?
\[ \mathbf{x}: \text{financial and demographic info about a loan applicant.} \]
\[ y = \begin{cases} 1 & \text{applicant is at risk of defaulting on loan}\\ 0 & \text{applicant is safe} \end{cases} \]
Ethical concerns?
\[ \mathbf{x}: \text{info about a criminal suspect and their case.} \]
\[ y = \begin{cases} 1 & \text{suspect is at risk of re-offending pre-trial}\\ 0 & \text{suspect is safe} \end{cases} \]
Ethical concerns?
Instead of modeling \(y\) directly, we model the probability that \(y=1\).

Recall regression with a numerical response, where we modeled the average value of \(y\) as a linear function of the predictor:

\[ \text{avg}(y) = \beta_0+\beta_1x. \]

Modeling a binary response is similar, except now we need \(\text{Prob}(y = 1)\) to be an S-shaped function of \(\beta_0+\beta_1x\) that stays between 0 and 1. It's the logistic function:
\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}. \]
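As a quick numerical check, here is that curve as a Python function (the \(\beta\) values below are made up for illustration):

```python
import numpy as np

def logistic_prob(x, b0=-1.0, b1=2.0):
    """Prob(y = 1) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); b0, b1 are hypothetical."""
    eta = b0 + b1 * x
    return np.exp(eta) / (1 + np.exp(eta))

print(logistic_prob(0.0))   # about 0.27, below 1/2 since b0 < 0
print(logistic_prob(10.0))  # essentially 1, far to the right on the S-curve
```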
If you set \(p = \text{Prob}(y = 1)\) and do some algebra, you get a simple linear model for the log-odds:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]
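Here is the algebra, spelled out: clear the denominator, collect the terms involving \(p\), and divide.

\[ p\left(1+e^{\beta_0+\beta_1x}\right) = e^{\beta_0+\beta_1x} \;\Longrightarrow\; p = (1-p)\,e^{\beta_0+\beta_1x} \;\Longrightarrow\; \frac{p}{1-p} = e^{\beta_0+\beta_1x}. \]

Taking the natural log of both sides gives the log-odds model above.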
This is called the logistic regression model.
Three scales for the same information:

- \(p = \text{Prob}(y = 1)\) is a probability: a number between 0 and 1;
- \(p/(1-p)\) is the odds: a number between 0 and \(\infty\);
- \(\log\left(p/(1-p)\right)\) is the log-odds: a number between \(-\infty\) and \(\infty\), which is why a linear model for it makes sense.

Odds are how gamblers quote uncertainty: "The odds of this lecture going well are 10 to 1."
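For example, odds of 10 to 1 pin down the probability exactly:

\[ \frac{p}{1-p} = 10 \;\Longrightarrow\; p = \frac{10}{11} \approx 0.91. \]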
\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]
The logit function \(\log\left(p/(1-p)\right)\) is an example of a link function, which transforms the linear model so that it has an appropriate range. This makes logistic regression an example of a generalized linear model (GLM).
We estimate the parameters \(\beta_0,\,\beta_1\) using maximum likelihood (don’t worry about it) to get the “best fitting” S-curve;
The fitted model is
\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]
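As a sketch of what this looks like in code, the snippet below simulates data from a known logistic model and recovers \(b_0, b_1\) by maximum likelihood. It uses Python's statsmodels; the data and coefficient values are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a binary response from a known logistic model
rng = np.random.default_rng(199)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # true beta_0 = -1, beta_1 = 2
y = rng.binomial(1, p)

# Maximum likelihood fit of log(p-hat / (1 - p-hat)) = b0 + b1 * x
X = sm.add_constant(x)                     # column of 1s for the intercept
fit = sm.Logit(y, X).fit()
b0, b1 = fit.params
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")     # should land near -1 and 2
```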
To turn fitted probabilities into classifications, select a threshold \(0 < p^* < 1\), and solve for the \(x\)-value at which the fitted probability equals \(p^*\). That \(x\)-value is the decision boundary.
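Spelled out: set \(\widehat{p} = p^*\) in the fitted model and solve for \(x\):

\[ \log\left(\frac{p^*}{1-p^*}\right) = b_0+b_1x^* \;\Longrightarrow\; x^* = \frac{1}{b_1}\left[\log\left(\frac{p^*}{1-p^*}\right) - b_0\right]. \]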
A new person shows up with \(x_{\text{new}}\). Which side of the boundary are they on?
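Continuing the statsmodels sketch from above (the threshold and new value here are hypothetical), classification is one comparison:

```python
# Classify a new observation: compare its fitted probability to the threshold
p_star = 0.5                                # hypothetical threshold
x_new = 0.8                                 # hypothetical new observation
p_hat = fit.predict([[1.0, x_new]])[0]      # exog row is [intercept, x_new]
label = 1 if p_hat >= p_star else 0         # equivalently: is x_new past x*? (when b1 > 0)
print(label)
```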
Everything extends to multiple predictors, for example two numerical predictors and one binary response. On the probability scale:
\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m}}{1+e^{\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m}}. \]
For the log-odds, a multiple linear regression:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m. \]
The decision boundary is linear! Consider two numerical predictors \(x_1\) and \(x_2\): fixing a threshold \(p^*\) splits the \((x_1, x_2)\)-plane in two along a straight line.
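Spelled out: set the log-odds equal to the threshold value and solve for \(x_2\) (assuming \(\beta_2 \neq 0\)):

\[ \beta_0+\beta_1x_1+\beta_2x_2 = \log\left(\frac{p^*}{1-p^*}\right) \;\Longrightarrow\; x_2 = \frac{1}{\beta_2}\log\left(\frac{p^*}{1-p^*}\right) - \frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}\,x_1, \]

which is the equation of a line in the \((x_1, x_2)\)-plane.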