AE 17: Forest classification
In this application exercise, we will
- Split our data into testing and training
- Fit logistic regression models to training data to classify outcomes
- Evaluate performance of models on testing data
We will use tidyverse and tidymodels for data exploration and modeling, respectively, and the forested package for the data.
Remember from the lecture that the forested dataset contains information on whether a plot is forested (Yes) or not (No), as well as numerical and categorical features of that plot.
glimpse(forested)
Rows: 7,107
Columns: 19
$ forested <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ year <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005,…
$ elevation <dbl> 881, 113, 164, 299, 806, 736, 636, 224, 52, 2240, 104…
$ eastness <dbl> 90, -25, -84, 93, 47, -27, -48, -65, -62, -67, 96, -4…
$ northness <dbl> 43, 96, 53, 34, -88, -96, 87, -75, 78, -74, -26, 86, …
$ roughness <dbl> 63, 30, 13, 6, 35, 53, 3, 9, 42, 99, 51, 190, 95, 212…
$ tree_no_tree <fct> Tree, Tree, Tree, No tree, Tree, Tree, No tree, Tree,…
$ dew_temp <dbl> 0.04, 6.40, 6.06, 4.43, 1.06, 1.35, 1.42, 6.39, 6.50,…
$ precip_annual <dbl> 466, 1710, 1297, 2545, 609, 539, 702, 1195, 1312, 103…
$ temp_annual_mean <dbl> 6.42, 10.64, 10.07, 9.86, 7.72, 7.89, 7.61, 10.45, 10…
$ temp_annual_min <dbl> -8.32, 1.40, 0.19, -1.20, -5.98, -6.00, -5.76, 1.11, …
$ temp_annual_max <dbl> 12.91, 15.84, 14.42, 15.78, 13.84, 14.66, 14.23, 15.3…
$ temp_january_min <dbl> -0.08, 5.44, 5.72, 3.95, 1.60, 1.12, 0.99, 5.54, 6.20…
$ vapor_min <dbl> 78, 34, 49, 67, 114, 67, 67, 31, 60, 79, 172, 162, 70…
$ vapor_max <dbl> 1194, 938, 754, 1164, 1254, 1331, 1275, 944, 892, 549…
$ canopy_cover <dbl> 50, 79, 47, 42, 59, 36, 14, 27, 82, 12, 74, 66, 83, 6…
$ lon <dbl> -118.6865, -123.0825, -122.3468, -121.9144, -117.8841…
$ lat <dbl> 48.69537, 47.07991, 48.77132, 45.80776, 48.07396, 48.…
$ land_type <fct> Tree, Tree, Tree, Tree, Tree, Tree, Non-tree vegetati…
Spending your data
Split your data into testing and training in a reproducible manner and display the split object.
# add code here
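One possible way to make the split reproducible is to set a seed before calling initial_split(). This is a minimal sketch: the seed value 1234 is arbitrary and the object name forested_split is an assumption made for illustration.

```r
library(tidymodels)
library(forested)

# Fix the random number generator so the same rows are drawn on every run.
# The seed value is arbitrary; any integer works.
set.seed(1234)

# With no prop argument, initial_split() uses its default proportion
forested_split <- initial_split(forested)

# Printing the split object shows the training/testing/total row counts
forested_split
```

Two people who use different seeds will get different rows in each set, but the same number of rows.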
What percent of the original forested data is allocated to training and what percent to testing? Compare your response to your neighbor’s. Are the percentages roughly consistent? Which argument of initial_split() determines this? How would the code need to be updated to allocate 80% of the data to training and the remaining 20% to testing?
# add code here
Add response here.
# add code here
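The allocation is controlled by the prop argument of initial_split(), which defaults to 3/4. A sketch of an 80/20 split follows; the object name forested_split_80 and the seed are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility

# prop is the proportion of rows assigned to the training set
forested_split_80 <- initial_split(forested, prop = 0.8)

# Share of the original rows that landed in each set
nrow(training(forested_split_80)) / nrow(forested)
nrow(testing(forested_split_80)) / nrow(forested)
```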
Let’s stick with the default split and save our testing and training data.
# add code here
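The training() and testing() functions extract the two data frames from a split object. In this sketch the names forested_split, forested_train, and forested_test, as well as the seed, are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)  # default proportion

# Pull the two sets out of the split object
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

# The two sets do not overlap and together cover the original data
nrow(forested_train)
nrow(forested_test)
```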
Exploratory data analysis
Create a visualization that explores the relationship between the outcome, one numerical predictor, and one categorical predictor. Then, describe, in a few sentences, what the visualization shows about the relationship between these variables.
Note: Pay attention to which dataset you use for your exploration.
# add code here
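As one illustration, side-by-side box plots can show the outcome against a numerical and a categorical predictor at once. The choice of precip_annual and tree_no_tree is arbitrary, and the object names and seed are assumptions. The sketch uses the training data so that the testing data stays untouched until evaluation.

```r
library(tidymodels)  # attaches ggplot2 and dplyr
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)

# Annual precipitation by tree_no_tree, split by the outcome
eda_plot <- ggplot(
  forested_train,
  aes(x = precip_annual, y = tree_no_tree, fill = forested)
) +
  geom_boxplot() +
  labs(
    x = "Annual precipitation",
    y = "tree_no_tree",
    fill = "Forested"
  )

eda_plot
```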
Add response here.
Model 1: Custom choice of predictors
Fit
Fit a model for classifying plots as forested or not based on a subset of predictors of your choice. Name the model forested_custom_fit and display a tidy output of the model.
# add code here
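A sketch of one such fit is shown here. The particular predictors (elevation, tree_no_tree, lat, lon, temp_annual_mean) are only an example choice, and the split object names and seed are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)

# logistic_reg() uses the "glm" engine by default
forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )

# One row per model term, with estimates and standard errors
tidy(forested_custom_fit)
```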
Predict
Use this model to predict the outcome for the testing data.
# add code here
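One convenient option is augment(), which appends the predicted class and the predicted class probabilities to the testing data. The predictors in the model, the name forested_custom_aug, and the seed are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )

# Adds .pred_class, .pred_Yes, and .pred_No to the testing data
forested_custom_aug <- augment(forested_custom_fit, new_data = forested_test)
forested_custom_aug
```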
Evaluate
Calculate the false positive and false negative rates for the testing data using this model.
# add code here
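These rates can be computed by counting predictions within each true class. This sketch treats Yes as the positive class, so the false negative rate is the share of truly forested plots predicted No, and the false positive rate is the share of truly non-forested plots predicted Yes. The predictors, object names, and seed are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )
forested_custom_aug <- augment(forested_custom_fit, new_data = forested_test)

# Within each true class, the proportion of each predicted class
custom_rates <- forested_custom_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(rate = n / sum(n)) |>
  ungroup()

# Row forested == "Yes", .pred_class == "No"  -> false negative rate
# Row forested == "No",  .pred_class == "Yes" -> false positive rate
custom_rates
```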
Another commonly used display of this information is a confusion matrix. Create this using the conf_mat() function. You will need to review the documentation for the function to determine how to use it.
# add code here
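A sketch of the call: conf_mat() takes the data, the column of true classes (truth), and the column of predicted classes (estimate). The predictors, object names, and seed are assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )
forested_custom_aug <- augment(forested_custom_fit, new_data = forested_test)

# Cross-tabulates predicted classes against true classes
custom_cm <- conf_mat(
  forested_custom_aug,
  truth = forested,
  estimate = .pred_class
)
custom_cm
```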
Sensitivity, specificity, ROC curve
Calculate sensitivity and specificity and draw the ROC curve.
# add code here
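The yardstick functions sensitivity(), specificity(), and roc_curve() cover these three tasks. The sketch relies on yardstick's default of treating the first factor level as the event, and assumes that level is Yes. The predictors, object names, and seed are likewise assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )
forested_custom_aug <- augment(forested_custom_fit, new_data = forested_test)

# Sensitivity and specificity use the hard class predictions
custom_sens <- sensitivity(forested_custom_aug, truth = forested, estimate = .pred_class)
custom_spec <- specificity(forested_custom_aug, truth = forested, estimate = .pred_class)
custom_sens
custom_spec

# The ROC curve uses the predicted probability of the event class
custom_roc <- roc_curve(forested_custom_aug, truth = forested, .pred_Yes)
autoplot(custom_roc)
```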
Model 2: All predictors
Fit
Fit a model for classifying plots as forested or not based on all predictors available. Name the model forested_full_fit and display a tidy output of the model.
# add code here
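In an R model formula, a dot on the right-hand side stands for every column other than the outcome. A sketch under the assumption that the training data is stored as forested_train and that the seed is arbitrary:

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)

# forested ~ . uses all remaining columns as predictors
forested_full_fit <- logistic_reg() |>
  fit(forested ~ ., data = forested_train)

tidy(forested_full_fit)
```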
Predict
Use this model to predict the outcome for the testing data.
# add code here
Evaluate
Calculate the false positive and false negative rates for the testing data using this model.
# add code here
Sensitivity, specificity, ROC curve
Calculate sensitivity and specificity and draw the ROC curve.
# add code here
Model 1 vs. Model 2
Plot both ROC curves and articulate how you would use them to compare these models.
# add code here
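One way to overlay the curves is to compute each with roc_curve(), label it, and stack the results before plotting. In this sketch the custom model's predictors, the helper object names, the labels "Custom" and "Full", and the seed are all assumptions made for illustration.

```r
library(tidymodels)
library(forested)

set.seed(1234)  # arbitrary seed, used only for reproducibility
forested_split <- initial_split(forested)
forested_train <- training(forested_split)
forested_test  <- testing(forested_split)

forested_custom_fit <- logistic_reg() |>
  fit(
    forested ~ elevation + tree_no_tree + lat + lon + temp_annual_mean,
    data = forested_train
  )
forested_full_fit <- logistic_reg() |>
  fit(forested ~ ., data = forested_train)

# ROC curve for each model on the same testing data
custom_roc <- augment(forested_custom_fit, new_data = forested_test) |>
  roc_curve(truth = forested, .pred_Yes) |>
  mutate(model = "Custom")
full_roc <- augment(forested_full_fit, new_data = forested_test) |>
  roc_curve(truth = forested, .pred_Yes) |>
  mutate(model = "Full")

roc_both <- bind_rows(custom_roc, full_roc)

# The dashed diagonal marks a classifier that guesses at random
ggplot(roc_both, aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  coord_equal()
```

A curve that sits closer to the top-left corner indicates higher sensitivity at the same false positive rate across thresholds.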
Add response here.