11 Model building

11.1 Specify models

The process of specifying our models is always as follows:

  1. Pick a model type
  2. Set the engine
  3. Set the mode: regression or classification

You can choose the model type and engine from the list of models available in parsnip and its extension packages.
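
For example, parsnip can show you which engines (and modes) are available for a given model type with show_engines():

library(parsnip)

# list the engines and modes available for logistic regression
show_engines("logistic_reg")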

11.1.1 Logistic regression

log_spec <- # your model specification
  logistic_reg() %>%  # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification") # model mode

# Show your model specification
log_spec
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

11.1.2 Random forest

library(ranger)

rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

When we set the engine, we add importance = "impurity". This will provide variable importance scores for this model, which gives some insight into which predictors drive model performance.
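
Once a workflow containing this specification has been fitted (we build these workflows in the next section), the raw scores can be pulled from the underlying ranger object. A minimal sketch, assuming a fitted workflow object called rf_fit (a hypothetical name, not created in this chapter):

# sketch: retrieve the impurity-based importance scores
rf_fit %>% 
  extract_fit_engine() %>%  # the underlying ranger object
  ranger::importance()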

11.1.3 Boosted tree (XGBoost)

library(xgboost)

xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") 

11.1.4 K-nearest neighbor

knn_spec <- 
  nearest_neighbor(neighbors = 4) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("classification") 

11.1.5 Neural network

To use the neural network model, you will need to install the keras R package. You will also need the Python Keras library installed (see ?keras::install_keras()).
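
If these dependencies are missing, a one-time setup could look like the following (commented out so it is not run on every render):

# install.packages("keras")   # R package
# keras::install_keras()      # Python backend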

library(keras)

nnet_spec <-
  mlp() %>%
  set_mode("classification") %>% 
  set_engine("keras", verbose = 0) 

We set the engine-specific verbose argument to prevent logging the results.
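
If you want to check how parsnip passes this argument on to the engine, you can translate the specification into the underlying engine call:

# show the call that parsnip will construct for the keras engine
nnet_spec %>% translate()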

11.2 Create workflows

To combine the data preparation recipe with the model building, we use the package workflows. A workflow is an object that can bundle together your pre-processing recipe, modeling, and even post-processing requests (like calculating the RMSE).

11.2.1 Logistic regression

Bundle recipe and model with workflows:

log_wflow <- # new workflow object
 workflow() %>% # use workflow function
 add_recipe(housing_rec) %>%   # use the new recipe
 add_model(log_spec)   # add your model spec

# show object
log_wflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 7 Recipe Steps
#> 
#> ● step_log()
#> ● step_naomit()
#> ● step_novel()
#> ● step_normalize()
#> ● step_dummy()
#> ● step_zv()
#> ● step_corr()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

11.2.2 Random forest

Bundle recipe and model:

rf_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(rf_spec) 

11.2.3 XGBoost

Bundle recipe and model:

xgb_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(xgb_spec)

11.2.4 K-nearest neighbor

Bundle recipe and model:

knn_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(knn_spec)

11.2.5 Neural network

Bundle recipe and model:

nnet_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(nnet_spec)

11.3 Evaluate models

Now we can use our cross-validation folds (cv_folds) to estimate the performance of our models with the fit_resamples() function, which fits the models on each of the folds and stores the results.
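
As a reminder, cv_folds was created during data preparation. A sketch of that setup (assuming the training data is stored in an object called train_data; adjust the names to your own objects):

# 5-fold cross-validation, stratified by the outcome (sketch)
cv_folds <- vfold_cv(train_data, 
                     v = 5, 
                     strata = price_category)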

Note that fit_resamples() fits our model to each resample and evaluates it on the held-out part of that resample. The function is usually only used to compute performance metrics (like accuracy) across a set of resamples; the fitted models themselves are not stored. In our example, however, we save the predictions with control_resamples(save_pred = TRUE) so that we can visualize them later (confusion matrices, ROC curves and probability distributions).

Finally, we collect the performance metrics with collect_metrics() and pick the model that performs best across the resamples.

11.3.1 Logistic regression

We use our workflow object to perform resampling. Furthermore, we use metric_set() to choose some common classification performance metrics provided by the yardstick package. Visit yardstick's reference page to see the complete list of available metrics.

Note that Cohen’s kappa coefficient (κ) is a similar measure to accuracy, but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. The higher the value, the better.
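
To get a feel for the difference between accuracy and kappa, both can be computed with yardstick; here on the small two_class_example data set that ships with the package:

library(yardstick)

# accuracy and Cohen's kappa for yardstick's built-in example predictions
two_class_example %>% accuracy(truth, predicted)
two_class_example %>% kap(truth, predicted)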

log_res <- 
  log_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(
      save_pred = TRUE)
    ) 

11.3.1.1 Model coefficients

The method described above for obtaining log_res is fine if we are not interested in model coefficients. However, if we would like to extract the model coefficients from fit_resamples(), we need to proceed as follows:

# save model coefficients for a fitted model object from a workflow
get_model <- function(x) {
  pull_workflow_fit(x) %>% 
    tidy()
}

# same as before with one exception
log_res_2 <- 
  log_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(
      save_pred = TRUE,
      extract = get_model) # use extract and our new function
    ) 

log_res_2
#> # Resampling results
#> # 5-fold cross-validation using stratification 
#> # A tibble: 5 x 6
#>   splits          id    .metrics      .notes      .extracts     .predictions    
#>   <list>          <chr> <list>        <list>      <list>        <list>          
#> 1 <split [12.4K/… Fold1 <tibble [8 ×… <tibble [0… <tibble [1 ×… <tibble [3,097 …
#> 2 <split [12.4K/… Fold2 <tibble [8 ×… <tibble [0… <tibble [1 ×… <tibble [3,097 …
#> 3 <split [12.4K/… Fold3 <tibble [8 ×… <tibble [0… <tibble [1 ×… <tibble [3,096 …
#> 4 <split [12.4K/… Fold4 <tibble [8 ×… <tibble [0… <tibble [1 ×… <tibble [3,095 …
#> 5 <split [12.4K/… Fold5 <tibble [8 ×… <tibble [0… <tibble [1 ×… <tibble [3,095 …

Now there is a .extracts column with nested tibbles.

log_res_2$.extracts[[1]]
#> # A tibble: 1 x 2
#>   .extracts        .config             
#>   <list>           <chr>               
#> 1 <tibble [8 × 5]> Preprocessor1_Model1

To get the results use:

log_res_2$.extracts[[1]][[1]]
#> [[1]]
#> # A tibble: 8 x 5
#>   term                     estimate std.error statistic  p.value
#>   <chr>                       <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)                -2.02     0.0510  -39.6    0.      
#> 2 median_income              -1.91     0.0470  -40.6    0.      
#> 3 rooms_per_household         0.252    0.0376    6.70   2.14e-11
#> 4 population_per_household    0.485    0.0297   16.3    7.36e-60
#> 5 ocean_proximity_INLAND      2.92     0.0729   40.0    0.      
#> 6 ocean_proximity_ISLAND    -11.4    160.       -0.0714 9.43e- 1
#> # … with 2 more rows

All of the results can be flattened and collected using:

all_coef <- map_dfr(log_res_2$.extracts, ~ .x[[1]][[1]])

Show all of the resample coefficients for a single predictor:

filter(all_coef, term == "median_income")
#> # A tibble: 5 x 5
#>   term          estimate std.error statistic p.value
#>   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
#> 1 median_income    -1.91    0.0470     -40.6       0
#> 2 median_income    -1.91    0.0469     -40.6       0
#> 3 median_income    -1.89    0.0469     -40.4       0
#> 4 median_income    -1.96    0.0480     -40.8       0
#> 5 median_income    -1.90    0.0468     -40.5       0

11.3.1.2 Performance metrics

Show average performance over all folds:

log_res %>%  collect_metrics(summarize = TRUE)
#> # A tibble: 8 x 6
#>   .metric   .estimator  mean     n  std_err .config             
#>   <chr>     <chr>      <dbl> <int>    <dbl> <chr>               
#> 1 accuracy  binary     0.849     5 0.00283  Preprocessor1_Model1
#> 2 f_meas    binary     0.883     5 0.00243  Preprocessor1_Model1
#> 3 kap       binary     0.672     5 0.00569  Preprocessor1_Model1
#> 4 precision binary     0.871     5 0.000987 Preprocessor1_Model1
#> 5 recall    binary     0.896     5 0.00430  Preprocessor1_Model1
#> 6 roc_auc   binary     0.918     5 0.00117  Preprocessor1_Model1
#> # … with 2 more rows

Show performance for every single fold:

log_res %>%  collect_metrics(summarize = FALSE)
#> # A tibble: 40 x 5
#>   id    .metric   .estimator .estimate .config             
#>   <chr> <chr>     <chr>          <dbl> <chr>               
#> 1 Fold1 recall    binary         0.886 Preprocessor1_Model1
#> 2 Fold1 precision binary         0.868 Preprocessor1_Model1
#> 3 Fold1 f_meas    binary         0.877 Preprocessor1_Model1
#> 4 Fold1 accuracy  binary         0.842 Preprocessor1_Model1
#> 5 Fold1 kap       binary         0.658 Preprocessor1_Model1
#> 6 Fold1 sens      binary         0.886 Preprocessor1_Model1
#> # … with 34 more rows

11.3.1.3 Collect predictions

To obtain the actual model predictions, we use the function collect_predictions() and save the result as log_pred:

log_pred <- 
  log_res %>%
  collect_predictions()

11.3.1.4 Confusion matrix

Now we can use the predictions to create a confusion matrix with conf_mat():

log_pred %>%
  conf_mat(price_category, .pred_class) 
#>           Truth
#> Prediction above below
#>      above  8788  1306
#>      below  1025  4361

Additionally, the confusion matrix can quickly be visualized in different formats using autoplot(). Here with type = "mosaic":

log_pred %>%
  conf_mat(price_category, .pred_class) %>% 
  autoplot(type = "mosaic")

Or with type = "heatmap":

log_pred %>%
  conf_mat(price_category, .pred_class) %>% 
  autoplot(type = "heatmap")

11.3.1.5 ROC-Curve

We can also make an ROC curve for our 5 folds. Since the category we are predicting is the first level in the price_category factor (“above”), we provide roc_curve() with the relevant class probability .pred_above:

log_pred %>%
  group_by(id) %>% # id contains our folds
  roc_curve(price_category, .pred_above) %>% 
  autoplot()

Visit Google Developers’ Machine Learning Crash Course to learn more about the ROC curve.
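
If you also want the numeric area under the curve for each fold, roc_auc() can be applied to the same grouped predictions:

# area under the ROC curve, computed per fold
log_pred %>% 
  group_by(id) %>% 
  roc_auc(price_category, .pred_above)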

11.3.1.6 Probability distributions

Plot predicted probability distributions for our two classes.

log_res %>%
  collect_predictions() %>% 
  ggplot() +
  geom_density(aes(x = .pred_above, fill = price_category), 
               alpha = 0.5)

11.3.2 Random forest

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

rf_res <-
  rf_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
    ) 

rf_res %>%  collect_metrics(summarize = TRUE)
#> # A tibble: 8 x 6
#>   .metric   .estimator  mean     n std_err .config             
#>   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy  binary     0.858     5 0.00316 Preprocessor1_Model1
#> 2 f_meas    binary     0.889     5 0.00263 Preprocessor1_Model1
#> 3 kap       binary     0.691     5 0.00656 Preprocessor1_Model1
#> 4 precision binary     0.877     5 0.00185 Preprocessor1_Model1
#> 5 recall    binary     0.902     5 0.00435 Preprocessor1_Model1
#> 6 roc_auc   binary     0.926     5 0.00187 Preprocessor1_Model1
#> # … with 2 more rows

11.3.3 XGBoost

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

xgb_res <- 
  xgb_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
    ) 

xgb_res %>% collect_metrics(summarize = TRUE)
#> # A tibble: 8 x 6
#>   .metric   .estimator  mean     n  std_err .config             
#>   <chr>     <chr>      <dbl> <int>    <dbl> <chr>               
#> 1 accuracy  binary     0.859     5 0.00262  Preprocessor1_Model1
#> 2 f_meas    binary     0.890     5 0.00235  Preprocessor1_Model1
#> 3 kap       binary     0.694     5 0.00508  Preprocessor1_Model1
#> 4 precision binary     0.881     5 0.000480 Preprocessor1_Model1
#> 5 recall    binary     0.898     5 0.00485  Preprocessor1_Model1
#> 6 roc_auc   binary     0.928     5 0.00121  Preprocessor1_Model1
#> # … with 2 more rows

11.3.4 K-nearest neighbor

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

knn_res <- 
  knn_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
    ) 

knn_res %>% collect_metrics(summarize = TRUE)
#> # A tibble: 8 x 6
#>   .metric   .estimator  mean     n std_err .config             
#>   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy  binary     0.803     5 0.00437 Preprocessor1_Model1
#> 2 f_meas    binary     0.845     5 0.00361 Preprocessor1_Model1
#> 3 kap       binary     0.575     5 0.00926 Preprocessor1_Model1
#> 4 precision binary     0.844     5 0.00402 Preprocessor1_Model1
#> 5 recall    binary     0.846     5 0.00565 Preprocessor1_Model1
#> 6 roc_auc   binary     0.883     5 0.00289 Preprocessor1_Model1
#> # … with 2 more rows

11.3.5 Neural network

We don’t repeat all of the steps shown in logistic regression and just focus on the performance metrics.

nnet_res <- 
  nnet_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, 
      accuracy, kap,
      roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
    ) 

nnet_res %>% collect_metrics(summarize = TRUE)
#> # A tibble: 8 x 6
#>   .metric   .estimator  mean     n std_err .config             
#>   <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy  binary     0.857     5 0.00288 Preprocessor1_Model1
#> 2 f_meas    binary     0.888     5 0.00244 Preprocessor1_Model1
#> 3 kap       binary     0.689     5 0.00590 Preprocessor1_Model1
#> 4 precision binary     0.881     5 0.00173 Preprocessor1_Model1
#> 5 recall    binary     0.895     5 0.00428 Preprocessor1_Model1
#> 6 roc_auc   binary     0.927     5 0.00147 Preprocessor1_Model1
#> # … with 2 more rows

11.3.6 Compare models

Extract metrics from our models to compare them:

log_metrics <- 
  log_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Logistic Regression") # add the name of the model to every row

rf_metrics <- 
  rf_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Random Forest")

xgb_metrics <- 
  xgb_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "XGBoost")

knn_metrics <- 
  knn_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Knn")

nnet_metrics <- 
  nnet_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Neural Net")

# create dataframe with all models
model_compare <- bind_rows(log_metrics,
                           rf_metrics,
                           xgb_metrics,
                           knn_metrics,
                           nnet_metrics) 

# change data structure
model_comp <- 
  model_compare %>% 
  select(model, .metric, mean, std_err) %>% 
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show mean F1-Score for every model
model_comp %>% 
  arrange(mean_f_meas) %>% 
  mutate(model = fct_reorder(model, mean_f_meas)) %>% # order results
  ggplot(aes(model, mean_f_meas, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") +
   geom_text(
     size = 3,
     aes(label = round(mean_f_meas, 2), y = mean_f_meas + 0.08),
     vjust = 1
  )

# show mean area under the curve (auc) per model
model_comp %>% 
  arrange(mean_roc_auc) %>% 
  mutate(model = fct_reorder(model, mean_roc_auc)) %>%
  ggplot(aes(model, mean_roc_auc, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") + 
     geom_text(
     size = 3,
     aes(label = round(mean_roc_auc, 2), y = mean_roc_auc + 0.08),
     vjust = 1
  )

Note that the model results are all quite similar. In our example we choose the F1-Score as the performance measure for selecting the best model. Let’s find the maximum mean F1-Score:

model_comp %>% slice_max(mean_f_meas)
#> # A tibble: 1 x 17
#>   model mean_accuracy mean_f_meas mean_kap mean_precision mean_recall
#>   <chr>         <dbl>       <dbl>    <dbl>          <dbl>       <dbl>
#> 1 XGBo…         0.859       0.890    0.694          0.881       0.898
#> # … with 11 more variables: mean_roc_auc <dbl>, mean_sens <dbl>,
#> #   mean_spec <dbl>, std_err_accuracy <dbl>, std_err_f_meas <dbl>,
#> #   std_err_kap <dbl>, std_err_precision <dbl>, std_err_recall <dbl>,
#> #   std_err_roc_auc <dbl>, std_err_sens <dbl>, std_err_spec <dbl>

Now it’s time to fit our chosen model one last time to the full training set and evaluate the resulting final model on the test set. The mean F1-Scores of XGBoost and the random forest are virtually identical, so we proceed with the random forest workflow, which also provides the variable importance scores we requested when we set its engine.

11.4 Last evaluation on test set

Tidymodels provides the function last_fit() which fits a model to the whole training data and evaluates it on the test set. We just need to provide the workflow object of the best model as well as the data split object (not the training data).

last_fit_rf <- last_fit(rf_wflow, 
                        split = data_split,
                        metrics = metric_set(
                          recall, precision, f_meas, 
                          accuracy, kap,
                          roc_auc, sens, spec)
                        )

Show the performance metrics:

last_fit_rf %>% 
  collect_metrics()
#> # A tibble: 8 x 4
#>   .metric   .estimator .estimate .config             
#>   <chr>     <chr>          <dbl> <chr>               
#> 1 recall    binary         0.893 Preprocessor1_Model1
#> 2 precision binary         0.878 Preprocessor1_Model1
#> 3 f_meas    binary         0.885 Preprocessor1_Model1
#> 4 accuracy  binary         0.853 Preprocessor1_Model1
#> 5 kap       binary         0.682 Preprocessor1_Model1
#> 6 sens      binary         0.893 Preprocessor1_Model1
#> # … with 2 more rows

And these are our final performance metrics. Remember that if a model fit to the training data also performs well on the test data, minimal overfitting has taken place. This also seems to be the case in our example.
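
One way to put the cross-validation estimates and the test set results next to each other is sketched below (assuming the tidyverse is loaded):

# compare resampling estimates with the test set results side by side
bind_rows(
  rf_res %>% 
    collect_metrics() %>% 
    select(.metric, estimate = mean) %>% 
    mutate(data = "5-fold CV"),
  last_fit_rf %>% 
    collect_metrics() %>% 
    select(.metric, estimate = .estimate) %>% 
    mutate(data = "test set")
  ) %>% 
  pivot_wider(names_from = data, values_from = estimate)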

To learn more about the model we can access the variable importance scores via the .workflow column. We first need to pluck out the first element in the workflow column, then pull out the fit from the workflow object. Finally, the vip package helps us visualize the variable importance scores for the top features. Note that we can’t create this type of plot for every model engine.

library(vip)

last_fit_rf %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 10)

The two most important predictors of whether a district has a median house value above or below 150,000 dollars are whether the district is inland (ocean_proximity_INLAND) and the median income.

Take a look at the confusion matrix:

last_fit_rf %>%
  collect_predictions() %>% 
  conf_mat(price_category, .pred_class) %>% 
  autoplot(type = "heatmap")

Let’s create the ROC curve. Again, since the event we are predicting is the first level in the price_category factor (“above”), we provide roc_curve() with the relevant class probability .pred_above:

last_fit_rf %>% 
  collect_predictions() %>% 
  roc_curve(price_category, .pred_above) %>% 
  autoplot()

Based on all of the results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our random forest model with the selected hyperparameters would perform well when predicting new data.
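
If we wanted to use this final model to predict new districts, we could pull the fitted workflow out of the last_fit() result; a sketch (here we simply reuse the test set as stand-in "new" data):

# extract the fitted workflow from the last_fit() result
final_model <- last_fit_rf %>% 
  pluck(".workflow", 1)

# predict the class of new observations (here: the test set, for illustration)
predict(final_model, new_data = testing(data_split))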