6 Model building

6.1 Specify models

The process of specifying our models is always as follows:

  1. Pick a model type
  2. Set the engine
  3. Set the mode: regression or classification

You can choose the model type and engine from the list of models available in parsnip.
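
If you are not sure which engines are available for a given model type, parsnip can list them. A minimal sketch, shown here for linear regression:

library(parsnip)

# show the available engines (and their modes) for a model type
show_engines("linear_reg")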

6.1.1 Lasso regression

lasso_spec <- # your model specification
  linear_reg(penalty = 0.1, mixture = 1) %>%  # model type and some options
  set_engine(engine = "glmnet") %>%  # model engine
  set_mode("regression") # model mode

# Show your model specification
lasso_spec
#> Linear Regression Model Specification (regression)
#> 
#> Main Arguments:
#>   penalty = 0.1
#>   mixture = 1
#> 
#> Computational engine: glmnet

  • penalty: The total amount of regularization in the model. Higher values imply a higher penalty. If you choose a penalty of 0, you fit a standard linear regression model.

  • mixture: The mixture of different types of regularization. A number between zero and one (inclusive) that gives the proportion of L1 regularization (i.e. lasso) in the model. When mixture = 1, it is a pure lasso model, while mixture = 0 indicates that ridge regression is being used (this works only for the engines “glmnet” and “spark”); see the sketch below.
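
For comparison, here is a minimal sketch of a ridge and an elastic net specification (ridge_spec and enet_spec are hypothetical names and are not used later in this tutorial):

ridge_spec <- 
  linear_reg(penalty = 0.1, mixture = 0) %>%   # mixture = 0: pure ridge regression
  set_engine(engine = "glmnet") %>% 
  set_mode("regression")

enet_spec <- 
  linear_reg(penalty = 0.1, mixture = 0.5) %>% # 0 < mixture < 1: elastic net
  set_engine(engine = "glmnet") %>% 
  set_mode("regression")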

Note that for lasso regression to work properly, it is very important to always add a data normalization step.
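
Such a normalization step can be added with step_normalize() in a recipes pipeline. The following is only a minimal sketch (normalized_rec is a hypothetical name; our housing_rec recipe from the data preparation chapter may already contain such a step):

library(recipes)

# center and scale all numeric predictors
normalized_rec <- 
  recipe(median_house_value ~ ., data = train_data) %>% 
  step_normalize(all_numeric(), -all_outcomes())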

6.1.2 Natural spline

spline_spec <- 
  linear_reg() %>%  
  set_engine(engine = "lm") %>%  
  set_mode("regression")

To use this model correctly, we also need to add a data normalization step as well as a step that declares the degrees of freedom of our natural spline. We will include the degrees of freedom at a later step (when we create the workflows).

6.1.3 Random forest

library(ranger)

rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

6.1.4 Boosted tree (XGBoost)

library(xgboost)

xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("regression") 

6.1.5 K-nearest neighbor

knn_spec <- 
  nearest_neighbor(neighbors = 4) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("regression") 

6.2 Create workflows

To combine the data preparation recipe with the model building, we use the workflows package. A workflow is an object that can bundle together your pre-processing recipe, your model, and even post-processing requests (like calculating the RMSE).

6.2.1 Lasso

Bundle recipe and model with workflows:

lasso_wflow <- # new workflow object
 workflow() %>% # use workflow function
 add_recipe(housing_rec) %>%   # use the new recipe
 add_model(lasso_spec)   # add your model spec

6.2.2 Natural spline

We need to declare the degrees of freedom for our natural spline with step_ns(). In our example, we just add this new step to our housing_rec recipe and create a new recipe which we will use only for the natural spline.

library(splines)

housing_rec_spline <- 
  housing_rec %>%  
  step_ns(all_predictors(), deg_free = 3) # natural spline

The higher the degrees of freedom, the more complex the resulting model.
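
For illustration, a more flexible variant could look like the following minimal sketch (housing_rec_spline_flex is a hypothetical name and is not used later):

housing_rec_spline_flex <- 
  housing_rec %>%  
  step_ns(all_predictors(), deg_free = 10) # more degrees of freedom -> a more flexible (and more complex) fit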

Now we bundle the recipe and our model:

spline_wflow <- 
 workflow() %>% 
 add_recipe(housing_rec_spline) %>%   # use the spline recipe
 add_model(spline_spec) 

6.2.3 Random forest

Bundle recipe and model:

rf_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(rf_spec) 

6.2.4 XGBoost

Bundle recipe and model:

xgb_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(xgb_spec)

6.2.5 K-nearest neighbor

Bundle recipe and model:

knn_wflow <-
 workflow() %>%
 add_recipe(housing_rec) %>% 
 add_model(knn_spec)

6.3 Evaluate models

Now we can use our resamples (cv_folds) to estimate the performance of our models: the fit_resamples() function fits the models on each of the folds and stores the results.

Note that fit_resamples() fits our model to each resample and evaluates it on the held-out set of each resample. The function is usually used only for computing performance metrics (like the RMSE) across a set of resamples to evaluate our models - the fitted models themselves are not even stored. However, in our example we save the predictions with control_resamples(save_pred = TRUE) in order to visualize the model fit and the residuals.

Finally, we collect the performance metrics with collect_metrics() and pick the model that does best on the validation set.

6.3.1 Lasso regression

set.seed(100)

lasso_res <- 
  lasso_wflow %>% # use workflow object
  fit_resamples(resamples = cv_folds,
                control = control_resamples(save_pred = TRUE) # save predictions
    )

Show average performance over all folds:

lasso_res %>%  collect_metrics(summarize = TRUE)
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.357     5 0.00119 Preprocessor1_Model1
#> 2 rsq     standard   0.608     5 0.00356 Preprocessor1_Model1

Show performance for every single fold:

lasso_res %>%  collect_metrics(summarize = FALSE)
#> # A tibble: 10 x 5
#>   id    .metric .estimator .estimate .config             
#>   <chr> <chr>   <chr>          <dbl> <chr>               
#> 1 Fold1 rmse    standard       0.356 Preprocessor1_Model1
#> 2 Fold1 rsq     standard       0.599 Preprocessor1_Model1
#> 3 Fold2 rmse    standard       0.354 Preprocessor1_Model1
#> 4 Fold2 rsq     standard       0.613 Preprocessor1_Model1
#> 5 Fold3 rmse    standard       0.361 Preprocessor1_Model1
#> 6 Fold3 rsq     standard       0.608 Preprocessor1_Model1
#> # … with 4 more rows

To assess the model predictions, we plot the predictions on the y-axis and the real median house value on the x-axis. Note that the red line is not our model: if our model had made no mistakes at all, all points would lie on the red diagonal line (where the prediction equals the real value).

assess_res <- collect_predictions(lasso_res)

assess_res %>% 
  ggplot(aes(x = median_house_value, y = .pred)) + 
  geom_point(alpha = .15) +
  geom_abline(col = "red") + 
  coord_obs_pred() + 
  ylab("Predicted")

Let's look at the 10 districts where our model produced the largest residuals:

wrongest_prediction <- 
  assess_res %>% 
  mutate(residual = median_house_value - .pred) %>% 
  arrange(desc(abs(residual))) %>% 
  slice_head(n = 10)

wrongest_prediction
#> # A tibble: 10 x 6
#>   id    .pred  .row median_house_value .config              residual
#>   <chr> <dbl> <int>              <dbl> <chr>                   <dbl>
#> 1 Fold1  11.9  6523               9.62 Preprocessor1_Model1    -2.30
#> 2 Fold3  12.0  4197               9.77 Preprocessor1_Model1    -2.21
#> 3 Fold3  12.0  1348              10.0  Preprocessor1_Model1    -2.01
#> 4 Fold5  11.6  2084               9.62 Preprocessor1_Model1    -1.98
#> 5 Fold5  11.5  1868               9.62 Preprocessor1_Model1    -1.87
#> 6 Fold1  12.0 12383              10.3  Preprocessor1_Model1    -1.79
#> # … with 4 more rows

Show these observations in the training data:

train_data %>% 
  dplyr::slice(wrongest_prediction$.row) 
#> # A tibble: 10 x 8
#>   longitude latitude median_house_va… median_income ocean_proximity
#>       <dbl>    <dbl>            <dbl>         <dbl> <fct>          
#> 1     -118.     34.2            14999          4.19 INLAND         
#> 2     -118.     34.2            17500          2.37 <1H OCEAN      
#> 3     -122.     37.9            22500          2.68 NEAR BAY       
#> 4     -117.     36.4            14999          2.1  INLAND         
#> 5     -123.     39.7            14999          1.66 INLAND         
#> 6     -121.     34.7            28300          2.74 NEAR OCEAN     
#> # … with 4 more rows, and 3 more variables: bedrooms_per_room <dbl>,
#> #   rooms_per_household <dbl>, population_per_household <dbl>

In this tutorial, we don't investigate the reasons for these wrong predictions any further. In reality, we would check whether some of the districts are outliers in comparison to the rest of our data and would need to decide whether to drop some of the cases from the data (if there are good reasons to do so).
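
A quick first check could compare the worst-predicted districts with the remaining training data; this is only a minimal sketch (the column worst exists only for this comparison):

library(dplyr)

# compare the worst-predicted districts with the rest of the training data
train_data %>% 
  mutate(worst = row_number() %in% wrongest_prediction$.row) %>% 
  group_by(worst) %>% 
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)))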

6.3.2 Natural spline

We don't repeat all of the steps shown for the lasso regression and just focus on the performance metrics.

spline_res <-
  spline_wflow %>% 
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(save_pred = TRUE)
    )

spline_res %>%  collect_metrics(summarize = TRUE)
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.308     4 0.00101 Preprocessor1_Model1
#> 2 rsq     standard   0.665     4 0.00199 Preprocessor1_Model1

6.3.3 Random forest

We don't repeat all of the steps shown for the lasso regression and just focus on the performance metrics.

rf_res <-
  rf_wflow %>% 
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(save_pred = TRUE)
    )

rf_res %>%  collect_metrics(summarize = TRUE)
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.309     5 0.00135 Preprocessor1_Model1
#> 2 rsq     standard   0.673     5 0.00144 Preprocessor1_Model1

6.3.4 XGBoost

We don't repeat all of the steps shown for the lasso regression and just focus on the performance metrics.

xgb_res <- 
  xgb_wflow %>% 
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(save_pred = TRUE)
    ) 

xgb_res %>% collect_metrics(summarize = TRUE)
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.304     5 0.00158 Preprocessor1_Model1
#> 2 rsq     standard   0.685     5 0.00102 Preprocessor1_Model1

6.3.5 K-nearest neighbor

We don't repeat all of the steps shown for the lasso regression and just focus on the performance metrics.

knn_res <- 
  knn_wflow %>% 
  fit_resamples(
    resamples = cv_folds,
    control = control_resamples(save_pred = TRUE)
    ) 

knn_res %>% collect_metrics(summarize = TRUE)
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   0.338     5 0.00136 Preprocessor1_Model1
#> 2 rsq     standard   0.611     5 0.00184 Preprocessor1_Model1

6.3.6 Compare models

Extract the RMSE from our models to compare them:

lasso_rmse <- 
  lasso_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "lasso")

spline_rmse <- 
  spline_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "spline")

rf_rmse <- 
  rf_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "random forest")

xgb_rmse <- 
  xgb_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "XGBoost")

knn_rmse <- 
  knn_res %>% 
  collect_metrics(summarize = TRUE) %>%
  mutate(model = "Knn")

# create dataframe with all models
model_compare <- bind_rows(lasso_rmse,
                           spline_rmse,
                           rf_rmse,
                           xgb_rmse,
                           knn_rmse) 

# change data structure
model_comp <- 
  model_compare %>% 
  select(model, .metric, mean, std_err) %>% 
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show rmse 
model_comp %>% 
  arrange(mean_rmse) %>% 
  mutate(model = fct_reorder(model, mean_rmse)) %>%
  ggplot(aes(model, mean_rmse, fill=model)) +
  geom_col() +
  scale_fill_brewer(palette = "Blues")

# show rsq 
model_comp %>% 
  arrange(mean_rsq) %>% 
  mutate(model = fct_reorder(model, desc(mean_rsq))) %>%
  ggplot(aes(model, mean_rsq, fill=model)) +
  geom_col() +
  scale_fill_brewer(palette = "Blues")

Note that the model results are quite similar.

# find minimum rmse
model_comp %>% 
  slice_min(mean_rmse)
#> # A tibble: 1 x 5
#>   model   mean_rmse mean_rsq std_err_rmse std_err_rsq
#>   <chr>       <dbl>    <dbl>        <dbl>       <dbl>
#> 1 XGBoost     0.304    0.685      0.00158     0.00102

Now it’s time to fit the best model (in our case the XGBoost model) one last time to the full training set and evaluate the resulting final model on the test set.

6.4 Last fit and evaluation on test set

Tidymodels provides the function last_fit(), which fits a model to the full training data and evaluates it on the test set. We just need to provide the workflow object of the best model as well as the data split object (not the training data).

last_fit_xgb <- last_fit(xgb_wflow, split = data_split)

# Show RMSE and RSQ
last_fit_xgb %>% 
  collect_metrics()
#> # A tibble: 2 x 4
#>   .metric .estimator .estimate .config             
#>   <chr>   <chr>          <dbl> <chr>               
#> 1 rmse    standard       0.305 Preprocessor1_Model1
#> 2 rsq     standard       0.682 Preprocessor1_Model1

And this is our final result. Remember that if a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. This also seems to be the case in our example.
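
If we want to take a closer look at the final model, we can also plot its test-set predictions, mirroring the plot we made for the resamples above (a minimal sketch):

# collect the test-set predictions of the final model
last_fit_xgb %>% 
  collect_predictions() %>% 
  ggplot(aes(x = median_house_value, y = .pred)) + 
  geom_point(alpha = .15) +
  geom_abline(col = "red") + 
  coord_obs_pred() + 
  ylab("Predicted")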