Machine Learning with R-tidymodels: regression

machine learning
Published

November 12, 2021

In a previous article I presented the tidymodels framework and the pipeline I adopted to apply predictive models to my datasets.

Below I share an interesting case from the RBootcamp training that follows these steps and applies three different models to a regression problem.

setup

library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(patchwork)
tidymodels_prefer()

linear regression

sample

airbnb <- read_csv(file = "data/airbnb.csv")
Rows: 1191 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): district, host_respons_time, kitchen, tv, coffe_machine, dishwashe...
dbl (14): price, accommodates, bedrooms, bathrooms, cleaning_fee, availabili...
lgl  (1): host_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
airbnb_split <- initial_split(airbnb, prop = 0.8)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)

recipe

airbnb_recipe <- recipe(
  formula = price ~ accommodates,
  airbnb
  )
airbnb_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          1

model

lm_model <-
  linear_reg() %>%
  set_engine('lm') %>% 
  set_mode("regression")

translate(lm_model)
Linear Regression Model Specification (regression)

Computational engine: lm 

Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

workflow

lm_workflow <-
  workflow() %>% 
  add_recipe(airbnb_recipe) %>% 
  add_model(lm_model)

lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

fit

price_lm <-
  lm_workflow %>% 
  fit(airbnb)

tidy(price_lm)
# A tibble: 2 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     -13.5      4.04     -3.34 8.60e-  4
2 accommodates     27.6      1.12     24.6  1.15e-108

predict

lm_pred <-
  price_lm %>% 
  predict(new_data = airbnb) %>% 
  bind_cols(airbnb) # %>% select(price))
  
lm_pred
# A tibble: 1,191 × 24
   .pred price accommodates bedrooms bathrooms cleaning_fee availability_90_days
   <dbl> <dbl>        <dbl>    <dbl>     <dbl>        <dbl>                <dbl>
 1  69.2    99            3        1       2             30                    3
 2  96.7    61            4        1       1             35                    0
 3  41.6    50            2        2       0.5            0                    0
 4  41.6    30            2        1       1.5           25                   31
 5  41.6    60            2        1       1             40                   87
 6  41.6    45            2        1       1.5           10                    0
 7  41.6    32            2        0       1              0                   14
 8 152.     62            6        2       1             30                   76
 9  41.6    50            2        1       1              0                   53
10  41.6    60            2        1       1              0                   16
# … with 1,181 more rows, and 17 more variables: district <chr>,
#   host_respons_time <chr>, host_response_rate <dbl>, host_superhost <lgl>,
#   host_listings_count <dbl>, review_scores_accuracy <dbl>,
#   review_scores_cleanliness <dbl>, review_scores_checkin <dbl>,
#   review_scores_communication <dbl>, review_scores_location <dbl>,
#   review_scores_value <dbl>, kitchen <chr>, tv <chr>, coffe_machine <chr>,
#   dishwasher <chr>, terrace <chr>, bathtub <chr>

metrics

metrics_model_lm <- lm_pred %>% 
  metrics(truth = price, estimate = .pred)

plot y ~ x

lm_pred %>% 
  ggplot(aes(y = price, x = accommodates)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

plot true vs predicted

Here we create a function that generates a true-vs-predicted plot; we will use it for each model and combine the plots at the end.

create_model_plot <- function(prediction_data, model_metrics, title_text) {
  annotation_data <- tibble(
    x_position = 100,
    y_position = c(400, 450, 500),
    label_value = str_glue_data(model_metrics, "{.metric}: {round(.estimate, 2)}")
  )
  
  prediction_data %>%
    ggplot(aes(x = .pred, y = price)) +
    geom_abline(lty = 2) +
    geom_point(alpha = 0.5) +
    geom_text(
      data = annotation_data,
      mapping = aes(x = x_position, y = y_position, label = label_value),
      size = 3
    ) +
    labs(
      title = as.character(title_text),
      caption = "Line = perfect performance",
      x = "Predicted Prices in $",
      y = "True Prices in $"
    ) +
    coord_obs_pred(ratio = 1) + # Scale and size the x- and y-axis uniformly:
    coord_cartesian(xlim = c(0, 500), ylim = c(0, 500)) + # note: this replaces coord_obs_pred() above, hence the message printed after each plot
    theme_light()
} 
lm_plot <- create_model_plot(lm_pred, metrics_model_lm, "Linear Regression")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

linear regression (update)

recipe

airbnb_recipe <- recipe(
  formula = price ~ .,
  airbnb_train
  ) %>% 
  step_dummy(all_nominal_predictors())
airbnb_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         22

Operations:

Dummy variables from all_nominal_predictors()

workflow

lm_workflow <-
  lm_workflow %>% 
  update_recipe(airbnb_recipe)

lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

fit

price_lm <-
  lm_workflow %>% 
  fit(airbnb_train)

predict

lm_pred <-
  price_lm %>% 
  predict(new_data = airbnb_train) %>% 
  bind_cols(airbnb_train %>% select(price))

metrics

metrics_model_lm2 <- lm_pred %>% 
  metrics(truth = price, estimate = .pred)

plot

lm_plot2 <- create_model_plot(lm_pred, metrics_model_lm2, "Linear Regression 2")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

decision tree

recipe

tree_recipe <- recipe(
  formula = price ~ .,
  airbnb_train
  ) %>% 
  step_other(all_nominal_predictors(), threshold = 0.005)
tree_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         22

Operations:

Collapsing factor levels for all_nominal_predictors()

model

dt_model <-
  decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('regression')

dt_model
Decision Tree Model Specification (regression)

Computational engine: rpart 
translate(dt_model)
Decision Tree Model Specification (regression)

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

workflow

dt_workflow <-
  workflow() %>% 
  add_recipe(tree_recipe) %>% 
  add_model(dt_model)

dt_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_other()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)

Computational engine: rpart 

fit

price_dt <- 
  dt_workflow %>% 
  fit(data = airbnb_train)

price_dt %>% 
  extract_fit_parsnip() %>% 
  pluck("fit")
n= 952 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 952 9279565.0  71.40336  
   2) accommodates< 11.5 942 1939145.0  65.97665  
     4) bedrooms< 1.5 759  649554.8  53.89592  
       8) cleaning_fee< 35.5 589  315195.7  47.13243 *
       9) cleaning_fee>=35.5 170  214063.6  77.32941 *
     5) bedrooms>=1.5 183  719389.8 116.08200  
      10) cleaning_fee< 49.5 87  116223.1  88.41379 *
      11) cleaning_fee>=49.5 96  476208.7 141.15620 *
   3) accommodates>=11.5 10 4699458.0 582.60000 *

tree plot

price_dt %>% 
  extract_fit_parsnip() %>% 
  pluck("fit") %>% 
  rpart.plot()
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
    Call rpart.plot with roundint=FALSE,
    or rebuild the rpart model with model=TRUE.
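
As the message suggests, the warning can be avoided by keeping the model frame. A minimal sketch of how this could be done (my own variant, not part of the original pipeline) is to pass model = TRUE to rpart through set_engine():

dt_model_keep <-
  decision_tree() %>%
  set_engine("rpart", model = TRUE) %>% # keep the model frame so rpart.plot() can inspect the data
  set_mode("regression")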

The number of distinct values a regression tree can predict is equal to the number of terminal nodes, i.e. the number of splits + 1. In this case the tree has 4 splits, so it can predict 5 price buckets.
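
As a quick check (my own addition, reusing the fitted workflow from above), we can list the distinct prices the tree actually predicts:

price_dt %>%
  predict(new_data = airbnb_train) %>% # one prediction per training row
  distinct(.pred)                      # only 5 distinct values, one per terminal node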

dt_pred <-
  price_dt %>% 
  predict(new_data = airbnb_train) %>% 
  bind_cols(airbnb_train %>% select(price))
metrics_model_dt <- dt_pred %>% 
  metrics(truth = price, estimate = .pred)
dt_plot <- create_model_plot(dt_pred, metrics_model_dt, "Decision tree")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

random forest

Random forests are a collection of decision trees. Each tree is grown on a random subset of the predictors and is not pruned. Once all the trees are grown (500 by default), their predictions are averaged.
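
These defaults can be changed through the main arguments of rand_forest(). The values below are only illustrative (my own sketch), not the ones used in this post:

rf_model_custom <-
  rand_forest(trees = 1000, mtry = 5, min_n = 10) %>% # more trees, 5 candidate predictors per split
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")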

model

rf_model <-
  rand_forest() %>%
  set_engine('ranger', importance = "impurity") %>%
  set_mode('regression')

translate(rf_model)
Random Forest Model Specification (regression)

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 

Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(), 
    importance = "impurity", num.threads = 1, verbose = FALSE, 
    seed = sample.int(10^5, 1))

workflow

rf_workflow <-
  workflow() %>% 
  add_recipe(tree_recipe) %>% 
  add_model(rf_model)

fit

price_rf <- 
  rf_workflow %>% 
  fit(data = airbnb_train)
price_rf %>% 
  extract_fit_parsnip() %>% 
  pluck("fit")
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      952 
Number of independent variables:  22 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       5200.677 
R squared (OOB):                  0.4670177 

predict

rf_pred <-
  price_rf %>% 
  predict(new_data = airbnb_train) %>% 
  bind_cols(airbnb_train %>% select(price))

metrics

metrics_model_rf <- rf_pred %>% 
  metrics(truth = price, estimate = .pred)

plot

rf_plot <- create_model_plot(rf_pred, metrics_model_rf, "Random forest")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

metrics overview

In supervised models there is a trade-off between transparency and performance. As we will see in this example, the Random Forest model performs better, but this comes at the cost of transparency: linear regressions and decision trees are transparent, while Random Forests are not.

(lm_plot + lm_plot2) / (dt_plot + rf_plot)
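
Because the engine was set up with importance = "impurity", we can at least partly look inside the random forest through its variable importance scores. This is my own addition, not part of the original comparison:

price_rf %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  ranger::importance() %>%  # named vector of impurity-based importance scores
  sort(decreasing = TRUE) %>%
  head(5)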

The metrics are usually given by loss functions, which summarize the errors the model makes. Typical metrics for regression models are:

  • Mean Squared Error (MSE): the average squared distance between predictions and true values
  • Mean Absolute Error (MAE): the average absolute distance between predictions and true values
  • R-squared: one minus the ratio of the residual variance to the variance of the outcome variable

The MSE weights points that are further away more strongly, because large errors are penalized quadratically.
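
Note that yardstick's metrics() reports the root mean squared error (RMSE, the square root of the MSE) rather than the MSE itself. As a sketch of what these definitions mean (my own addition), the three quantities can be computed by hand from the predictions; the manual R-squared below uses the classical 1 - SSE/SST definition, which can differ slightly from yardstick's correlation-based rsq():

lm_pred %>%
  summarise(
    mse = mean((price - .pred)^2),                                   # average squared error
    mae = mean(abs(price - .pred)),                                  # average absolute error
    rsq = 1 - sum((price - .pred)^2) / sum((price - mean(price))^2)  # 1 - SSE/SST
  )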

Note that, in general, performance on training data is deceiving: it can always be improved with a more sophisticated model, and if the model is too complex it will start fitting the noise in the data. In this example we measured performance on the training data, but in later cases this is done on the test data obtained from the initial split.
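
A minimal sketch of that last step (my addition, reusing the objects already created) would be to predict on the held-out test set and compute the same metrics:

rf_test_pred <-
  price_rf %>%
  predict(new_data = airbnb_test) %>%       # predictions on unseen data
  bind_cols(airbnb_test %>% select(price))

rf_test_pred %>%
  metrics(truth = price, estimate = .pred)  # rmse, rsq and mae on the test set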

Variable intercorrelation has not been assessed here. This is less of a concern in machine learning; in statistics it is important because it inflates the standard errors of the estimates and makes it harder to assess the significance of each predictor.
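
If one did want to take a quick look at intercorrelation, a simple starting point (my own sketch, not part of the original analysis) is the correlation matrix of the numeric predictors:

airbnb_train %>%
  select(where(is.numeric), -price) %>%   # numeric predictors only
  cor(use = "pairwise.complete.obs") %>%  # pairwise correlations
  round(2)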