Machine Learning with R-tidymodels: classification

machine learning
Published: November 19, 2021

Last week I shared some examples of regression models, following the very good Machine Learning workshop organized by the Rbootcamp. This week I'm continuing with classification models.

setup

library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(patchwork)
tidymodels_prefer()

logistic regression

sample

airbnb <- read_csv(file = "data/airbnb.csv")
Rows: 1191 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): district, host_respons_time, kitchen, tv, coffe_machine, dishwashe...
dbl (14): price, accommodates, bedrooms, bathrooms, cleaning_fee, availabili...
lgl  (1): host_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airbnb <-
  airbnb %>% 
  mutate(host_superhost = factor(host_superhost, levels = c(TRUE, FALSE)))

set.seed(123)
airbnb_split <- initial_split(airbnb, prop = .8, strata = host_superhost)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
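
The strata argument is there to keep the share of superhosts roughly the same in the training and test sets. A minimal sketch of how one might check this, on invented toy data rather than the Airbnb file (the names toy, toy_split, train_prop and test_prop are only for illustration):

```r
library(tidymodels)

set.seed(123)

# Toy data (invented): 20% "yes", 80% "no"
toy <- data.frame(
  y = factor(rep(c("yes", "no"), times = c(200, 800)), levels = c("yes", "no")),
  x = rnorm(1000)
)

# Same call pattern as above: 80/20 split, stratified on the outcome
toy_split <- initial_split(toy, prop = 0.8, strata = y)

# Class proportions in each part of the split
train_prop <- prop.table(table(training(toy_split)$y))
test_prop <- prop.table(table(testing(toy_split)$y))

train_prop
test_prop
```

With the objects of this post, the same check would be prop.table(table(airbnb_train$host_superhost)) and its airbnb_test counterpart.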

recipe

logistic_recipe <- 
  recipe(host_superhost ~ ., data = airbnb_train) %>% 
  step_dummy(all_nominal_predictors())
logistic_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         22

Operations:

Dummy variables from all_nominal_predictors()
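
To see what step_dummy() does to the nominal predictors, the recipe can be prepped and baked. Here is a small sketch on an invented data frame (toy, district, price and baked are illustrative names, not columns I checked in the Airbnb data):

```r
library(tidymodels)

# Toy data (invented): one nominal predictor with three levels
toy <- data.frame(
  superhost = factor(c("yes", "no", "yes", "no")),
  district = factor(c("a", "b", "c", "a")),
  price = c(50, 80, 120, 65)
)

baked <- recipe(superhost ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%               # estimate the step on the data given to recipe()
  bake(new_data = NULL)    # return the processed version of that same data

# `district` is replaced by indicator columns; the first level is the reference
names(baked)
```

The first factor level ("a" here) serves as the reference and gets no column of its own, which is what a glm needs to avoid perfectly collinear predictors.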

model

logistic_model <-
  logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

translate(logistic_model)
Logistic Regression Model Specification (classification)

Computational engine: glm 

Model fit template:
stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    family = stats::binomial)

workflow

logistic_workflow <- 
  workflow() %>% 
  add_recipe(logistic_recipe) %>% 
  add_model(logistic_model)
logistic_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

fit

superhost_glm <-
  logistic_workflow %>% 
  fit(airbnb_train)

predict

logistic_pred <- 
  predict(superhost_glm, airbnb_train, type = "prob") %>% 
  bind_cols(predict(superhost_glm, airbnb_train)) %>% 
  bind_cols(airbnb_train %>% select(host_superhost))

metrics

logistic_metrics <- metrics(logistic_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
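
The metrics above are computed on the training data, so they tend to be optimistic. A sketch of the same predict-and-score pattern applied to both the training and the held-out part of a split, written on simulated data so that it runs on its own (toy, toy_fit and evaluate() are hypothetical names for this illustration):

```r
library(tidymodels)

set.seed(123)
n_obs <- 500

# Simulated two-class data (invented): the outcome depends on x1 and x2
toy <- tibble(x1 = rnorm(n_obs), x2 = rnorm(n_obs)) %>%
  mutate(
    p = plogis(1.5 * x1 - x2),
    y = factor(if_else(runif(n_obs) < p, "yes", "no"), levels = c("yes", "no"))
  ) %>%
  select(-p)

toy_split <- initial_split(toy, prop = 0.8, strata = y)

toy_fit <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(training(toy_split))

# Same steps as above: class probabilities, hard classes, truth, then metrics()
evaluate <- function(data) {
  predict(toy_fit, data, type = "prob") %>%
    bind_cols(predict(toy_fit, data)) %>%
    bind_cols(data %>% select(y)) %>%
    metrics(truth = y, estimate = .pred_class, .pred_yes)
}

train_metrics <- evaluate(training(toy_split))
test_metrics <- evaluate(testing(toy_split))

train_metrics
test_metrics
```

With the objects of this post, the equivalent would be to replace airbnb_train by airbnb_test in the predict step and compare the two sets of metrics.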

plot

create_model_plot <- function(prediction_data, model_metrics, title_text) {
  annotation_data <- tibble(
    x_position = 0.65,
    y_position = c(0.1,0.2,0.3,0.4),
    label_value = str_glue_data(model_metrics, "{.metric}: {round(.estimate, 2)}")
  )
  
  prediction_data %>%
    roc_curve(truth = host_superhost, .pred_TRUE) %>%
    autoplot() +
    labs(title = as.character(title_text)) +
    geom_text(
      data = annotation_data,
      mapping = aes(x = x_position, y = y_position, label = label_value),
      size = 3
    )
}
lg_plot <- create_model_plot(logistic_pred, logistic_metrics, "ROC logistic reg.")

decision tree

recipe

tree_recipe <-
  recipe(host_superhost ~ ., data = airbnb_train) %>% 
  step_other(all_nominal_predictors(), threshold = 0.005)

model

dt_model <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

workflow

dt_workflow <- 
  workflow() %>% 
  add_recipe(tree_recipe) %>% 
  add_model(dt_model)

fit

superhost_dt <-
  dt_workflow %>% 
  fit(airbnb_train)
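
A fitted tree is easiest to read as a picture, which is what rpart.plot is for. A minimal sketch, using the built-in mtcars data as a stand-in so that it runs without the Airbnb file (cars, toy_tree and tree_engine are illustrative names):

```r
library(tidymodels)
library(rpart.plot)

# Stand-in data: transmission type (`am`) as a two-class outcome
cars <- mtcars %>%
  mutate(am = factor(am, levels = c(1, 0), labels = c("manual", "automatic")))

toy_tree <- workflow() %>%
  add_formula(am ~ .) %>%
  add_model(decision_tree() %>% set_engine("rpart") %>% set_mode("classification")) %>%
  fit(cars)

# Pull the underlying rpart object out of the workflow and plot it
tree_engine <- extract_fit_engine(toy_tree)
rpart.plot(tree_engine, roundint = FALSE)
```

Applied to this post, the same two lines would be superhost_dt %>% extract_fit_engine() %>% rpart.plot(roundint = FALSE); roundint = FALSE is there because rpart.plot may otherwise warn that it cannot recover the original data from a parsnip fit.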

predict

dt_pred <- 
  predict(superhost_dt, airbnb_train, type = "prob") %>% 
  bind_cols(predict(superhost_dt, airbnb_train)) %>% 
  bind_cols(airbnb_train %>% select(host_superhost))

metrics

dt_metrics <- metrics(dt_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)

plot

dt_plot <- create_model_plot(dt_pred, dt_metrics, "ROC Decision tree")

random forest

model

rf_model <- 
  rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

workflow

rf_workflow <- 
  workflow() %>% 
  add_recipe(tree_recipe) %>% 
  add_model(rf_model)

fit

superhost_rf <-
  rf_workflow %>% 
  fit(airbnb_train)

predict

rf_pred <- 
  predict(superhost_rf, airbnb_train, type = "prob") %>% 
  bind_cols(predict(superhost_rf, airbnb_train)) %>% 
  bind_cols(airbnb_train %>% select(host_superhost))

metrics

rf_metrics <- metrics(rf_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)

plot

rf_plot <- create_model_plot(rf_pred, rf_metrics, "ROC Random forest")

metrics overview

lg_plot + dt_plot + rf_plot

In classification models, performance is typically summarized with the following metrics:

  • ROC AUC: area under the receiver operating characteristic (ROC) curve
  • accuracy: the proportion of observations that are predicted correctly
  • KAP (kappa): similar to accuracy, but normalized by the accuracy that would be expected by chance alone
  • LogLoss: the classification counterpart of MSE and MAE (unlike accuracy, the logarithmic loss takes into account the uncertainty of the predicted probabilities)

See ?metrics for details.

Additional important definitions:

  • sensitivity: of the truly positive cases only, the proportion classified as positive
  • specificity: of the truly negative cases only, the proportion classified as negative
  • overlap: the proportion of times the model fails to predict correctly
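
To make these definitions concrete, here is a small worked example in base R. The confusion counts and probabilities are invented for illustration and do not come from the Airbnb models:

```r
# Toy confusion counts (invented): true/false positives and negatives
tp <- 40; fn <- 10; fp <- 20; tn <- 30
n <- tp + fn + fp + tn

accuracy <- (tp + tn) / n          # 70 correct out of 100
sensitivity <- tp / (tp + fn)      # 40 of the 50 true positives found
specificity <- tn / (tn + fp)      # 30 of the 50 true negatives found

# Agreement expected by chance alone, from the predicted and actual totals
expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

# Log loss on four hypothetical predictions (truth coded 1 = positive)
truth <- c(1, 1, 0, 0)
prob <- c(0.9, 0.6, 0.2, 0.7)      # predicted probability of the positive class
log_loss <- -mean(truth * log(prob) + (1 - truth) * log(1 - prob))

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa, log_loss = log_loss), 3)
```

Here accuracy is 0.7 but kappa is only 0.4, because half of the agreement would already be expected by chance. The last prediction (0.7 for a true negative) is confidently wrong and dominates the log loss, which is exactly the uncertainty that accuracy ignores.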