Machine Learning with R-tidymodels: overview

Published: November 5, 2021

introduction

I’ve just attended a very good workshop on Machine Learning organized by the Rbootcamp. The two instructors, Dirk Wulff and Markus Steiner, proved to be very knowledgeable in R and managed to keep the audience engaged throughout the course.

The course started with a short review of ML and then quickly dived into the practical details. Most explanations and exercises were based on applying the R framework tidymodels to predict house prices, in a case study built on Airbnb data. Specifically, these covered supervised learning approaches with examples of the topics below (the metrics map to {yardstick} functions, as sketched after the list):

  • regression: linear regression, decision tree, random forest
  • classification: logistic regression, decision tree, random forest
  • model assessment on training and test datasets
  • regression metrics: rmse, rsq, mae
  • classification metrics: accuracy, kappa, log loss, roc auc
  • plotting: regression, trees, ROC curve
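
As a pointer to what these map to in code, here is a minimal sketch, assuming only that the tidymodels packages are installed, of bundling the regression and classification metrics with {yardstick}’s metric_set():

```r
# The course metrics correspond to {yardstick} functions; metric_set()
# bundles them so they can be computed in one call on a predictions tibble.
library(yardstick)

# regression metrics: rmse, rsq (R squared), mae
reg_metrics <- metric_set(rmse, rsq, mae)

# classification metrics: accuracy, kappa (kap), log loss, ROC AUC
class_metrics <- metric_set(accuracy, kap, mn_log_loss, roc_auc)
```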

For clear explanations and details on the concepts, it is worth going through the excellent book An Introduction to Statistical Learning by James et al. (2021).

workflow

The tidymodels framework helps a lot in structuring the work. On this basis, I’ve prepared for myself a pipeline with the following steps:

sample > recipe > model > workflow > tune > fit > predict > metrics > plot
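
To make the sequence concrete, here is a minimal end-to-end sketch. It uses the ames housing data from {modeldata} (bundled with tidymodels) as a stand-in for the workshop’s Airbnb dataset; the formula and columns are illustrative choices of mine, not the course material:

```r
# Minimal end-to-end pipeline: sample > recipe > model > workflow >
# fit > predict > metrics, on the ames housing data.
library(tidymodels)

data(ames, package = "modeldata")

# sample: split into training and test sets
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# recipe: outcome, predictors and preprocessing steps
ames_recipe <- recipe(Sale_Price ~ Gr_Liv_Area + Neighborhood,
                      data = ames_train) %>%
  step_dummy(all_nominal_predictors())

# model: plain linear regression with the lm engine
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# workflow: bundle recipe and model, then fit on the training set
lm_wflow <- workflow() %>%
  add_recipe(ames_recipe) %>%
  add_model(lm_model)

lm_fit <- fit(lm_wflow, data = ames_train)

# predict + metrics: evaluate on the held-out test set
lm_preds <- predict(lm_fit, new_data = ames_test) %>%
  bind_cols(ames_test)

metrics(lm_preds, truth = Sale_Price, estimate = .pred)
```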

packages and functions

This pipeline can be further detailed by listing the associated packages and functions; a sketch of the tuning and fitting steps follows the list:

  • sample
    • {rsample}
      • initial_split()
      • training()
      • testing()
      • vfold_cv()
      • bootstraps()
  • recipe
    • {recipes}
      • recipe()
      • step_dummy()
  • model
    • {parsnip}
      • linear_reg()
      • rand_forest()
      • set_engine()
      • show_engines()
      • set_mode()
      • translate()
  • workflow
    • {workflows}
      • workflow()
      • add_recipe()
      • add_model()
  • tune
    • {dials}
      • grid_regular()
      • mixture()
      • penalty()
    • {tune}
      • tune_grid()
      • fit_resamples()
      • collect_metrics()
      • select_best()
      • finalize_workflow()
  • fit
    • {parsnip}
      • fit()
    • {tune}
      • last_fit()
  • predict
    • {stats}
      • predict()
  • metrics
    • {yardstick}
      • metrics()
    • {broom}
      • tidy()
  • plot
    • {ggplot2}
      • ggplot()
      • autoplot()
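
As announced above, here is a sketch of the tune and fit steps. It reuses ames_split, ames_train and ames_recipe from the earlier example and tunes a glmnet elastic net, so penalty and mixture are the {dials} parameters being searched:

```r
# Tuning sketch: tune an elastic net over a regular grid with
# cross-validation, then finalize and evaluate on the test split.
library(tidymodels)

# model with tunable parameters, glmnet engine
glmnet_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

glmnet_wflow <- workflow() %>%
  add_recipe(ames_recipe) %>%
  add_model(glmnet_model)

# resamples ({rsample}) and a regular grid over the {dials} parameters
set.seed(123)
ames_folds  <- vfold_cv(ames_train, v = 5)
glmnet_grid <- grid_regular(penalty(), mixture(), levels = 5)

# tune, inspect, pick the best candidate and finalize the workflow
glmnet_tuned <- tune_grid(glmnet_wflow,
                          resamples = ames_folds,
                          grid      = glmnet_grid)
collect_metrics(glmnet_tuned)

best_params <- select_best(glmnet_tuned, metric = "rmse")
final_wflow <- finalize_workflow(glmnet_wflow, best_params)

# last_fit() refits on the full training set and scores the test split
final_fit <- last_fit(final_wflow, split = ames_split)
collect_metrics(final_fit)
```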

model tuning

A very important phase is of course model tuning, which can be done with a wide variety of models, each with its own engines and tuning parameters. A first summary is given in the table below:

model              function        engine  tuning parameters
linear regression  linear_reg      lm      (none)
ridge regression   linear_reg      glmnet  penalty (mixture = 0)
lasso regression   linear_reg      glmnet  penalty (mixture = 1)
decision tree      decision_tree   rpart   cost_complexity
random forest      rand_forest     ranger  mtry

definitions:

  • penalty: the amount of regularization, i.e. lambda in the ridge and lasso formulas
  • mixture: the proportion of lasso penalty in the elastic net (0 = pure ridge, 1 = pure lasso)
  • mtry: the number of predictors sampled at each split; a small mtry gives a more diverse forest, a large mtry a more similar one
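
For the tree-based rows of the table, a minimal sketch of the corresponding parsnip specifications, with the tuning parameters flagged via tune():

```r
# Tree-based model specifications, tuning parameters marked with tune().
library(tidymodels)

tree_model <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

rf_model <- rand_forest(mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# show_engines() lists the available engines for a given model function
show_engines("rand_forest")
```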

next steps

It may seem like a lot of functions to learn and memorize, but in the end it all falls into place quickly because the sequence is very logical. I’m now rather convinced of the framework’s convenience: it provides a uniform interface to feed parameters to the models and to extract metrics and predictions. It also makes it very easy to test different models and parameters on the same dataset by simply modifying, copying or updating model objects, as sketched below.
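
For instance, a small sketch (reusing lm_wflow and ames_train from the pipeline example, with update_model() from {workflows}) of swapping the model while keeping the recipe:

```r
# Swap the linear model for a random forest inside the existing workflow;
# the recipe and the rest of the pipeline stay unchanged.
rf_wflow <- lm_wflow %>%
  update_model(rand_forest() %>%
                 set_engine("ranger") %>%
                 set_mode("regression"))

rf_fit <- fit(rf_wflow, data = ames_train)
```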

In coming articles I will explore some of the case studies presented and concrete applications of these packages and functions, discussing potential applications in product development and manufacturing.

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer. https://www.statlearning.com.