# Machine Learning with R-tidymodels: recipe optimization

machine learning
Published

June 26, 2022

## Introduction

This article investigates how to optimize a Concrete recipe to obtain maximal compressive strength. It provides an opportunity to test statistical learning techniques for modeling the relation between the output (Compression Strength) and various inputs (e.g. Cement, Water) and to assess the feasibility of a simple search approach for finding an optimal result.

The language used is R and several well tested libraries for data science and modelling (tidyverse and tidymodels).

This article is based on the case study Compressive Strength of Concrete from the book Applied Predictive Modeling by Max Kuhn (2021). The data for this analysis is obtained for the companion R package of the book.

## Methods

### Setup

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
Registered S3 method overwritten by 'tune':
method                   from
required_pkgs.model_spec parsnip
── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
✔ broom        0.8.0      ✔ rsample      0.1.0
✔ dials        0.0.10     ✔ tune         0.1.6
✔ infer        1.0.0      ✔ workflows    0.2.4
✔ modeldata    0.1.1      ✔ workflowsets 0.1.0
✔ parsnip      0.1.7      ✔ yardstick    0.0.8
✔ recipes      0.1.17
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(vip)

Attaching package: 'vip'

The following object is masked from 'package:utils':

vi
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg   ggplot2
tidymodels_prefer()
doParallel::registerDoParallel()

### Data

Our dataset has the name concrete and we made a local copy that we now load. The outputs below provides important information on the dataset such as its size and also some statistics on the variables distributions.

#### Summary

concrete <- read_csv("data/concrete.csv")
Rows: 1030 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): cement, blast_furnace_slag, fly_ash, water, superplasticizer, coars...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
summary(concrete)
     cement      blast_furnace_slag    fly_ash           water
Min.   :102.0   Min.   :  0.0      Min.   :  0.00   Min.   :121.8
1st Qu.:192.4   1st Qu.:  0.0      1st Qu.:  0.00   1st Qu.:164.9
Median :272.9   Median : 22.0      Median :  0.00   Median :185.0
Mean   :281.2   Mean   : 73.9      Mean   : 54.19   Mean   :181.6
3rd Qu.:350.0   3rd Qu.:142.9      3rd Qu.:118.30   3rd Qu.:192.0
Max.   :540.0   Max.   :359.4      Max.   :200.10   Max.   :247.0
superplasticizer coarse_aggregate fine_aggregate       age
Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00
1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00
Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00
Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66
3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00
Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00
compressive_strength
Min.   : 2.33
1st Qu.:23.71
Median :34.45
Mean   :35.82
3rd Qu.:46.13
Max.   :82.60       

A first step in our data treatment is to calculate the proportion of each recipe component from the total weight:

concrete_mixture <- concrete |>
mutate(total_weight = water + cement + blast_furnace_slag + fly_ash + superplasticizer + coarse_aggregate + fine_aggregate) |>
mutate(water_prop = water / total_weight,
cement_prop = cement / total_weight,
blast_prop = blast_furnace_slag / total_weight,
fly_prop = fly_ash / total_weight,
superplast_prop = superplasticizer / total_weight,
coarse_agg_prop = coarse_aggregate / total_weight,
fine_agg_prop = fine_aggregate / total_weight)

A check can be done to confirm that ingredients always add up to one:

concrete_mixture <- concrete_mixture |>
mutate(prop_control = water_prop + cement_prop + blast_prop + fly_prop + superplast_prop + coarse_agg_prop + fine_agg_prop)

prop_control <- unique(concrete_mixture$prop_control) |> round(8) prop_control [1] 1 1 1 1 From the book also we learn that the input factor Age has a strong influence on the outcome and we assume that it has a strong impact on cost. This is also the age subgroup with the highest number of datapoints. As in the text for illustrative purposes, we’re now fixing this input factor in our study at 28 days and filtering the data accordingly: concrete_mixture_28 <- concrete_mixture %>% filter(age == 28) #### Correlation plot We can now have a look at the input variables, their distributions and their relationship with each other: ggpairs(concrete_mixture_28[,12:17]) We can observe that the correlation is low to moderate in most cases. The distribution is useful to get some expert insight into the proportions and will help later in interpreting the results from the optimization. ### Modeling #### Sample As explained in the previous articles on modeling with tidymodels, we here use functions from the sample package to split the dataset into train and test datasets. set.seed(123) concrete_split <- initial_split(concrete_mixture_28, prop = 0.8) concrete_train <- training(concrete_split) concrete_test <- testing(concrete_split) #### Recipe In our recipe we’re adding all Cement recipe components but not the age as decided before: tree_recipe <- recipe( formula = compressive_strength ~ cement_prop + blast_prop + fly_prop + superplast_prop + coarse_agg_prop + fine_agg_prop, concrete_train) The next steps follow also the tidymodels approach. #### Model model_rf <- rand_forest() |> set_engine('ranger', importance = "impurity") |> set_mode('regression') #### Workflow rf_workflow <- workflow() |> add_recipe(tree_recipe) |> add_model(model_rf) #### Fit fit_rf <- rf_workflow |> fit(data = concrete_train) fit_rf_test <- rf_workflow |> fit(data = concrete_test) #### Predict predict_rf <- fit_rf |> predict(new_data = concrete_train) |> bind_cols(concrete_train) predict_rf_test <- fit_rf |> predict(new_data = concrete_test) |> bind_cols(concrete_test) #### Metrics metrics_model_rf <- predict_rf |> metrics(truth = compressive_strength, estimate = .pred) metrics_model_rf_test <- predict_rf_test |> metrics(truth = compressive_strength, estimate = .pred) We can here check our model performance on the train dataset: metrics_model_rf # A tibble: 3 × 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 rmse standard 3.34 2 rsq standard 0.957 3 mae standard 2.12  The fit for the train dataset is extremely high with a value above 0.90 and the Mean Absolute Error is very low, slightly above 2. We can see in the plot below, real vs predict values for the train dataset: specially until values of 60, points are well on top of the plot diagonal. #### Fit plot predict_rf |> ggplot(aes(x = .pred, y = compressive_strength)) + geom_abline(lty = 2) + geom_point(alpha = 0.5) + coord_obs_pred(ratio = 1) + # Scale and size the x- and y-axis uniformly: theme_light() For the test dataset the RSQ is still above 0.70. We can now we expect this from new real data. metrics_model_rf_test # A tibble: 3 × 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 rmse standard 8.20 2 rsq standard 0.740 3 mae standard 5.99  #### Variable importance Now that we have a model, we’re listing the ranking of explanatory variables: fit_rf |> extract_fit_parsnip() |> vip::vi(num_features = 20) # A tibble: 6 × 2 Variable Importance <chr> <dbl> 1 cement_prop 26236. 2 coarse_agg_prop 11423. 3 fine_agg_prop 8929. 4 blast_prop 8019. 5 fly_prop 7822. 6 superplast_prop 7328. ### Optimizing To find a recipe with maximal compressive strength we’re going to prepare a table with possible combinations of the input factors exploring a space slightly beyond the original dataset. #### Search space We start by seeing the statistics of components proportions (as reminder the initial statistics presented were on absolute values). summary(concrete_mixture_28[11:18])  water_prop cement_prop blast_prop fly_prop Min. :0.05139 Min. :0.04482 Min. :0.00000 Min. :0.00000 1st Qu.:0.07340 1st Qu.:0.07080 1st Qu.:0.00000 1st Qu.:0.00000 Median :0.07894 Median :0.11456 Median :0.04049 Median :0.02479 Mean :0.07895 Mean :0.11362 Mean :0.03724 Mean :0.02728 3rd Qu.:0.08450 3rd Qu.:0.14010 3rd Qu.:0.06873 3rd Qu.:0.05121 Max. :0.11222 Max. :0.22541 Max. :0.15034 Max. :0.08884 superplast_prop coarse_agg_prop fine_agg_prop prop_control Min. :0.000000 Min. :0.3459 Min. :0.2480 Min. :1 1st Qu.:0.000000 1st Qu.:0.3875 1st Qu.:0.3092 1st Qu.:1 Median :0.003302 Median :0.4132 Median :0.3297 Median :1 Mean :0.003007 Mean :0.4111 Mean :0.3288 Mean :1 3rd Qu.:0.004486 3rd Qu.:0.4354 3rd Qu.:0.3503 3rd Qu.:1 Max. :0.013149 Max. :0.4798 Max. :0.4141 Max. :1  These statistics give us insight on which ranges to vary each component: concrete_new <- expand.grid( cement_prop = seq(0.04, 0.4, 0.02), superplast_prop = seq(0, 0.02, 0.01), fine_agg_prop = seq(0.1, 0.5, 0.1), coarse_agg_prop = seq(0.2, 1, 0.1), blast_prop = seq(0, 0.3, 0.05), fly_prop = seq(0, 1, 0.1)) concrete_new$age <- 28

The expand.grid function is naturally going to create combinations for which the sum of the proportions is greater than 1. From the summary statistics we could see that the water proportion can go up to 0.11 . As the water is added when all other ingredients are fixed we can use this fact to remove all recipes that don’t leave room to add water.

We add a variable to calculate the sum of the proportions:

concrete_new <- concrete_new |>
mutate(total_prop = cement_prop + superplast_prop + fine_agg_prop + coarse_agg_prop + blast_prop + fly_prop)

and we retain only combinations that are lower that 1 - 0.15 in order to leave 15% for water (a bit more than the current max of water):

concrete_optim <- concrete_new %>%
filter(total_prop < 1 - 0.15)

#### Predict new

Now we feed the Search space dataset to our predictive function and calculate Compressive strengths for these potential recipes.

predict_rf_new <-
fit_rf |>
predict(new_data = concrete_optim) |>
bind_cols(concrete_optim)

## Results

### Find optimal

Our optimal recipe with this approach is the one with the highest prediction which we obtain by sorting the prediction output table:

predict_rf_new |>
arrange(desc(.pred)) |>
select(.pred, cement_prop, coarse_agg_prop, fine_agg_prop, blast_prop, fly_prop, superplast_prop) |>
head(5)
# A tibble: 5 × 7
.pred cement_prop coarse_agg_prop fine_agg_prop blast_prop fly_prop
<dbl>       <dbl>           <dbl>         <dbl>      <dbl>    <dbl>
1  63.8        0.18             0.2           0.3       0.15        0
2  63.7        0.16             0.2           0.3       0.15        0
3  63.7        0.16             0.2           0.3       0.15        0
4  63.7        0.18             0.2           0.3       0.1         0
5  63.6        0.18             0.2           0.3       0.1         0
# … with 1 more variable: superplast_prop <dbl>

Clearly the first values correspond to to an increase in the cement proportion which is the factor that has the higher variable importance. Interestingly, comparing with the original dataset we see we had recipes with higher compressive strength:

concrete_mixture_28 |>
arrange(desc(compressive_strength)) |>
select(compressive_strength, cement_prop, coarse_agg_prop, fine_agg_prop, blast_prop, fly_prop, superplast_prop) |>
head(5)
# A tibble: 5 × 7
compressive_str… cement_prop coarse_agg_prop fine_agg_prop blast_prop fly_prop
<dbl>       <dbl>           <dbl>         <dbl>      <dbl>    <dbl>
1             81.8       0.127           0.456         0.301     0.0553   0
2             80.0       0.223           0.430         0.279     0        0
3             78.8       0.188           0.429         0.310     0        0
4             76.2       0.117           0.354         0.327     0.0768   0.0512
5             75.0       0.212           0.364         0.364     0        0
# … with 1 more variable: superplast_prop <dbl>

Looking back at the Fit Plot we see that above 60 the model is under predicting the values. Now this opens for the important discussion between using a model and interpreting it instead of just doing random trials and considering that the input factors can be generalized.

In the next plots we look at how the main influencing factor relates with the Compressive Strength both in reality and in the model.

### Plot Strenght ~ Cement

predict_rf_new |>
ggplot(aes(x = as_factor(cement_prop), y = .pred)) +
geom_boxplot() +
theme_light()

The previous plot shows that according to the model it is enough to use 0.18 for cement to maximize strength. We also see visually that the variability of this value is approximately $$\pm$$ 5 which is close to the metrics calculated on the test dataset.

concrete_mixture_28 |>
ggplot(aes(x = cement_prop, y = compressive_strength)) +
geom_point()+
theme_light()

The previous plot shows that on the original dataset increasing cement proportion leads always to an increase in strength but the variability is much higher (visually around $$\pm$$ 10).

A similar analysis for the remaining input factors would certainly provide other valuable insights in the selection of the optimal recipe.

## Conclusions

An optimal recipe that maximizes compression strength has been identified. This recipe corresponds to a maximization of the cement which is the input variable with the highest importance. Interestingly some recipes from the original dataset have an even higher compressive strength but the variability of the results in the compressive strength is much higher thus less reliable.

In these cases the next is step is naturally to confirm the predictions with real experiments to confirm the output and variability, finetune the model and eventually get to an improved recipe.

## References

Max Kuhn, Kjell Johnson. 2021. Applied Predictive Modeling. 2nd ed. Springer. appliedpredictivemodeling.com.