This article investigates how to optimize a concrete recipe to obtain maximal compressive strength. It provides an opportunity to test statistical learning techniques for modeling the relationship between the output (compressive strength) and various inputs (e.g. cement, water) and to assess the feasibility of a simple search approach for finding an optimal result.

The analysis is done in R with several well-tested libraries for data science and modeling (tidyverse and tidymodels).

This article is based on the case study Compressive Strength of Concrete from the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. The data for this analysis is obtained from the companion R package of the book.

Our dataset is named concrete and we load it from a local copy. The output below provides important information on the dataset, such as its size, as well as some statistics on the variables' distributions.

Summary

library(tidyverse)

concrete <- read_csv("data/concrete.csv")

Rows: 1030 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): cement, blast_furnace_slag, fly_ash, water, superplasticizer, coars...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(concrete)

cement blast_furnace_slag fly_ash water
Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.30 3rd Qu.:192.0
Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
superplasticizer coarse_aggregate fine_aggregate age
Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
Median : 6.400 Median : 968.0 Median :779.5 Median : 28.00
Mean : 6.205 Mean : 972.9 Mean :773.6 Mean : 45.66
3rd Qu.:10.200 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
compressive_strength
Min. : 2.33
1st Qu.:23.71
Median :34.45
Mean :35.82
3rd Qu.:46.13
Max. :82.60

A first step in our data treatment is to calculate the proportion of each recipe component from the total weight:
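A minimal sketch of this step is shown below, using the column names from the summary above; the intermediate name concrete_mixture and the *_prop / prop_control naming are assumptions chosen to match the proportion summary shown later in the article.

```r
# Sketch: derive each component's proportion of the total mixture weight.
# concrete_mixture is an assumed name for the resulting dataset.
concrete_mixture <- concrete |>
  mutate(
    total = cement + blast_furnace_slag + fly_ash + water +
      superplasticizer + coarse_aggregate + fine_aggregate,
    cement_prop     = cement / total,
    blast_prop      = blast_furnace_slag / total,
    fly_prop        = fly_ash / total,
    water_prop      = water / total,
    superplast_prop = superplasticizer / total,
    coarse_agg_prop = coarse_aggregate / total,
    fine_agg_prop   = fine_aggregate / total,
    # sanity check: the proportions of all components must sum to 1
    prop_control    = cement_prop + blast_prop + fly_prop + water_prop +
      superplast_prop + coarse_agg_prop + fine_agg_prop
  )
```

The prop_control column should equal 1 for every row, which matches the proportion summary reported further down.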

From the book we also learn that the input factor age has a strong influence on the outcome, and we assume that it also has a strong impact on cost. An age of 28 days is the subgroup with the highest number of datapoints. As in the book, for illustrative purposes, we now fix this input factor at 28 days and filter the data accordingly:
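The filtering step can be sketched as follows; concrete_mixture is an assumed name for the dataset with the component proportions already computed.

```r
# Fix the age input factor at 28 days, the subgroup with the most datapoints
concrete_mixture_28 <- concrete_mixture |>
  filter(age == 28)
```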

We can now have a look at the input variables, their distributions and their relationship with each other:

library(GGally)

ggpairs(concrete_mixture_28[, 12:17])

We can observe that the correlations are low to moderate in most cases. The distributions are useful to get some expert insight into the proportions and will help later in interpreting the results of the optimization.

Modeling

Sample

As explained in the previous articles on modeling with tidymodels, we use functions from the rsample package here to split the dataset into train and test datasets.
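A minimal sketch of the split, assuming a standard 3/4 train proportion and the object names used below (the seed and proportion are assumptions, not the author's exact choices):

```r
library(tidymodels)

# Reproducible train/test split with rsample
set.seed(123)
concrete_split <- initial_split(concrete_mixture_28, prop = 0.75)
concrete_train <- training(concrete_split)
concrete_test  <- testing(concrete_split)
```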

We can now check our model performance on the train dataset:
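A sketch of how the random forest fit and the train metrics below could be produced; the ranger engine, the formula with all predictors, and the object names are assumptions in line with the tidymodels workflow.

```r
# Random forest regression; importance is enabled for the later
# variable importance ranking (an assumption about the setup)
model_rf <- rand_forest(mode = "regression") |>
  set_engine("ranger", importance = "impurity") |>
  fit(compressive_strength ~ ., data = concrete_train)

# Predictions on the train set, bound to the observed values
predict_rf <- predict(model_rf, concrete_train) |>
  bind_cols(concrete_train)

# rmse, rsq and mae via yardstick
metrics_model_rf <- predict_rf |>
  metrics(truth = compressive_strength, estimate = .pred)
```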

metrics_model_rf

# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.34
2 rsq standard 0.957
3 mae standard 2.12

The fit for the train dataset is extremely good, with an R-squared above 0.95, and the Mean Absolute Error is very low, slightly above 2. In the plot below of observed vs. predicted values for the train dataset we can see that, especially up to values of 60, the points sit well on the plot diagonal.

Fit plot

predict_rf |>
  ggplot(aes(x = .pred, y = compressive_strength)) +
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  coord_obs_pred(ratio = 1) + # scale and size the x- and y-axis uniformly
  theme_light()

For the test dataset the R-squared is still above 0.70, which is closer to what we can expect on new, real data.

metrics_model_rf_test

# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 8.20
2 rsq standard 0.740
3 mae standard 5.99

Variable importance

Now that we have a model, we can rank the explanatory variables by importance:
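One common way to obtain this ranking is the vip package, sketched below; this assumes the random forest was fitted with importance enabled (e.g. importance = "impurity" for the ranger engine), which may differ from the author's exact setup.

```r
library(vip)

# Plot variable importance from the underlying ranger fit
model_rf |>
  extract_fit_engine() |>
  vip()
```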

To find a recipe with maximal compressive strength, we prepare a table of possible combinations of the input factors, exploring a space slightly beyond the original dataset.

Search space

We start by looking at the statistics of the component proportions (as a reminder, the initial statistics presented were on absolute values).

summary(concrete_mixture_28[11:18])

water_prop cement_prop blast_prop fly_prop
Min. :0.05139 Min. :0.04482 Min. :0.00000 Min. :0.00000
1st Qu.:0.07340 1st Qu.:0.07080 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.07894 Median :0.11456 Median :0.04049 Median :0.02479
Mean :0.07895 Mean :0.11362 Mean :0.03724 Mean :0.02728
3rd Qu.:0.08450 3rd Qu.:0.14010 3rd Qu.:0.06873 3rd Qu.:0.05121
Max. :0.11222 Max. :0.22541 Max. :0.15034 Max. :0.08884
superplast_prop coarse_agg_prop fine_agg_prop prop_control
Min. :0.000000 Min. :0.3459 Min. :0.2480 Min. :1
1st Qu.:0.000000 1st Qu.:0.3875 1st Qu.:0.3092 1st Qu.:1
Median :0.003302 Median :0.4132 Median :0.3297 Median :1
Mean :0.003007 Mean :0.4111 Mean :0.3288 Mean :1
3rd Qu.:0.004486 3rd Qu.:0.4354 3rd Qu.:0.3503 3rd Qu.:1
Max. :0.013149 Max. :0.4798 Max. :0.4141 Max. :1

These statistics give us insight into the ranges over which to vary each component:

The expand.grid function will naturally create combinations for which the sum of the proportions is greater than 1. From the summary statistics we saw that the water proportion can go up to about 0.11. As water is added once all the other ingredients are fixed, we can use this fact to remove all recipes that don't leave room to add water.

We add a variable to calculate the sum of the proportions:
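A sketch of the grid construction and filtering described above; the step sizes and ranges are assumptions guided by the proportion summaries, not the author's exact values.

```r
# Candidate recipes: all combinations of the non-water components,
# spanning ranges slightly beyond the original dataset
recipes_grid <- expand.grid(
  cement_prop     = seq(0.10, 0.25, by = 0.03),
  blast_prop      = seq(0.00, 0.15, by = 0.05),
  fly_prop        = seq(0.00, 0.10, by = 0.05),
  superplast_prop = seq(0.000, 0.010, by = 0.005),
  coarse_agg_prop = seq(0.35, 0.47, by = 0.04),
  fine_agg_prop   = seq(0.25, 0.40, by = 0.05)
) |>
  mutate(sum_prop = cement_prop + blast_prop + fly_prop +
           superplast_prop + coarse_agg_prop + fine_agg_prop) |>
  # keep only recipes that leave room for at least ~5% water
  filter(sum_prop <= 0.95) |>
  # water fills the remainder of the mixture
  mutate(water_prop = 1 - sum_prop)
```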

Clearly the first values correspond to an increase in the cement proportion, which is the factor with the highest variable importance. Interestingly, comparing with the original dataset, we see that we already had recipes with higher compressive strength:

Looking back at the fit plot, we see that above 60 the model is underpredicting the values. This opens up an important discussion about using and interpreting a model, instead of just doing random trials, and about whether the input factors can be generalized.

In the next plots we look at how the main influencing factor relates with the Compressive Strength both in reality and in the model.

Plot Strength ~ Cement

predict_rf_new |>
  ggplot(aes(x = as_factor(cement_prop), y = .pred)) +
  geom_boxplot() +
  theme_light()

The previous plot shows that, according to the model, a cement proportion of 0.18 is enough to maximize strength. We also see visually that the variability of this value is approximately \(\pm\) 5, which is close to the metrics calculated on the test dataset.

concrete_mixture_28 |>
  ggplot(aes(x = cement_prop, y = compressive_strength)) +
  geom_point() +
  theme_light()

The previous plot shows that in the original dataset increasing the cement proportion always leads to an increase in strength, but the variability is much higher (visually around \(\pm\) 10).

A similar analysis for the remaining input factors would certainly provide other valuable insights for the selection of the optimal recipe.

Conclusions

An optimal recipe that maximizes compressive strength has been identified. This recipe corresponds to maximizing the cement proportion, which is the input variable with the highest importance. Interestingly, some recipes from the original dataset have an even higher compressive strength, but the variability of their results is much higher and they are thus less reliable.

In such cases the next step is naturally to confirm the predictions with real experiments, validating the output and its variability, fine-tuning the model, and eventually arriving at an improved recipe.