ML with Scikit Learn: decision tree regression

machine learning
Published: January 16, 2023

Introduction

In this post a decision tree regressor from scikit-learn is used to predict glucose from the other variables in the diabetes dataset, and its test-set performance is compared with the linear, lasso and ridge models.

Setup

library(reticulate)
library(here)
here() starts at /home/joao/JR-IA
py_discover_config()
python:         /home/joao/JR-IA/renv/python/condaenvs/renv-python/bin/python
libpython:      /home/joao/JR-IA/renv/python/condaenvs/renv-python/lib/libpython3.7m.so
pythonhome:     /home/joao/JR-IA/renv/python/condaenvs/renv-python:/home/joao/JR-IA/renv/python/condaenvs/renv-python
version:        3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0]
numpy:          /home/joao/JR-IA/renv/python/condaenvs/renv-python/lib/python3.7/site-packages/numpy
numpy_version:  1.21.6

NOTE: Python version was forced by RETICULATE_PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Data

diabetes_df = pd.read_csv("data/diabetes_clean.csv")
print(diabetes_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
print(diabetes_df.head())
   pregnancies  glucose  diastolic  triceps  ...   bmi    dpf  age  diabetes
0            6      148         72       35  ...  33.6  0.627   50         1
1            1       85         66       29  ...  26.6  0.351   31         0
2            8      183         64        0  ...  23.3  0.672   32         1
3            1       89         66       23  ...  28.1  0.167   21         0
4            0      137         40       35  ...  43.1  2.288   33         1

[5 rows x 9 columns]
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
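
The split holds out 30% of the rows for testing; random_state=42 makes it reproducible. A quick optional check, not part of the original post, confirms the resulting sizes:

# Optional sanity check: 768 rows split 70/30 into 537 train and 231 test.
print(X_train.shape, X_test.shape)   # expected: (537, 8) (231, 8)
print(y_train.shape, y_test.shape)   # expected: (537,) (231,)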

Model

# Regression tree: at most 8 levels deep; each leaf must contain at least
# 5% of the training samples (min_samples_leaf given as a fraction).
dt = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.05, random_state=3)
dt.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.05, random_state=3)
y_pred = dt.predict(X_test)
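
A quick way to see which predictors the fitted tree actually uses is its feature_importances_ attribute. This check is not part of the original post, but it relies only on objects already defined above:

# Impurity-based importances of the fitted tree; column order matches
# diabetes_df after dropping "glucose".
feature_names = diabetes_df.drop("glucose", axis=1).columns
importances = pd.Series(dt.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).round(3))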

Metrics

dt_r2 = dt.score(X_test, y_test).round(2)
dt_rmse = mean_squared_error(y_test, y_pred, squared=False)
print("R2:", dt_r2, ", RMSE:", dt_rmse)
R2: 0.25 , RMSE: 26.8924579416571
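
These numbers come from a single hold-out split. A hedged extension, assuming a scikit-learn version that provides the "neg_root_mean_squared_error" scorer, is to cross-validate the RMSE of the same tree specification on the training data:

# Sketch only: 5-fold cross-validated RMSE of the tree defined above.
from sklearn.model_selection import cross_val_score
cv_rmse = -cross_val_score(dt, X_train, y_train, cv=5,
                           scoring="neg_root_mean_squared_error")
print(cv_rmse.round(1), "mean:", cv_rmse.mean().round(1))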

Fit plot

print(y_test.shape, y_pred.shape)
(231,) (231,)
dict_lmplot = {'y_test' : y_test, 'y_pred' : y_pred}
data_lmplot = pd.DataFrame(data=dict_lmplot)
plt.clf()
sns.set_palette("flare")
g = sns.lmplot(x = 'y_test', y = 'y_pred', data = data_lmplot)
g.set(xlabel="Glucose (real)", ylabel="Glucose (predicted)")
g.fig.suptitle("Fit plot of glucose (decision tree)")
plt.tight_layout()
plt.show()

plt.savefig("posts/20230116/glucose_fit_dt.png")
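
The lmplot draws a regression line through the predicted-versus-observed points. A hedged variant, not in the original post, adds the identity line y = x, which makes over- and under-prediction easier to read off:

# Sketch: same scatter with a perfect-prediction reference line.
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.6)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, linestyle="--", color="grey")  # y = x reference
ax.set(xlabel="Glucose (real)", ylabel="Glucose (predicted)",
       title="Fit plot of glucose (decision tree)")
plt.show()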

Compare models


Model   R2     RMSE
lm      0.28   26.3
lasso   0.29   26.3
ridge   0.30   25.8
tree    0.25   26.9
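
The tree comes last on both metrics, and ridge regression remains the best of the four models on this test set. As a hypothetical sketch, the table could also be built programmatically; lm, lasso and ridge are assumed here to be fitted scikit-learn regressors that are not defined in this post:

# Hypothetical: collect R2 and RMSE for each fitted model on the same test set.
models = {"lm": lm, "lasso": lasso, "ridge": ridge, "tree": dt}
rows = []
for name, model in models.items():
    pred = model.predict(X_test)
    rows.append({"Model": name,
                 "R2": round(model.score(X_test, y_test), 2),
                 "RMSE": round(mean_squared_error(y_test, pred, squared=False), 1)})
print(pd.DataFrame(rows))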