Machine Learning Workflow: Random Forest for Classification (Heart Data)

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

March 10, 2023

Display system information for reproducibility.

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.1 compiler_4.2.2    fastmap_1.1.0     cli_3.6.0        
 [5] tools_4.2.2       htmltools_0.5.4   rstudioapi_0.14   yaml_2.3.7       
 [9] rmarkdown_2.20    knitr_1.42        xfun_0.37         digest_0.6.31    
[13] jsonlite_1.8.4    rlang_1.0.6       evaluate_0.20

import IPython
print(IPython.sys_info())

{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}

1 Overview

We illustrate the typical machine learning workflow for random forest using the Heart data set. The outcome is AHD (Yes or No).

1. Initial splitting to test and non-test sets.
2. Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, …
3. Tune the random forest using 5-fold cross-validation (CV) on the non-test data.
4. Choose the best model by CV and refit it on the whole non-test data.
5. Final classification on the test data.

2 Heart data

The goal is to predict the binary outcome AHD (Yes or No) of patients.

# Load libraries
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("Heart.csv") %>% 
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with value 0, 1, 2
  mutate(RestECG = as.factor(RestECG)) %>%
  print(width = Inf)

# A tibble: 303 × 14
     Age   Sex ChestPain    RestBP  Chol   Fbs RestECG MaxHR ExAng Oldpeak Slope
   <dbl> <dbl> <chr>         <dbl> <dbl> <dbl> <fct>   <dbl> <dbl>   <dbl> <dbl>
 1    63     1 typical         145   233     1 2         150     0     2.3     3
 2    67     1 asymptomatic    160   286     0 2         108     1     1.5     2
 3    67     1 asymptomatic    120   229     0 2         129     1     2.6     2
 4    37     1 nonanginal      130   250     0 0         187     0     3.5     3
 5    41     0 nontypical      130   204     0 2         172     0     1.4     1
 6    56     1 nontypical      120   236     0 0         178     0     0.8     1
 7    62     0 asymptomatic    140   268     0 2         160     0     3.6     3
 8    57     0 asymptomatic    120   354     0 0         163     1     0.6     1
 9    63     1 asymptomatic    130   254     0 2         147     0     1.4     2
10    53     1 asymptomatic    140   203     1 2         155     1     3.1     3
      Ca Thal       AHD  
   <dbl> <chr>      <chr>
 1     0 fixed      No   
 2     3 normal     Yes  
 3     2 reversable Yes  
 4     0 normal     No   
 5     0 normal     No   
 6     0 normal     No   
 7     2 normal     Yes  
 8     0 normal     No   
 9     1 reversable Yes  
10     0 reversable Yes  
# … with 293 more rows

# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)

Characteristic	No, N = 164¹	Yes, N = 139¹
Age	52 (45, 59)	58 (52, 62)
Sex	92 (56%)	114 (82%)
ChestPain
asymptomatic	39 (24%)	105 (76%)
nonanginal	68 (41%)	18 (13%)
nontypical	41 (25%)	9 (6.5%)
typical	16 (9.8%)	7 (5.0%)
RestBP	130 (120, 140)	130 (120, 145)
Chol	234 (209, 267)	249 (218, 284)
Fbs	23 (14%)	22 (16%)
RestECG
0	95 (58%)	56 (40%)
1	1 (0.6%)	3 (2.2%)
2	68 (41%)	80 (58%)
MaxHR	161 (149, 172)	142 (125, 156)
ExAng	23 (14%)	76 (55%)
Oldpeak	0.20 (0.00, 1.03)	1.40 (0.55, 2.50)
Slope
1	106 (65%)	36 (26%)
2	49 (30%)	91 (65%)
3	9 (5.5%)	12 (8.6%)
Ca
0	130 (81%)	46 (33%)
1	21 (13%)	44 (32%)
2	7 (4.3%)	31 (22%)
3	3 (1.9%)	17 (12%)
Unknown	3	1
Thal
fixed	6 (3.7%)	12 (8.7%)
normal	129 (79%)	37 (27%)
reversable	28 (17%)	89 (64%)
Unknown	1	1
¹ Median (IQR); n (%)

# Graphical summary:
# Heart %>% ggpairs()

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Heart = pd.read_csv("Heart.csv")
Heart

     Unnamed: 0  Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  \
0             1   63    1       typical     145   233    1        2    150   
1             2   67    1  asymptomatic     160   286    0        2    108   
2             3   67    1  asymptomatic     120   229    0        2    129   
3             4   37    1    nonanginal     130   250    0        0    187   
4             5   41    0    nontypical     130   204    0        2    172   
..          ...  ...  ...           ...     ...   ...  ...      ...    ...   
298         299   45    1       typical     110   264    0        0    132   
299         300   68    1  asymptomatic     144   193    1        0    141   
300         301   57    1  asymptomatic     130   131    0        0    115   
301         302   57    0    nontypical     130   236    0        2    174   
302         303   38    1    nonanginal     138   175    0        0    173   

     ExAng  Oldpeak  Slope   Ca        Thal  AHD  
0        0      2.3      3  0.0       fixed   No  
1        1      1.5      2  3.0      normal  Yes  
2        1      2.6      2  2.0  reversable  Yes  
3        0      3.5      3  0.0      normal   No  
4        0      1.4      1  0.0      normal   No  
..     ...      ...    ...  ...         ...  ...  
298      0      1.2      2  0.0  reversable  Yes  
299      0      3.4      2  2.0  reversable  Yes  
300      1      1.2      2  1.0  reversable  Yes  
301      0      0.0      2  1.0      normal  Yes  
302      0      0.0      1  NaN      normal   No  

[303 rows x 15 columns]

# Numerical summaries
Heart.describe(include = 'all')

        Unnamed: 0         Age         Sex     ChestPain      RestBP  \
count   303.000000  303.000000  303.000000           303  303.000000   
unique         NaN         NaN         NaN             4         NaN   
top            NaN         NaN         NaN  asymptomatic         NaN   
freq           NaN         NaN         NaN           144         NaN   
mean    152.000000   54.438944    0.679868           NaN  131.689769   
std      87.612784    9.038662    0.467299           NaN   17.599748   
min       1.000000   29.000000    0.000000           NaN   94.000000   
25%      76.500000   48.000000    0.000000           NaN  120.000000   
50%     152.000000   56.000000    1.000000           NaN  130.000000   
75%     227.500000   61.000000    1.000000           NaN  140.000000   
max     303.000000   77.000000    1.000000           NaN  200.000000   

              Chol         Fbs     RestECG       MaxHR       ExAng  \
count   303.000000  303.000000  303.000000  303.000000  303.000000   
unique         NaN         NaN         NaN         NaN         NaN   
top            NaN         NaN         NaN         NaN         NaN   
freq           NaN         NaN         NaN         NaN         NaN   
mean    246.693069    0.148515    0.990099  149.607261    0.326733   
std      51.776918    0.356198    0.994971   22.875003    0.469794   
min     126.000000    0.000000    0.000000   71.000000    0.000000   
25%     211.000000    0.000000    0.000000  133.500000    0.000000   
50%     241.000000    0.000000    1.000000  153.000000    0.000000   
75%     275.000000    0.000000    2.000000  166.000000    1.000000   
max     564.000000    1.000000    2.000000  202.000000    1.000000   

           Oldpeak       Slope          Ca    Thal  AHD  
count   303.000000  303.000000  299.000000     301  303  
unique         NaN         NaN         NaN       3    2  
top            NaN         NaN         NaN  normal   No  
freq           NaN         NaN         NaN     166  164  
mean      1.039604    1.600660    0.672241     NaN  NaN  
std       1.161075    0.616226    0.937438     NaN  NaN  
min       0.000000    1.000000    0.000000     NaN  NaN  
25%       0.000000    1.000000    0.000000     NaN  NaN  
50%       0.800000    2.000000    0.000000     NaN  NaN  
75%       1.600000    2.000000    1.000000     NaN  NaN  
max       6.200000    3.000000    3.000000     NaN  NaN

Graphical summary:

# Graphical summaries
# pairplot() creates its own figure, so no plt.figure() call is needed
sns.pairplot(data = Heart);
plt.show()

3 Initial split into test and non-test sets

We randomly split the data into 75% non-test data and 25% test data, stratifying on the outcome AHD so that both sets keep roughly the same proportion of Yes and No.

# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart, 
  # stratify by AHD
  strata = "AHD", 
  prop = 0.75
  )
data_split

<Training/Testing/Total>
<227/76/303>

Heart_other <- training(data_split)
dim(Heart_other)

[1] 227  14

Heart_test <- testing(data_split)
dim(Heart_test)

[1] 76 14

from sklearn.model_selection import train_test_split

Heart_other, Heart_test = train_test_split(
  Heart, 
  train_size = 0.75,
  random_state = 425, # seed
  stratify = Heart.AHD
  )
Heart_test.shape

(76, 15)

Heart_other.shape

(227, 15)
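To see what stratification does, here is a minimal sketch on a synthetic stand-in for the outcome column (164 No and 139 Yes, matching the counts in the summary table; the column `x` and the names `toy`, `toy_other`, `toy_test` are hypothetical). It checks that the share of Yes is nearly the same in the full data, the non-test set, and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Heart outcome: 164 "No" and 139 "Yes" (303 rows)
toy = pd.DataFrame({
    "x": np.arange(303),
    "AHD": ["No"] * 164 + ["Yes"] * 139,
})

toy_other, toy_test = train_test_split(
    toy,
    train_size=0.75,
    random_state=425,
    stratify=toy["AHD"],
)

# Proportion of "Yes" overall, in the non-test set, and in the test set
p_all = (toy["AHD"] == "Yes").mean()
p_other = (toy_other["AHD"] == "Yes").mean()
p_test = (toy_test["AHD"] == "Yes").mean()
print(len(toy_other), len(toy_test))
print(round(p_all, 3), round(p_other, 3), round(p_test, 3))
```

Without `stratify`, the class shares in a test set of only 76 rows can drift noticeably from the overall share just by chance.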

Separate \(X\) and \(y\). We will use 13 features.

num_features = ['Age', 'Sex', 'RestBP', 'Chol', 'Fbs', 'RestECG', 'MaxHR', 'ExAng', 'Oldpeak', 'Slope', 'Ca']
cat_features = ['ChestPain', 'Thal']
features = np.concatenate([num_features, cat_features])
# Non-test X and y
X_other = Heart_other[features]
y_other = Heart_other.AHD
# Test X and y
X_test = Heart_test[features]
y_test = Heart_test.AHD

4 Recipe (R) and Preprocessing (Python)

A rough data dictionary is available at https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/.
We have the following features:
- Numerical features: Age, RestBP, Chol, Slope (1, 2 or 3), MaxHR, ExAng, Oldpeak, Ca (0, 1, 2 or 3).
- Categorical features coded as integers: Sex (0 or 1), Fbs (0 or 1), RestECG (0, 1 or 2).
- Categorical features coded as strings: ChestPain, Thal.
There are missing values in Ca and Thal. Since the proportion of missing values is low, we will use simple mean imputation (for the numerical feature Ca) and mode imputation (for the categorical feature Thal).

rf_recipe <- 
  recipe(
    AHD ~ ., 
    data = Heart_other
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # # center and scale numeric data (not necessary for random forest)
  # step_normalize(all_numeric_predictors()) %>%
  # estimate the imputation values (mean of Ca, mode of Thal) from the non-test data
  prep(training = Heart_other, retain = TRUE)
rf_recipe

Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Training data contained 227 data points and 4 incomplete rows. 

Operations:

Mean imputation for Ca [trained]
Mode imputation for Thal [trained]
Zero variance filter removed <none> [trained]

There are missing values in the Ca (quantitative) and Thal (qualitative) variables. We are going to use simple mean imputation for Ca and most_frequent imputation for Thal. This is suboptimal; a better strategy is to use multiple imputation.

# How many NaNs
Heart.isna().sum()

Unnamed: 0    0
Age           0
Sex           0
ChestPain     0
RestBP        0
Chol          0
Fbs           0
RestECG       0
MaxHR         0
ExAng         0
Oldpeak       0
Slope         0
Ca            4
Thal          2
AHD           0
dtype: int64
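The two imputation strategies can be sketched on a tiny made-up table (the values and the names `train`, `test`, `mean_imp`, `mode_imp` are hypothetical, not taken from the Heart data). The point of the sketch is that the fill values are learned from the non-test data and then reused unchanged on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy non-test data with one missing value per column (values are made up)
train = pd.DataFrame({
    "Ca": [0.0, 1.0, np.nan, 3.0],
    "Thal": ["normal", "normal", np.nan, "fixed"],
})
# Toy test data, also with missing values in the first row
test = pd.DataFrame({
    "Ca": [np.nan, 2.0],
    "Thal": [np.nan, "reversable"],
})

mean_imp = SimpleImputer(strategy="mean").fit(train[["Ca"]])
mode_imp = SimpleImputer(strategy="most_frequent").fit(train[["Thal"]])

# The fill values are learned from the non-test data only ...
ca_fill = mean_imp.statistics_[0]    # mean of 0, 1, 3
thal_fill = mode_imp.statistics_[0]  # most frequent level

# ... and reused unchanged when transforming the test data
test_ca = mean_imp.transform(test[["Ca"]])[0, 0]
test_thal = mode_imp.transform(test[["Thal"]])[0, 0]
print(ca_fill, thal_fill, test_ca, test_thal)
```

Wrapping the imputers in a pipeline, as done in this section, gives the same guarantee inside cross-validation: each fold's fill values come only from that fold's training part.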

In principle, decision trees should be able to handle categorical predictors directly. However, the scikit-learn random forest implementation (and, traditionally, xgboost) does not accept categorical predictors as such, so we one-hot encode them.

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Transformer for categorical variables
categorical_tf = Pipeline(steps = [
  ("cat_impute", SimpleImputer(strategy = 'most_frequent')),
  ("encoder", OneHotEncoder())
])

# Transformer for continuous variables
numeric_tf = Pipeline(steps = [
  ("num_impute", SimpleImputer(strategy = 'mean')),
])

# Column transformer
col_tf = ColumnTransformer(transformers = [
  ('num', numeric_tf, num_features),
  ('cat', categorical_tf, cat_features)
])
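A minimal sketch of what such a column transformer produces, on made-up data with two numeric columns and one string column (the names `toy_tf`, `train`, `new` are hypothetical). Two choices here are assumptions that are not in the original code: `handle_unknown="ignore"`, which encodes a level unseen during fitting as all zeros instead of raising an error, and `sparse_threshold=0`, which forces a dense output for easy printing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: two numeric columns and one string column with three levels
train = pd.DataFrame({
    "Age": [63, 67, 37, 41, 56],
    "Ca": [0.0, 3.0, np.nan, 0.0, 1.0],
    "Thal": ["fixed", "normal", "reversable", "normal", np.nan],
})

toy_tf = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["Age", "Ca"]),
        ("cat", Pipeline(steps=[
            ("cat_impute", SimpleImputer(strategy="most_frequent")),
            # handle_unknown="ignore" is an assumption added for this sketch
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Thal"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train = toy_tf.fit_transform(train)
print(X_train.shape)  # 2 numeric columns + 3 indicator columns

# A level never seen during fitting gets all-zero indicator columns
new = pd.DataFrame({"Age": [50], "Ca": [1.0], "Thal": ["unknown_level"]})
X_new = toy_tf.transform(new)
print(X_new)
```

With the default `OneHotEncoder()`, a level that appears only in a validation fold or in the test set would raise an error at transform time, which can matter for rare levels in small data sets.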

5 Model

rf_mod <- 
  rand_forest(
    mode = "classification",
    # Number of predictors randomly sampled in each split
    mtry = tune(),
    # Number of trees in ensemble
    trees = tune()
  ) %>% 
  set_engine("ranger")
rf_mod

Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger

from sklearn.ensemble import RandomForestClassifier

rf_mod = RandomForestClassifier(
  # Number of trees
  n_estimators = 100, 
  criterion = 'gini',
  # Number of features to use in each split
  max_features = 'sqrt',
  oob_score = True,
  random_state = 425
  )
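The model above sets `oob_score = True`. As a sketch of what that option provides, the following fits a forest on synthetic data (generated with `make_classification`, not the Heart data; the names `X`, `y`, `rf` are hypothetical) and reads the out-of-bag accuracy, which is computed from the observations each tree did not see in its bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the Heart data)
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=425
)

rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,
    random_state=425,
)
rf.fit(X, y)

# Accuracy estimated from the samples left out of each tree's bootstrap sample
print(rf.oob_score_)
```

The out-of-bag score is an accuracy, so it is not directly comparable to the CV AUC used for tuning later, but it offers a quick generalization estimate without a separate validation split.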

6 Workflow in R and pipeline in Python

Here we bundle the preprocessing step (Python) or recipe (R) and model.

rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod)
rf_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mean()
• step_impute_mode()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()

Computational engine: ranger

from sklearn.pipeline import Pipeline

pipe = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", rf_mod)
  ])
pipe

Pipeline(steps=[('col_tf',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('num_impute',
                                                                   SimpleImputer())]),
                                                  ['Age', 'Sex', 'RestBP',
                                                   'Chol', 'Fbs', 'RestECG',
                                                   'MaxHR', 'ExAng', 'Oldpeak',
                                                   'Slope', 'Ca']),
                                                 ('cat',
                                                  Pipeline(steps=[('cat_impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder())]),
                                                  ['ChestPain', 'Thal'])])),
                ('model',
                 RandomForestClassifier(oob_score=True, random_state=425))])

7 Tuning grid

Here we tune the number of trees (trees) and the number of features to use in each split (mtry).

param_grid <- grid_regular(
  trees(range = c(100L, 300L)), 
  mtry(range = c(1L, 5L)),
  levels = c(3, 5)
  )
param_grid

# A tibble: 15 × 2
   trees  mtry
   <int> <int>
 1   100     1
 2   200     1
 3   300     1
 4   100     2
 5   200     2
 6   300     2
 7   100     3
 8   200     3
 9   300     3
10   100     4
11   200     4
12   300     4
13   100     5
14   200     5
15   300     5

In general, a random forest needs little tuning. The default n_estimators=100, combined with either max_features='sqrt' or max_features=1.0 (which amounts to bagging), often works well.

Here we tune the number of trees (n_estimators) and the number of features to use in each split (max_features).

# Tune hyper-parameter(s)
B_grid = [50, 100, 150, 200, 250, 300]
m_grid = ['sqrt', 'log2', 1.0] # max_features = 1.0 uses all features
tuned_parameters = {
  "model__n_estimators": B_grid,
  "model__max_features": m_grid
  }
tuned_parameters

{'model__n_estimators': [50, 100, 150, 200, 250, 300], 'model__max_features': ['sqrt', 'log2', 1.0]}
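To make the size of this grid concrete, the sketch below enumerates it with `ParameterGrid` and works out how many features each `max_features` setting would correspond to. Two assumptions are labeled in the code: that the preprocessed design matrix has 18 columns (11 numeric plus 4 ChestPain and 3 Thal indicators, if every level appears in the non-test data), and that scikit-learn takes the integer part of sqrt(p) and log2(p), which is my reading of its behavior rather than something stated in this document.

```python
import math
from sklearn.model_selection import ParameterGrid

tuned_parameters = {
    "model__n_estimators": [50, 100, 150, 200, 250, 300],
    "model__max_features": ["sqrt", "log2", 1.0],
}
grid = list(ParameterGrid(tuned_parameters))
n_candidates = len(grid)
print(n_candidates)

# Assumed number of columns after preprocessing:
# 11 numeric + 4 ChestPain indicators + 3 Thal indicators
p = 11 + 4 + 3
# Assumed resolution rule: integer part, with a minimum of 1
m_sqrt = max(1, int(math.sqrt(p)))
m_log2 = max(1, int(math.log2(p)))
m_all = max(1, int(1.0 * p))
print(m_sqrt, m_log2, m_all)
```

Under these assumptions, 'sqrt' and 'log2' resolve to the same number of features for this data, so the grid effectively compares two settings of `max_features` rather than three.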

8 Cross-validation (CV)

Set cross-validation partitions.

set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds

#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [181/46]> Fold1
2 <split [181/46]> Fold2
3 <split [182/45]> Fold3
4 <split [182/45]> Fold4
5 <split [182/45]> Fold5

Fit cross-validation.

rf_fit <- rf_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )
rf_fit

# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [181/46]> Fold1 <tibble [30 × 6]> <tibble [0 × 3]>
2 <split [181/46]> Fold2 <tibble [30 × 6]> <tibble [0 × 3]>
3 <split [182/45]> Fold3 <tibble [30 × 6]> <tibble [0 × 3]>
4 <split [182/45]> Fold4 <tibble [30 × 6]> <tibble [0 × 3]>
5 <split [182/45]> Fold5 <tibble [30 × 6]> <tibble [0 × 3]>

Visualize CV results:

rf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = trees, y = mean, color = mtry)) +
  geom_point() + 
  # geom_line() + 
  labs(x = "Num. of Trees", y = "CV AUC")

# A tibble: 30 × 8
    mtry trees .metric  .estimator  mean     n std_err .config              
   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1     1   100 accuracy binary     0.859     5  0.0262 Preprocessor1_Model01
 2     1   100 roc_auc  binary     0.934     5  0.0197 Preprocessor1_Model01
 3     1   200 accuracy binary     0.854     5  0.0153 Preprocessor1_Model02
 4     1   200 roc_auc  binary     0.939     5  0.0187 Preprocessor1_Model02
 5     1   300 accuracy binary     0.846     5  0.0140 Preprocessor1_Model03
 6     1   300 roc_auc  binary     0.938     5  0.0204 Preprocessor1_Model03
 7     2   100 accuracy binary     0.855     5  0.0177 Preprocessor1_Model04
 8     2   100 roc_auc  binary     0.932     5  0.0218 Preprocessor1_Model04
 9     2   200 accuracy binary     0.850     5  0.0215 Preprocessor1_Model05
10     2   200 roc_auc  binary     0.932     5  0.0200 Preprocessor1_Model05
# … with 20 more rows

Show the top 5 models.

rf_fit %>%
  show_best("roc_auc")

# A tibble: 5 × 8
   mtry trees .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     1   200 roc_auc binary     0.939     5  0.0187 Preprocessor1_Model02
2     1   300 roc_auc binary     0.938     5  0.0204 Preprocessor1_Model03
3     2   300 roc_auc binary     0.937     5  0.0168 Preprocessor1_Model06
4     3   300 roc_auc binary     0.937     5  0.0175 Preprocessor1_Model09
5     1   100 roc_auc binary     0.934     5  0.0197 Preprocessor1_Model01

Let’s select the best model.

best_rf <- rf_fit %>%
  select_best("roc_auc")
best_rf

# A tibble: 1 × 3
   mtry trees .config              
  <int> <int> <chr>                
1     1   200 Preprocessor1_Model02

Set up CV partitions and CV criterion.

from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 5
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "roc_auc",
  # Refit the best model on the whole non-test data set
  refit = True
  )
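When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified K-fold partitions without shuffling. A minimal sketch of making the partition explicit and reproducible follows; the class counts (123 No, 104 Yes) and the names `cv`, `fold_sizes` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels for a 227-row non-test set
# (123 "No" and 104 "Yes" is an assumption for illustration)
y = np.array(["No"] * 123 + ["Yes"] * 104)
X = np.zeros((len(y), 1))  # features are irrelevant for the partition itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=425)
fold_sizes = []
for train_idx, val_idx in cv.split(X, y):
    fold_sizes.append(len(val_idx))
    # Each validation fold keeps roughly the overall share of "Yes"
    print(len(val_idx), round((y[val_idx] == "Yes").mean(), 3))
```

Such an object can be passed as `cv = cv` to `GridSearchCV` in place of the integer, which also shuffles the rows before partitioning.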

Fit CV. This is typically the most time-consuming step.

# Fit CV
search.fit(X_other, y_other)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('col_tf',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer())]),
                                                                         ['Age',
                                                                          'Sex',
                                                                          'RestBP',
                                                                          'Chol',
                                                                          'Fbs',
                                                                          'RestECG',
                                                                          'MaxHR',
                                                                          'ExAng',
                                                                          'Oldpeak',
                                                                          'Slope',
                                                                          'Ca']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('cat_impute',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('encoder',
                                                                                          OneHotEncoder())]),
                                                                         ['ChestPain',
                                                                          'Thal'])])),
                                       ('model',
                                        RandomForestClassifier(oob_score=True,
                                                               random_state=425))]),
             param_grid={'model__max_features': ['sqrt', 'log2', 1.0],
                         'model__n_estimators': [50, 100, 150, 200, 250, 300]},
             scoring='roc_auc')

Visualize CV results.

cv_res = pd.DataFrame({
  "B": np.array(search.cv_results_["param_model__n_estimators"]),
  "auc": search.cv_results_["mean_test_score"],
  "m": search.cv_results_["param_model__max_features"]
  })

# relplot() creates its own figure, so no plt.figure() call is needed
sns.relplot(
  # kind = "line",
  data = cv_res,
  x = "B",
  y = "auc",
  hue = "m"
  ).set(
    xlabel = "B",
    ylabel = "CV AUC"
);
plt.show()

Best CV AUC:

search.best_score_

0.905690476190476
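The R code lists the top candidates with `show_best()`. A rough Python counterpart can be sketched by sorting `cv_results_` by rank. The example uses synthetic data and a deliberately small grid so that it runs quickly; the names `toy_search`, `cols`, and `top` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately small grid, to keep the sketch fast
X, y = make_classification(n_samples=200, n_features=10, random_state=425)
toy_search = GridSearchCV(
    RandomForestClassifier(random_state=425),
    {"n_estimators": [25, 50], "max_features": ["sqrt", 1.0]},
    cv=3,
    scoring="roc_auc",
)
toy_search.fit(X, y)

# Keep the tuned parameters and the CV summary, best candidates first
cols = [
    "param_n_estimators", "param_max_features",
    "mean_test_score", "std_test_score", "rank_test_score",
]
top = (
    pd.DataFrame(toy_search.cv_results_)[cols]
    .sort_values("rank_test_score")
    .head(5)
)
print(top)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the differences between the top candidates are larger than the fold-to-fold variation.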

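The training accuracy of the refitted best model is computed on the same non-test data it was fit on, so it is an optimistic estimate: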

from sklearn.metrics import accuracy_score, roc_auc_score

accuracy_score(
  y_other,
  search.best_estimator_.predict(X_other)
  )

1.0
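A training accuracy this high is typical for random forests with fully grown trees and says little about performance on new data. The sketch below illustrates the gap on synthetic data in which the labels of about 20% of the samples are reassigned at random (the `flip_y` argument), so no model can be perfect; the data and the names `rf`, `train_acc`, `cv_acc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; about 20% of samples get a randomly reassigned label
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5,
    flip_y=0.2, random_state=425,
)

rf = RandomForestClassifier(n_estimators=100, random_state=425)

# Accuracy on the data used for fitting
train_acc = rf.fit(X, y).score(X, y)
# Mean accuracy over 5 held-out folds (the forest is refit on each fold)
cv_acc = cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean()
print(train_acc, cv_acc)
```

The cross-validated figure, not the training figure, is the one to compare with the test-set metrics.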

9 Finalize our model

Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.

# Final workflow
final_wf <- rf_wf %>%
  finalize_workflow(best_rf)
final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mean()
• step_impute_mode()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 1
  trees = 200

Computational engine: ranger

# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit

# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits           id               .metrics .notes   .predictions .workflow 
  <list>           <chr>            <list>   <list>   <list>       <list>    
1 <split [227/76]> train/test split <tibble> <tibble> <tibble>     <workflow>

# Test metrics
final_fit %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.763 Preprocessor1_Model1
2 roc_auc  binary         0.856 Preprocessor1_Model1

Since we called GridSearchCV with refit = True, the best model fit on the whole non-test data is readily available.

search.best_estimator_

Pipeline(steps=[('col_tf',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('num_impute',
                                                                   SimpleImputer())]),
                                                  ['Age', 'Sex', 'RestBP',
                                                   'Chol', 'Fbs', 'RestECG',
                                                   'MaxHR', 'ExAng', 'Oldpeak',
                                                   'Slope', 'Ca']),
                                                 ('cat',
                                                  Pipeline(steps=[('cat_impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder())]),
                                                  ['ChestPain', 'Thal'])])),
                ('model',
                 RandomForestClassifier(n_estimators=50, oob_score=True,
                                        random_state=425))])
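To make the role of refit = True concrete, the following sketch compares best_estimator_ with a model refit by hand using best_params_. It runs on synthetic data rather than the Heart data, and the names demo_search, rf, grid, and manual, along with the grid values, are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the Heart data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=425
)

rf = RandomForestClassifier(random_state=425)
grid = {"n_estimators": [25, 50], "max_features": [2, 4]}

# refit=True (the default) refits the best configuration on all of X_demo
demo_search = GridSearchCV(rf, grid, cv=5, scoring="roc_auc", refit=True)
demo_search.fit(X_demo, y_demo)

# Manual equivalent: clone the base estimator, set the best parameters, fit on all data
manual = clone(rf).set_params(**demo_search.best_params_).fit(X_demo, y_demo)

# With a fixed random_state the two fits should give identical probabilities
same = np.allclose(
    demo_search.best_estimator_.predict_proba(X_demo),
    manual.predict_proba(X_demo),
)
print(demo_search.best_params_, same)
```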

Feature importances:

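# The last two entries of `features` are the categorical columns (ChestPain, Thal).
# Replace them with their one-hot encoded levels, in the order produced by
# OneHotEncoder (alphabetical within each variable).
features = np.concatenate([
    features[:-2], 
    ['ChestPain:asymptomatic', 'ChestPain:nonanginal', 'ChestPain:nontypical', 'ChestPain:typical'],
    ['Thal:fixed', 'Thal:normal', 'Thal:reversable']
    ])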

vi_df = pd.DataFrame({
  "feature": features,
  "vi": search.best_estimator_['model'].feature_importances_,
  "vi_std": np.std([tree.feature_importances_ for tree in search.best_estimator_['model'].estimators_], axis = 0)
  })

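# DataFrame.plot.bar creates its own figure, so a separate plt.figure() call
# would only leave an extra empty figure
vi_df.plot.bar(x = "feature", y = "vi", yerr = "vi_std")
plt.xticks(rotation = 90);
plt.show()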

The final AUC on the test set is

roc_auc_score(
  y_test,
  search.best_estimator_.predict_proba(X_test)[:, 1]
  )

0.9055749128919861

The final classification accuracy on the test set is

accuracy_score(
  y_test,
  search.best_estimator_.predict(X_test)
  )

0.8289473684210527
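The two test metrics use different outputs of the model: AUC is computed from the predicted probabilities and does not depend on a cutoff, whereas accuracy is computed from the hard labels returned by predict, which for a binary forest amounts to picking the class with the larger averaged probability. The sketch below checks this relationship on synthetic data. It is an illustration under stated assumptions, not a re-analysis of the Heart data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Heart data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=425
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=425
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=425).fit(X_tr, y_tr)

proba = rf_demo.predict_proba(X_te)   # columns follow rf_demo.classes_, here [0, 1]
labels = rf_demo.predict(X_te)

# predict() returns the class with the larger averaged probability
labels_from_proba = (proba[:, 1] > proba[:, 0]).astype(int)

test_auc = roc_auc_score(y_te, proba[:, 1])   # threshold-free
test_acc = accuracy_score(y_te, labels)       # depends on the implicit cutoff

print(f"test AUC:      {test_auc:.3f}")
print(f"test accuracy: {test_acc:.3f}")
print("labels match probabilities:", np.array_equal(labels, labels_from_proba))
```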