tidymodels

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 29, 2024

1 Overview

[Figure: a typical data science project.]

tidymodels is an ecosystem for:
1. Feature engineering: coding qualitative predictors, transforming predictors (e.g., log), extracting key features from raw variables (e.g., getting the day of the week out of a date variable), adding interaction terms, … (recipes package);
2. Building and fitting a model (parsnip package);
3. Evaluating a model using resampling, such as cross-validation (rsample and tune packages);
4. Tuning model parameters (tune and dials packages).
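
The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the workflow used in the course examples: the `two_class_dat` data set from the modeldata package stands in for real data, the lasso-penalized logistic regression and the penalty grid are arbitrary choices, and the glmnet package is assumed to be installed.

```r
library(tidymodels)  # attaches recipes, parsnip, rsample, tune, dials, workflows, yardstick

set.seed(203)

# Assumption: two_class_dat (modeldata) is a stand-in data set with
# positive numeric predictors A, B and a two-level factor outcome Class.
data(two_class_dat, package = "modeldata")
data_split <- initial_split(two_class_dat, prop = 0.75, strata = Class)
train <- training(data_split)

# 1. Feature engineering (recipes): log-transform, then center and scale.
rec <- recipe(Class ~ A + B, data = train) |>
  step_log(A, B) |>
  step_normalize(all_numeric_predictors())

# 2. Model specification (parsnip): lasso logistic regression,
#    with the penalty left as a tuning parameter.
mod <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(mod)

# 3. Resampling (rsample): 5-fold cross-validation on the training set.
folds <- vfold_cv(train, v = 5, strata = Class)

# 4. Tuning (tune and dials): grid over the penalty, on the log10 scale.
grid <- grid_regular(penalty(range = c(-4, -1)), levels = 10)
tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metric_set(roc_auc)
)

# Refit the best configuration on the training set and
# evaluate it once on the held-out test set.
best <- select_best(tuned, metric = "roc_auc")
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(data_split)

collect_metrics(final_fit)
```

Bundling the recipe and the model in a single workflow means the preprocessing is re-estimated within each resample, which keeps information from the assessment folds out of the fitted transformations.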

tidymodels is the R analog of sklearn.pipeline in Python and MLJ.jl in Julia.

2 Heart data example

We illustrate binary classification using the heart disease dataset from the Cleveland Clinic Foundation.

2.1 Logistic regression (with elastic net regularization) workflow

qmd, html

2.2 Random forest workflow

qmd, html

2.3 Boosting (XGBoost) workflow

qmd, html

2.4 SVM (with radial basis kernel) workflow

qmd, html

2.5 Multi-layer perceptron (MLP) workflow

qmd, html

2.6 Ensemble (model stacking) workflow

We differentiate homogeneous ensembles (e.g., bagging, boosting) from heterogeneous ensembles (e.g., stacking). The former build multiple models of the same type (e.g., decision trees) and then combine them. The latter build multiple models of different types (e.g., random forest, SVM, and neural network) and then combine them.
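
As a rough illustration of a heterogeneous ensemble, the sketch below stacks two different model types with the stacks package. It rests on several assumptions that are not part of the linked workflow: `two_class_dat` from modeldata is a stand-in data set, the base learners (a logistic regression and a tuned decision tree) are chosen only to keep dependencies light, and the stacks, rpart, and glmnet packages are assumed to be installed.

```r
library(tidymodels)
library(stacks)

set.seed(203)

# Assumption: two_class_dat (modeldata) is a stand-in data set.
data(two_class_dat, package = "modeldata")

# All candidates must be evaluated on the same resamples.
folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

rec <- recipe(Class ~ A + B, data = two_class_dat) |>
  step_normalize(all_numeric_predictors())

# Base learner 1: logistic regression, evaluated by resampling.
# control_stack_resamples() saves the out-of-fold predictions stacking needs.
logit_res <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  fit_resamples(resamples = folds, control = control_stack_resamples())

# Base learner 2: decision tree with a tuned cost-complexity parameter.
tree_spec <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

tree_res <- workflow() |>
  add_recipe(rec) |>
  add_model(tree_spec) |>
  tune_grid(resamples = folds, grid = 5, control = control_stack_grid())

# Combine the candidates: a regularized meta-learner weights their
# out-of-fold predictions, then retained members are refit on the full data.
ens <- stacks() |>
  add_candidates(logit_res) |>
  add_candidates(tree_res) |>
  blend_predictions() |>
  fit_members()

preds <- predict(ens, two_class_dat, type = "class")
```

The meta-learner is trained on out-of-fold predictions rather than in-sample fits, so a base model that overfits its training folds does not automatically receive a large weight.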

qmd, html