tidymodels
Biostat 203B
1 Overview
- A typical data science project:
tidymodels is an ecosystem for:
- Feature engineering: coding qualitative predictors, transformation of predictors (e.g., log), extracting key features from raw variables (e.g., getting the day of the week out of a date variable), interaction terms, … (recipes package);
- Building and fitting a model (parsnip package);
- Evaluating a model using resampling, such as cross-validation (rsample package);
- Tuning model parameters (tune and dials packages).
- tidymodels is the R analog of sklearn.pipeline in Python and MLJ.jl in Julia.
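The steps listed above can be sketched end to end as follows. This is a minimal illustration, not the workflow used later in these notes: the data are simulated, and the variable names (`income`, `sex`, `visit`, `disease`), the glmnet engine, the 5-fold cross-validation, and the grid sizes are all assumptions made for the example.

```r
library(tidymodels)

# Simulated toy data (hypothetical; NOT the heart data used in these notes)
set.seed(203)
n <- 400
income <- rlnorm(n, meanlog = 10, sdlog = 0.5)
p_yes <- plogis(-0.5 + 1.5 * (log(income) - 10))
toy <- tibble(
  income  = income,
  sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  visit   = as.Date("2024-01-01") + sample(0:364, n, replace = TRUE),
  disease = factor(ifelse(runif(n) < p_yes, "Yes", "No"), levels = c("No", "Yes"))
)

# Initial train/test split (rsample)
split <- initial_split(toy, prop = 0.75, strata = disease)
train <- training(split)

# Feature engineering (recipes)
rec <- recipe(disease ~ ., data = train) |>
  step_log(income) |>                              # transformation of a predictor
  step_date(visit, features = "dow") |>            # day of the week from a date
  step_rm(visit) |>                                # drop the raw date
  step_dummy(all_nominal_predictors()) |>          # code qualitative predictors
  step_interact(~ income:starts_with("sex_")) |>   # interaction term
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Model specification (parsnip): elastic net logistic regression
mod <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# Bundle recipe and model
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(mod)

# Resampling (rsample) and tuning grid (dials)
folds <- vfold_cv(train, v = 5, strata = disease)
enet_grid <- grid_regular(
  penalty(range = c(-4, 0)),  # log10 scale
  mixture(),
  levels = c(5, 3)
)

# Tune over the grid (tune), scoring each candidate by cross-validated AUC
res <- tune_grid(wf, resamples = folds, grid = enet_grid,
                 metrics = metric_set(roc_auc))
best <- select_best(res, metric = "roc_auc")

# Refit on the full training set and evaluate once on the test set
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(split)
collect_metrics(final_fit)
```

Bundling the recipe and the model in one workflow means the preprocessing is re-estimated within each resample, so the cross-validated metrics are not contaminated by information from the held-out fold.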
2 Heart data example
We illustrate binary classification using the heart disease dataset from the Cleveland Clinic Foundation.
2.1 Logistic regression (with enet regularization) workflow
2.2 Random forest workflow
2.3 Boosting (XGBoost) workflow
2.4 SVM (with radial basis kernel) workflow
2.5 Multi-layer perceptron (MLP) workflow
2.6 Ensemble (model stacking) workflow
We differentiate homogeneous ensembles (e.g., bagging, boosting) from heterogeneous ensembles (e.g., stacking). The former build multiple models of the same type (e.g., decision trees) and then combine them. The latter build multiple models of different types (e.g., random forest, SVM, and neural network) and then combine them.
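A heterogeneous ensemble can be sketched with the stacks package, which is one way to do model stacking in tidymodels. The use of stacks here is an assumption, since these notes do not name the package. The data are simulated, and the two candidate model types (elastic net logistic regression and random forest), the engines, and the grid sizes are illustrative choices only.

```r
library(tidymodels)
library(stacks)

# Simulated toy data (hypothetical; NOT the heart data used in these notes)
set.seed(203)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- tibble(
  x1 = x1,
  x2 = x2,
  y  = factor(ifelse(x1 + x2^2 + rnorm(n) > 1, "Yes", "No"), levels = c("No", "Yes"))
)

folds <- vfold_cv(toy, v = 5, strata = y)
rec <- recipe(y ~ ., data = toy) |>
  step_normalize(all_numeric_predictors())

# Stacking needs the out-of-fold predictions and the workflows to be saved
ctrl <- control_stack_grid()

# Candidate set 1: logistic regression with a tuned penalty
lr_res <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg(penalty = tune(), mixture = 0.5) |> set_engine("glmnet")) |>
  tune_grid(resamples = folds, grid = 3, control = ctrl)

# Candidate set 2: random forest with a tuned minimum node size
rf_res <- workflow() |>
  add_recipe(rec) |>
  add_model(
    rand_forest(min_n = tune(), trees = 200) |>
      set_engine("ranger") |>
      set_mode("classification")
  ) |>
  tune_grid(resamples = folds, grid = 3, control = ctrl)

# Combine candidates of different model types into one stacked ensemble
set.seed(203)
stack_fit <- stacks() |>
  add_candidates(lr_res) |>
  add_candidates(rf_res) |>
  blend_predictions() |>  # regularized meta-learner on out-of-fold predictions
  fit_members()           # refit the retained candidates on the full data

preds <- predict(stack_fit, new_data = toy, type = "class")
```

Both candidate sets are tuned on the same `folds` object, which stacks requires so that the out-of-fold predictions line up row by row when the meta-learner is fit.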