library(tidymodels)  # attaches rsample (initial_split) and dplyr (arrange)

set.seed(203)

# sort
mimiciv_icu_cohort <- mimiciv_icu_cohort |>
  arrange(subject_id, hadm_id, stay_id)

data_split <- initial_split(
  mimiciv_icu_cohort,
  # stratify by los_long
  strata = "los_long",
  prop = 0.5
)
Biostat 203B Homework 5
Due Mar 22 @ 11:59PM
Predicting ICU duration
Using the ICU cohort mimiciv_icu_cohort.rds you built in Homework 4, develop at least three machine learning approaches (logistic regression with elastic net regularization, random forest, boosting, SVM, MLP, etc.) plus a model stacking approach for predicting whether a patient’s ICU stay will be longer than 2 days. Use the los_long variable as the outcome. Your algorithms can use patient demographic information (gender, age at ICU intime, marital status, race), ICU admission information (first care unit), the last lab measurements before the ICU stay, and the first vital measurements during the ICU stay as features. You are welcome to use any feature engineering techniques you think are appropriate, but make sure not to use features that are unavailable at an ICU stay’s intime. For instance, last_careunit cannot be used in your algorithms.
Data preprocessing and feature engineering.
Partition the data into a 50% training set and a 50% test set. Stratify the partitioning according to los_long. For grading purposes, sort the data by subject_id, hadm_id, and stay_id, and use the seed 203 for the initial data split. The sample code appears at the top of this document.
Train and tune the models using the training set.
Compare model classification performance on the test set. Report both the area under ROC curve and accuracy for each machine learning algorithm and the model stacking. Interpret the results. What are the most important features in predicting long ICU stays? How do the models compare in terms of performance and interpretability?
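As a starting point, the four steps above can be sketched for a single model as follows. This is a minimal sketch under stated assumptions, not a reference solution: the data frame `toy` and its columns `age`, `heart_rate`, and `gender` are hypothetical stand-ins for the real cohort, the preprocessing steps and tuning grid are illustrative choices, and the `glmnet` package is assumed to be installed. It covers only elastic net logistic regression; the other models and the stacking step are not shown.

```r
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick

set.seed(203)

# Hypothetical toy data standing in for mimiciv_icu_cohort (assumption)
n <- 400
toy <- tibble(
  age        = rnorm(n, mean = 65, sd = 15),
  heart_rate = rnorm(n, mean = 85, sd = 18),
  gender     = factor(sample(c("F", "M"), n, replace = TRUE))
) |>
  mutate(
    # Classification models in tidymodels need a factor outcome
    los_long = factor(
      rbinom(n, 1, plogis(-0.5 + 0.03 * (heart_rate - 85))) == 1,
      levels = c(FALSE, TRUE)
    )
  )

# 50/50 split stratified by the outcome
toy_split <- initial_split(toy, strata = "los_long", prop = 0.5)
toy_train <- training(toy_split)

# Preprocessing: impute, dummy-code, drop constant columns, standardize
rec <- recipe(los_long ~ ., data = toy_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Elastic net logistic regression with both hyperparameters tuned
enet_mod <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(enet_mod)

# 5-fold cross-validation on the training set only
folds <- vfold_cv(toy_train, v = 5, strata = los_long)

# penalty() ranges are on the log10 scale
grid <- grid_regular(
  penalty(range = c(-4, 0)),
  mixture(range = c(0, 1)),
  levels = c(5, 3)
)

tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metric_set(roc_auc, accuracy)
)

best <- select_best(tuned, metric = "roc_auc")

# Refit on the full training set and evaluate once on the test set
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(toy_split, metrics = metric_set(roc_auc, accuracy))

test_metrics <- collect_metrics(final_fit)
test_metrics
```

With the real cohort, the recipe would also need to exclude columns that are not known at intime, such as last_careunit, before any model is fit. The same workflow pattern can be repeated with other model specifications so that test-set AUC and accuracy are computed the same way for every model.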