Display machine information:

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.3  magrittr_2.0.2 
##  [5] evaluate_0.14   stringi_1.7.6   rlang_1.0.1     cli_3.1.1      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.11 
## [13] tools_4.1.2     stringr_1.4.0   xfun_0.29       yaml_2.2.1     
## [17] fastmap_1.1.0   compiler_4.1.2  htmltools_0.5.2 knitr_1.37     
## [21] sass_0.4.0

Load database libraries and the tidyverse frontend:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(miceRanger))

Q1. Missing data

Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.

  1. Read following tutorials on the R package miceRanger for imputation: https://github.com/farrellday/miceRanger, https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html.

    A more thorough book treatment of the practical imputation strategies is the book Flexible Imputation of Missing Data by Stef van Buuren.

  2. Explain the jargon MCAR, MAR, and MNAR.

  3. Explain in a couple of sentences how the Multiple Imputation by Chained Equations (MICE) work.

  4. Perform a data quality check of the ICU stays data. Discard variables with substantial missingness, say >5000 NAs. Replace apparent data entry errors by NAs.

  5. Impute missing values by miceRanger (request \(m=3\) data sets). This step is computational intensive. Make sure to save the imputation results as a file. Hint: Setting max.depth=10 in the miceRanger function may cut some computing time.

  6. Make imputation diagnostic plots and explain what they mean.

  7. Choose one of the imputed data sets to be used in Q2. This is not a good idea to use just one imputed data set or to average multiple imputed data sets. Explain in a couple of sentences what the correct Multiple Imputation strategy is.

Q2. Predicting 30-day mortality

Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (glm() function in base R or keras), (2) logistic regression with lasso penalty (glmnet or keras package), (3) random forest (randomForest package), or (4) neural network (keras package).

  1. Partition data into 80% training set and 20% test set. Stratify partitioning according the 30-day mortality status.

  2. Train the models using the training set.

  3. Compare model prediction performance on the test set.