Biostat 203B Homework 2

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Q2. ICU stays
Q3. admission data
Q4. patients data
Q5. Lab results
Q6. Vitals from charted events
Q7. Putting things together
Q8. Exploratory data analysis (EDA)

Display machine information for reproducibility:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   rlang_0.4.12    stringi_1.7.6   jquerylib_0.1.4
##  [9] bslib_0.3.1     rmarkdown_2.11  tools_3.6.0     stringr_1.4.0  
## [13] xfun_0.29       yaml_2.2.1      fastmap_1.1.0   compiler_3.6.0 
## [17] htmltools_0.5.2 knitr_1.37      sass_0.4.0

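knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE)
library(tidyverse)
library(data.table)
library(lubridate)

# Pick the MIMIC-IV data directory based on the operating system
os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/mnt/mimiciv/1.0"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0"
} else {
  # Fail early rather than leaving mimic_path undefined
  stop("Unrecognized operating system; set mimic_path manually.")
}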

In this exercise, we use tidyverse (ggplot2, dplyr, etc.) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

# tree -s -L 2 /Users/huazhou/Documents/Box\ Sync/MIMIC/mimic-iv-1.0
system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)

##  [1] "/mnt/mimiciv/1.0"                                
##  [2] "├── [       4096]  core"                         
##  [3] "│   ├── [   17208966]  admissions.csv.gz"        
##  [4] "│   ├── [        606]  index.html"               
##  [5] "│   ├── [    2955582]  patients.csv.gz"          
##  [6] "│   └── [   53014503]  transfers.csv.gz"         
##  [7] "├── [       4096]  hosp"                         
##  [8] "│   ├── [     430049]  d_hcpcs.csv.gz"           
##  [9] "│   ├── [   29531152]  diagnoses_icd.csv.gz"     
## [10] "│   ├── [     863239]  d_icd_diagnoses.csv.gz"   
## [11] "│   ├── [     579998]  d_icd_procedures.csv.gz"  
## [12] "│   ├── [      14898]  d_labitems.csv.gz"        
## [13] "│   ├── [   11684062]  drgcodes.csv.gz"          
## [14] "│   ├── [  515763427]  emar.csv.gz"              
## [15] "│   ├── [  476252563]  emar_detail.csv.gz"       
## [16] "│   ├── [    2098831]  hcpcsevents.csv.gz"       
## [17] "│   ├── [       2325]  index.html"               
## [18] "│   ├── [ 2091865786]  labevents.csv.gz"         
## [19] "│   ├── [   99133381]  microbiologyevents.csv.gz"
## [20] "│   ├── [  422874088]  pharmacy.csv.gz"          
## [21] "│   ├── [  501381155]  poe.csv.gz"               
## [22] "│   ├── [   24020923]  poe_detail.csv.gz"        
## [23] "│   ├── [  367041717]  prescriptions.csv.gz"     
## [24] "│   ├── [    7750325]  procedures_icd.csv.gz"    
## [25] "│   └── [    9565293]  services.csv.gz"          
## [26] "├── [       4096]  icu"                          
## [27] "│   ├── [ 2350783547]  chartevents.csv.gz"       
## [28] "│   ├── [   43296273]  datetimeevents.csv.gz"    
## [29] "│   ├── [      55917]  d_items.csv.gz"           
## [30] "│   ├── [    2848628]  icustays.csv.gz"          
## [31] "│   ├── [       1103]  index.html"               
## [32] "│   ├── [  352443512]  inputevents.csv.gz"       
## [33] "│   ├── [   37095672]  outputevents.csv.gz"      
## [34] "│   └── [   20567368]  procedureevents.csv.gz"   
## [35] "├── [        797]  index.html"                   
## [36] "├── [       2518]  LICENSE.txt"                  
## [37] "└── [       2459]  SHA256SUMS.txt"               
## [38] ""                                                
## [39] "3 directories, 33 files"

Q1. `read.csv` (base R) vs `read_csv` (tidyverse) vs `fread` (data.table)

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderately sized compressed csv file, admissions.csv.gz, with three functions: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package.

Which function is fastest? Is there any difference in the (default) parsed data types? (Hint: R function system.time measures run times.)
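
The comparison can be sketched on a small stand-in file. This is only an illustration under stated assumptions: the toy data frame, its column names, and the temporary uncompressed csv file are made up for the example (the real task uses admissions.csv.gz), and timings on such a small file say little about the real one.

```r
library(readr)
library(data.table)

# Hypothetical toy data standing in for admissions.csv.gz
set.seed(203)
n <- 10000
toy <- data.frame(
  subject_id = sample(10000000:19999999, n, replace = TRUE),
  admittime = format(
    as.POSIXct("2150-01-01 00:00:00", tz = "UTC") +
      sample(0:1000000, n, replace = TRUE),
    "%Y-%m-%d %H:%M:%S"
  ),
  insurance = sample(c("Other", "Medicare", "Medicaid"), n, replace = TRUE)
)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Time each reader with system.time()
t_base  <- system.time(df_base  <- read.csv(toy_path))
t_readr <- system.time(df_readr <- read_csv(toy_path, show_col_types = FALSE))
t_dt    <- system.time(df_dt    <- fread(toy_path))

# Elapsed seconds per reader
rbind(read.csv = t_base, read_csv = t_readr, fread = t_dt)[, "elapsed"]

# Default parsed class of the datetime column under each reader
sapply(
  list(read.csv = df_base, read_csv = df_readr, fread = df_dt),
  function(d) class(d$admittime)[1]
)
```

The parsed classes depend on the R and package versions in use, so it is worth inspecting them directly (for example with str) rather than assuming them.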

For later questions, we stick to the tidyverse.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Unit (ICU) stays. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/icu/icustays.csv.gz")), 
    " | head"
    ), 
  intern = TRUE
)

##  [1] "subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los"                                                    
##  [2] "17867402,24528534,31793211,Trauma SICU (TSICU),Trauma SICU (TSICU),2154-03-03 04:11:00,2154-03-04 18:16:56,1.5874537037037035" 
##  [3] "14435996,28960964,31983544,Trauma SICU (TSICU),Trauma SICU (TSICU),2150-06-19 17:57:00,2150-06-22 18:33:54,3.025625"           
##  [4] "17609946,27385897,33183475,Trauma SICU (TSICU),Trauma SICU (TSICU),2138-02-05 18:54:00,2138-02-15 12:42:05,9.741724537037038"  
##  [5] "18966770,23483021,34131444,Trauma SICU (TSICU),Trauma SICU (TSICU),2123-10-25 10:35:00,2123-10-25 18:59:47,0.35054398148148147"
##  [6] "12776735,20817525,34547665,Neuro Stepdown,Neuro Stepdown,2200-07-12 00:33:00,2200-07-13 16:44:40,1.6747685185185184"           
##  [7] "10215159,24283593,34569476,Trauma SICU (TSICU),Trauma SICU (TSICU),2124-09-20 15:05:29,2124-09-21 22:06:58,1.2926967592592593" 
##  [8] "14489052,26516390,35056286,Trauma SICU (TSICU),Trauma SICU (TSICU),2118-10-26 10:33:56,2118-10-26 20:28:10,0.4126620370370371" 
##  [9] "15914763,28906020,36909804,Trauma SICU (TSICU),Trauma SICU (TSICU),2176-12-14 12:00:00,2176-12-17 11:47:01,2.9909837962962964" 
## [10] "16256226,20013290,39289362,Neuro Stepdown,Neuro Stepdown,2150-12-20 16:09:08,2150-12-21 14:58:40,0.9510648148148149"

Import icustays.csv.gz as a tibble icustays_tble.
How many unique subject_id values are there? Can a subject_id have multiple ICU stays?
For each subject_id, let’s only keep the first ICU stay in the tibble icustays_tble.
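
One possible way to count subjects and keep each subject's earliest stay is sketched below with dplyr. The tibble icustays_toy and its values are hypothetical; only the column names subject_id, stay_id, and intime are taken from the file header shown above.

```r
library(dplyr)

# Hypothetical toy stand-in for icustays_tble
icustays_toy <- tibble(
  subject_id = c(1, 1, 2, 3, 3),
  stay_id    = c(11, 12, 21, 31, 32),
  intime     = as.POSIXct(
    c("2150-01-05 10:00:00", "2150-01-01 08:00:00", "2151-03-02 12:00:00",
      "2152-07-01 09:00:00", "2152-07-10 09:00:00"),
    tz = "UTC"
  )
)

# Number of distinct subjects and number of stays per subject
n_subjects <- n_distinct(icustays_toy$subject_id)
stays_per_subject <- count(icustays_toy, subject_id, name = "n_stays")

# Sort by admission time within subject, then keep the first row per subject
first_stays <- icustays_toy %>%
  arrange(subject_id, intime) %>%
  distinct(subject_id, .keep_all = TRUE)
```

Sorting before distinct matters here: distinct with .keep_all = TRUE keeps the first row it encounters for each subject_id.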

Q3. `admission` data

Information about the patients admitted to the hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/core/admissions/ for details of each field in this file. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/core/admissions.csv.gz")), 
    " | head"
    ), 
  intern = TRUE
)

##  [1] "subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag"
##  [2] "14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0"                                                                                
##  [3] "15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
##  [4] "11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"                                                                                      
##  [5] "17817079,24709883,2165-12-27 17:33:00,2165-12-31 21:18:00,,ELECTIVE,,HOME,Other,ENGLISH,,OTHER,,,0"                                                                                        
##  [6] "15078341,23272159,2122-08-28 08:48:00,2122-08-30 12:32:00,,ELECTIVE,,HOME,Other,ENGLISH,,BLACK/AFRICAN AMERICAN,,,0"                                                                       
##  [7] "19124609,20517215,2169-03-14 12:44:00,2169-03-20 19:15:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"                                                                                      
##  [8] "17301855,29732723,2140-06-06 14:23:00,2140-06-08 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
##  [9] "17991012,24298836,2181-07-10 20:28:00,2181-07-12 15:49:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
## [10] "16865435,23216961,2185-07-19 02:12:00,2185-07-21 11:50:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"

Import admissions.csv.gz as a tibble admissions_tble.
Let’s only keep the admissions that have a match in icustays_tble according to subject_id and hadm_id.
Summarize the following variables by graphics.

admission year
admission month
admission month day
admission week day
admission hour (anything unusual?)
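
As a rough sketch, a filtering join followed by lubridate accessors gives the date parts listed above. The toy tibbles and the adm_* column names are hypothetical; semi_join keeps only the admissions with a match and adds no columns.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical toy tables; the real ones come from the MIMIC-IV files
icustays_toy <- tibble(subject_id = c(1, 2), hadm_id = c(101, 201))
admissions_toy <- tibble(
  subject_id = c(1, 1, 2, 3),
  hadm_id    = c(101, 102, 201, 301),
  admittime  = ymd_hms(c(
    "2150-01-05 00:00:00", "2150-06-01 14:30:00",
    "2151-03-02 07:15:00", "2152-07-01 23:45:00"
  ))
)

# Keep matching admissions, then derive the calendar components
admissions_kept <- admissions_toy %>%
  semi_join(icustays_toy, by = c("subject_id", "hadm_id")) %>%
  mutate(
    adm_year  = year(admittime),
    adm_month = month(admittime, label = TRUE),
    adm_mday  = mday(admittime),
    adm_wday  = wday(admittime, label = TRUE),
    adm_hour  = hour(admittime)
  )

# Bar chart of admission hour; the other variables can be plotted the same way
p_hour <- ggplot(admissions_kept, aes(x = adm_hour)) +
  geom_bar() +
  labs(x = "Admission hour", y = "Count")
```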

Q4. `patients` data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/core/patients/ for details of each field in this file. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/core/patients.csv.gz")), 
    " | head"
    ), 
  intern = TRUE
)

##  [1] "subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod"
##  [2] "10000048,F,23,2126,2008 - 2010,"                               
##  [3] "10002723,F,0,2128,2017 - 2019,"                                
##  [4] "10003939,M,0,2184,2008 - 2010,"                                
##  [5] "10004222,M,0,2161,2014 - 2016,"                                
##  [6] "10005325,F,0,2154,2011 - 2013,"                                
##  [7] "10007338,F,0,2153,2017 - 2019,"                                
##  [8] "10008101,M,0,2142,2008 - 2010,"                                
##  [9] "10009872,F,0,2168,2014 - 2016,"                                
## [10] "10011333,F,0,2132,2014 - 2016,"

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/core/patients/) as a tibble patients_tble and only keep the patients who have a match in icustays_tble (according to subject_id).
Summarize variables gender and anchor_age, and explain any patterns you see.
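
A minimal sketch of numeric and graphical summaries for gender and anchor_age follows. The toy values are invented for illustration and do not reflect the real distribution.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy stand-ins for patients_tble and icustays_tble
patients_toy <- tibble(
  subject_id = c(1, 2, 3, 4, 5, 6),
  gender     = c("F", "M", "F", "M", "F", "M"),
  anchor_age = c(23, 0, 67, 91, 45, 58)
)
icustays_toy <- tibble(subject_id = c(1, 3, 4, 6))

# Keep only patients with an ICU stay
patients_kept <- semi_join(patients_toy, icustays_toy, by = "subject_id")

# Counts by gender and age summaries within gender
gender_counts <- count(patients_kept, gender)
age_summary <- patients_kept %>%
  group_by(gender) %>%
  summarise(
    n          = n(),
    median_age = median(anchor_age),
    min_age    = min(anchor_age),
    max_age    = max(anchor_age)
  )

# Age distribution, colored by gender
p_age <- ggplot(patients_kept, aes(x = anchor_age, fill = gender)) +
  geom_histogram(binwidth = 5) +
  labs(x = "anchor_age", y = "Count")
```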

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/hosp/labevents.csv.gz")), 
    " | head"
    ), 
  intern = TRUE
)

##  [1] "labevent_id,subject_id,hadm_id,specimen_id,itemid,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments"
##  [2] "670,10000048,,6448755,51484,2126-11-22 19:20:00,2126-11-22 20:07:00,150,150,mg/dL,,,,STAT,"                                                          
##  [3] "673,10000048,,6448755,51491,2126-11-22 19:20:00,2126-11-22 20:07:00,6.5,6.5,units,5,8,,STAT,"                                                        
##  [4] "675,10000048,,6448755,51498,2126-11-22 19:20:00,2126-11-22 20:07:00,1.029,1.029, ,1.001,1.035,,STAT,"                                                
##  [5] "683,10000048,,82729055,50861,2126-11-22 20:45:00,2126-11-23 00:55:00,39,39,IU/L,0,40,,STAT,"                                                         
##  [6] "684,10000048,,82729055,50862,2126-11-22 20:45:00,2126-11-23 00:55:00,4.7,4.7,g/dL,3.4,4.8,,STAT,"                                                    
##  [7] "685,10000048,,82729055,50863,2126-11-22 20:45:00,2126-11-23 00:55:00,45,45,IU/L,39,117,,STAT,"                                                       
##  [8] "686,10000048,,82729055,50868,2126-11-22 20:45:00,2126-11-22 21:32:00,13,13,mEq/L,8,20,,STAT,"                                                        
##  [9] "687,10000048,,82729055,50878,2126-11-22 20:45:00,2126-11-23 00:55:00,28,28,IU/L,0,40,,STAT,"                                                         
## [10] "688,10000048,,82729055,50882,2126-11-22 20:45:00,2126-11-22 21:32:00,26,26,mEq/L,22,32,,STAT,"

d_labitems.csv.gz is the dictionary of lab measurements.

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/hosp/d_labitems.csv.gz")), 
    " | head"
    ), 
  intern = TRUE
)

##  [1] "itemid,label,fluid,category,loinc_code"           
##  [2] "51905, ,Other Body Fluid,Chemistry,"              
##  [3] "51532,11-Deoxycorticosterone,Blood,Chemistry,"    
##  [4] "51957,17-Hydroxycorticosteroids,Urine,Chemistry," 
##  [5] "51958,\"17-Ketosteroids, Urine\",Urine,Chemistry,"
##  [6] "52068,24 Hr,Blood,Hematology,"                    
##  [7] "51066,24 hr Calcium,Urine,Chemistry,"             
##  [8] "51067,24 hr Creatinine,Urine,Chemistry,"          
##  [9] "51068,24 hr Protein,Urine,Chemistry,"             
## [10] "50853,25-OH Vitamin D,Blood,Chemistry,"

Find how many rows are in labevents.csv.gz.
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), and calcium (50893). Retrieve a subset of labevents.csv.gz only containing these items for the patients in icustays_tble as a tibble labevents_tble.

Hint: labevents.csv.gz is a data file too big to be read in by the read_csv function in its default setting. Utilize the col_select and lazy options in the read_csv function to reduce the memory burden.
Further restrict labevents_tble to the first lab measurement during the ICU stay.
Summarize the lab measurements by appropriate numerics and graphics.
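
The column-selection hint and the "first measurement during the ICU stay" step might look roughly like this. Everything here is a sketch on made-up data: the temporary csv file stands in for labevents.csv.gz, icu_toy stands in for icustays_tble, and the item_ column prefix is an arbitrary choice.

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical toy stand-in for labevents.csv.gz, written to a temp file
lab_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    labevent_id = 1:6,
    subject_id  = c(1, 1, 1, 2, 2, 3),
    itemid      = c(50912, 50912, 50971, 50912, 51484, 50912),
    charttime   = c("2150-01-01 09:00:00", "2150-01-01 07:00:00",
                    "2150-01-01 08:30:00", "2151-03-02 13:00:00",
                    "2151-03-02 13:00:00", "2149-01-01 00:00:00"),
    valuenum    = c(1.2, 0.9, 4.1, 2.0, 150, 1.0)
  ),
  lab_path
)

# Hypothetical first ICU stays with their time windows
icu_toy <- tibble(
  subject_id = c(1, 2),
  intime  = as.POSIXct(c("2150-01-01 06:00:00", "2151-03-02 12:00:00"), tz = "UTC"),
  outtime = as.POSIXct(c("2150-01-03 06:00:00", "2151-03-04 12:00:00"), tz = "UTC")
)

lab_items <- c(50912, 50971)  # creatinine, potassium

# Read only the needed columns, lazily
labs <- read_csv(
  lab_path,
  col_select = c(subject_id, itemid, charttime, valuenum),
  col_types = cols(charttime = col_datetime()),
  lazy = TRUE,
  show_col_types = FALSE
)

# Keep items of interest, restrict to the ICU window, take the earliest value
first_labs <- labs %>%
  filter(itemid %in% lab_items) %>%
  inner_join(icu_toy, by = "subject_id") %>%
  filter(charttime >= intime, charttime <= outtime) %>%
  group_by(subject_id, itemid) %>%
  arrange(charttime, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(subject_id, itemid, valuenum) %>%
  pivot_wider(names_from = itemid, values_from = valuenum, names_prefix = "item_")
```

The result has one row per subject and one column per lab item, which is convenient for joining onto the cohort later. On the real file, memory use still depends on how early the filtering happens, so this pattern alone may not be sufficient.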

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/icu/chartevents.csv.gz")), 
    " | head"), 
  intern = TRUE
)

##  [1] "subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning"
##  [2] "10003700,28623837,30600691,2165-04-24 05:10:00,2165-04-24 05:11:00,228236,0,0,,0"     
##  [3] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225067,0,0,,0"     
##  [4] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225070,1,1,,0"     
##  [5] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225076,1,1,,0"     
##  [6] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225078,1,1,,0"     
##  [7] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225086,1,1,,0"     
##  [8] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225091,1,1,,0"     
##  [9] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225103,1,1,,0"     
## [10] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225106,1,1,,0"

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

system(
  str_c(
    "zcat < ", 
    shQuote(str_c(mimic_path, "/icu/d_items.csv.gz")), 
    " | head"), 
  intern = TRUE
)

##  [1] "itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue"   
##  [2] "220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,"                
##  [3] "220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,"                              
##  [4] "220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,"                 
##  [5] "220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,"                   
##  [6] "220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,"                        
##  [7] "220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140"
##  [8] "220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90"
##  [9] "220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,"         
## [10] "220056,Arterial Blood Pressure Alarm - Low,ABP Alarm - Low,chartevents,Alarms,mmHg,Numeric,,"

We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble as a tibble chartevents_tble.
Further restrict chartevents_tble to the first vital measurement during the ICU stay.
Summarize these vital measurements by appropriate numerics and graphics.
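
Once the first vital measurements are in long format, per-vital summaries and boxplots can be produced along these lines. The toy values and the labels in vital_labels are hypothetical; the item IDs are the ones listed above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical first vital measurements in long format
first_vitals_toy <- tibble(
  subject_id = rep(c(1, 2, 3, 4), times = 2),
  itemid     = rep(c(220045, 223761), each = 4),
  valuenum   = c(80, 95, 110, 72, 98.6, 99.1, 101.3, 97.9)
)

# Readable names for the item IDs (labels chosen for this example)
vital_labels <- c(`220045` = "heart_rate", `223761` = "temp_fahrenheit")

vitals_long <- first_vitals_toy %>%
  mutate(vital = unname(vital_labels[as.character(itemid)]))

# Numeric summary per vital
vitals_summary <- vitals_long %>%
  group_by(vital) %>%
  summarise(
    n      = n(),
    median = median(valuenum),
    q1     = quantile(valuenum, 0.25, names = FALSE),
    q3     = quantile(valuenum, 0.75, names = FALSE)
  )

# One boxplot panel per vital, each with its own scale
p_vitals <- ggplot(vitals_long, aes(x = vital, y = valuenum)) +
  geom_boxplot() +
  facet_wrap(~ vital, scales = "free")
```

Free scales are used because the vitals are measured in different units.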

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are

first ICU stay of each unique adult (age at admission > 18)

and columns contain at least the following variables

all variables in icustays.csv.gz
all variables in admissions.csv.gz
all variables in patients.csv.gz
first lab measurements during ICU stay
first vital measurements during ICU stay
an indicator variable thirty_day_mort whether the patient died within 30 days of hospital admission (30 day mortality)
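
The assembly can be sketched as a chain of joins followed by derived columns. Two assumptions are made here and are not stated in the assignment: age at admission is approximated as anchor_age + year(admittime) - anchor_year, and death is taken from the deathtime column of the admissions table. All tables and values below are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy tables
icustays_first <- tibble(
  subject_id = c(1, 2, 3),
  hadm_id    = c(101, 201, 301),
  stay_id    = c(11, 21, 31)
)
admissions_toy <- tibble(
  subject_id = c(1, 2, 3),
  hadm_id    = c(101, 201, 301),
  admittime  = ymd_hms(c("2150-01-01 08:00:00", "2151-03-02 12:00:00",
                         "2152-07-01 09:00:00")),
  deathtime  = ymd_hms(c("2150-01-20 10:00:00", NA, "2152-09-15 09:00:00"))
)
patients_toy <- tibble(
  subject_id  = c(1, 2, 3),
  anchor_age  = c(60, 17, 45),
  anchor_year = c(2148, 2151, 2150)
)

mimic_icu_cohort_toy <- icustays_first %>%
  left_join(admissions_toy, by = c("subject_id", "hadm_id")) %>%
  left_join(patients_toy, by = "subject_id") %>%
  mutate(
    # Assumption: shift anchor_age by the years elapsed since anchor_year
    age_at_adm = anchor_age + year(admittime) - anchor_year,
    # Assumption: death within 30 days of admittime, based on deathtime
    thirty_day_mort = !is.na(deathtime) & deathtime <= admittime + days(30)
  ) %>%
  filter(age_at_adm > 18)
```

The first lab and vital measurements, once reshaped to one row per subject, could be attached with further left_join calls on subject_id.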

Q8. Exploratory data analysis (EDA)

Summarize the following information using appropriate numerics or graphs.

thirty_day_mort vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission)
thirty_day_mort vs first lab measurements
thirty_day_mort vs first vital measurements
thirty_day_mort vs first ICU unit
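
For the comparisons listed above, a categorical variable can be summarized by the mortality rate within each level, and a numeric variable by its distribution within each outcome group. The sketch below uses a hypothetical cohort_toy with invented values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy cohort
cohort_toy <- tibble(
  insurance       = c("Medicare", "Medicare", "Other", "Other", "Other", "Medicaid"),
  age_at_adm      = c(80, 72, 45, 60, 33, 51),
  thirty_day_mort = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Categorical predictor: mortality rate per level
mort_by_insurance <- cohort_toy %>%
  group_by(insurance) %>%
  summarise(n = n(), mort_rate = mean(thirty_day_mort))

# Numeric predictor: median within each outcome group
age_by_mort <- cohort_toy %>%
  group_by(thirty_day_mort) %>%
  summarise(n = n(), median_age = median(age_at_adm))

# Proportion of deaths within each insurance type
p_insurance <- ggplot(cohort_toy, aes(x = insurance, fill = thirty_day_mort)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

# Age distribution by outcome
p_age <- ggplot(cohort_toy, aes(x = thirty_day_mort, y = age_at_adm)) +
  geom_boxplot()
```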

Biostat 203B Homework 2

Due Feb 6 @ 11:59PM

Your Name
