Biostat 203B Homework 2

Due Feb 10 Feb 15 @ 11:59PM

Author

YOUR NAME and UID

Display machine information for reproducibility:

sessionInfo()

Load necessary libraries (you can add more as needed).

library(data.table)
library(lubridate)
library(R.utils)
library(tidyverse)

MIMIC data location

mimic_path <- "~/mimic"

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Display the contents of MIMIC data folder.

system(str_c("ls -l ", mimic_path, "/"), intern = TRUE)
[1] "total 24"                                                                 
[2] "-rw-r--r--@  1 huazhou  staff  2518 Mar 10  2021 LICENSE.txt"             
[3] "drwxr-xr-x@  7 huazhou  staff   224 Jul 17  2022 MIMIC-reduce-chartevents"
[4] "-rw-r--r--@  1 huazhou  staff  2459 Mar 16  2021 SHA256SUMS.txt"          
[5] "drwxr-xr-x   6 huazhou  staff   192 Jan  4  2022 core"                    
[6] "drwxr-xr-x  21 huazhou  staff   672 Feb  3  2022 hosp"                    
[7] "drwxr-xr-x  11 huazhou  staff   352 Feb  3  2022 icu"                     
[8] "-rw-r--r--@  1 huazhou  staff   797 Jan  4  2022 index.html"              
system(str_c("ls -l ", mimic_path, "/core"), intern = TRUE)
[1] "total 142944"                                                         
[2] "-rw-r--r--@ 1 huazhou  staff  17208966 Mar 15  2021 admissions.csv.gz"
[3] "-rw-r--r--@ 1 huazhou  staff       606 Jan  4  2022 index.html"       
[4] "-rw-r--r--@ 1 huazhou  staff   2955582 Mar 15  2021 patients.csv.gz"  
[5] "-rw-r--r--@ 1 huazhou  staff  53014503 Mar 15  2021 transfers.csv.gz" 
system(str_c("ls -l ", mimic_path, "/hosp"), intern = TRUE)
 [1] "total 9243184"                                                                         
 [2] "-rw-r--r--@ 1 huazhou  staff      430049 Mar 15  2021 d_hcpcs.csv.gz"                  
 [3] "-rw-r--r--@ 1 huazhou  staff      863239 Mar 15  2021 d_icd_diagnoses.csv.gz"          
 [4] "-rw-r--r--@ 1 huazhou  staff      579998 Mar 15  2021 d_icd_procedures.csv.gz"         
 [5] "-rw-r--r--@ 1 huazhou  staff       14898 Mar 15  2021 d_labitems.csv.gz"               
 [6] "-rw-r--r--@ 1 huazhou  staff    29531152 Mar 15  2021 diagnoses_icd.csv.gz"            
 [7] "-rw-r--r--@ 1 huazhou  staff    11684062 Mar 15  2021 drgcodes.csv.gz"                 
 [8] "-rw-r--r--@ 1 huazhou  staff   515763427 Mar 15  2021 emar.csv.gz"                     
 [9] "-rw-r--r--@ 1 huazhou  staff   476252563 Mar 15  2021 emar_detail.csv.gz"              
[10] "-rw-r--r--@ 1 huazhou  staff     2098831 Mar 15  2021 hcpcsevents.csv.gz"              
[11] "-rw-r--r--@ 1 huazhou  staff        2325 Jan  4  2022 index.html"                      
[12] "-rw-r--r--@ 1 huazhou  staff  2091865786 Mar 15  2021 labevents.csv.gz"                
[13] "-rw-r--r--  1 huazhou  staff   171624288 Jan 27  2022 labevents_filtered_itemid.csv.gz"
[14] "-rw-r--r--@ 1 huazhou  staff    99133381 Mar 15  2021 microbiologyevents.csv.gz"       
[15] "-rw-r--r--@ 1 huazhou  staff   422874088 Mar 15  2021 pharmacy.csv.gz"                 
[16] "-rw-r--r--@ 1 huazhou  staff   501381155 Mar 15  2021 poe.csv.gz"                      
[17] "-rw-r--r--@ 1 huazhou  staff    24020923 Mar 15  2021 poe_detail.csv.gz"               
[18] "-rw-r--r--@ 1 huazhou  staff   367041717 Mar 15  2021 prescriptions.csv.gz"            
[19] "-rw-r--r--@ 1 huazhou  staff     7750325 Mar 15  2021 procedures_icd.csv.gz"           
[20] "-rw-r--r--@ 1 huazhou  staff     9565293 Mar 15  2021 services.csv.gz"                 
system(str_c("ls -l ", mimic_path, "/icu"), intern = TRUE)
 [1] "total 5698008"                                                                           
 [2] "-rw-r--r--@ 1 huazhou  staff  2350783547 Mar 15  2021 chartevents.csv.gz"                
 [3] "-rw-r--r--  1 huazhou  staff   110272408 Jan 27  2022 chartevents_filtered_itemid.csv.gz"
 [4] "-rw-r--r--@ 1 huazhou  staff       55917 Mar 15  2021 d_items.csv.gz"                    
 [5] "-rw-r--r--@ 1 huazhou  staff    43296273 Mar 15  2021 datetimeevents.csv.gz"             
 [6] "-rw-r--r--@ 1 huazhou  staff     2848628 Mar 15  2021 icustays.csv.gz"                   
 [7] "-rw-r--r--@ 1 huazhou  staff        1103 Jan  4  2022 index.html"                        
 [8] "-rw-r--r--@ 1 huazhou  staff   352443512 Mar 15  2021 inputevents.csv.gz"                
 [9] "-rw-r--r--@ 1 huazhou  staff    37095672 Mar 15  2021 outputevents.csv.gz"               
[10] "-rw-r--r--@ 1 huazhou  staff    20567368 Mar 15  2021 procedureevents.csv.gz"            

1 Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three programs: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? (Hint: R function system.time measures run times.)

For later questions, we stick to the read_csv in tidyverse.

2 Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/icu/icustays.csv.gz"), 
    " | head"
    ), 
  intern = TRUE
)
 [1] "subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los"                                                    
 [2] "17867402,24528534,31793211,Trauma SICU (TSICU),Trauma SICU (TSICU),2154-03-03 04:11:00,2154-03-04 18:16:56,1.5874537037037035" 
 [3] "14435996,28960964,31983544,Trauma SICU (TSICU),Trauma SICU (TSICU),2150-06-19 17:57:00,2150-06-22 18:33:54,3.025625"           
 [4] "17609946,27385897,33183475,Trauma SICU (TSICU),Trauma SICU (TSICU),2138-02-05 18:54:00,2138-02-15 12:42:05,9.741724537037038"  
 [5] "18966770,23483021,34131444,Trauma SICU (TSICU),Trauma SICU (TSICU),2123-10-25 10:35:00,2123-10-25 18:59:47,0.35054398148148147"
 [6] "12776735,20817525,34547665,Neuro Stepdown,Neuro Stepdown,2200-07-12 00:33:00,2200-07-13 16:44:40,1.6747685185185184"           
 [7] "10215159,24283593,34569476,Trauma SICU (TSICU),Trauma SICU (TSICU),2124-09-20 15:05:29,2124-09-21 22:06:58,1.2926967592592593" 
 [8] "14489052,26516390,35056286,Trauma SICU (TSICU),Trauma SICU (TSICU),2118-10-26 10:33:56,2118-10-26 20:28:10,0.4126620370370371" 
 [9] "15914763,28906020,36909804,Trauma SICU (TSICU),Trauma SICU (TSICU),2176-12-14 12:00:00,2176-12-17 11:47:01,2.9909837962962964" 
[10] "16256226,20013290,39289362,Neuro Stepdown,Neuro Stepdown,2150-12-20 16:09:08,2150-12-21 14:58:40,0.9510648148148149"           
  1. Import icustatys.csv.gz as a tibble icustays_tble.

  2. How many unique subject_id? Can a subject_id have multiple ICU stays?

  3. Summarize the number of ICU stays per subject_id by graphs.

  4. For each subject_id, let’s only keep the first ICU stay in the tibble icustays_tble. (Hint: slice_min and slice_max may take long. Think alternative ways to achieve the same function.)

3 Q3. admission data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/core/admissions.csv.gz"), 
    " | head"
    ), 
  intern = TRUE
)
 [1] "subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag"
 [2] "14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0"                                                                                
 [3] "15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
 [4] "11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"                                                                                      
 [5] "17817079,24709883,2165-12-27 17:33:00,2165-12-31 21:18:00,,ELECTIVE,,HOME,Other,ENGLISH,,OTHER,,,0"                                                                                        
 [6] "15078341,23272159,2122-08-28 08:48:00,2122-08-30 12:32:00,,ELECTIVE,,HOME,Other,ENGLISH,,BLACK/AFRICAN AMERICAN,,,0"                                                                       
 [7] "19124609,20517215,2169-03-14 12:44:00,2169-03-20 19:15:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"                                                                                      
 [8] "17301855,29732723,2140-06-06 14:23:00,2140-06-08 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
 [9] "17991012,24298836,2181-07-10 20:28:00,2181-07-12 15:49:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
[10] "16865435,23216961,2185-07-19 02:12:00,2185-07-21 11:50:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"                                                                                        
  1. Import admissions.csv.gz as a tibble admissions_tble.

  2. Let’s only keep the admissions that have a match in icustays_tble according to subject_id and hadmi_id.

  3. Summarize the following variables by graphics.

    • admission year
    • admission month
    • admission month day
    • admission week day
    • admission hour (anything unusual?)
    • admission minute (anything unusual?)
    • length of hospital stay (anything unusual?)

4 Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/core/patients.csv.gz"), 
    " | head"
    ), 
  intern = TRUE
)
 [1] "subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod"
 [2] "10000048,F,23,2126,2008 - 2010,"                               
 [3] "10002723,F,0,2128,2017 - 2019,"                                
 [4] "10003939,M,0,2184,2008 - 2010,"                                
 [5] "10004222,M,0,2161,2014 - 2016,"                                
 [6] "10005325,F,0,2154,2011 - 2013,"                                
 [7] "10007338,F,0,2153,2017 - 2019,"                                
 [8] "10008101,M,0,2142,2008 - 2010,"                                
 [9] "10009872,F,0,2168,2014 - 2016,"                                
[10] "10011333,F,0,2132,2014 - 2016,"                                
  1. Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble and only keep the patients who have a match in icustays_tble (according to subject_id).

  2. Summarize variables gender and anchor_age, and explain any patterns you see.

5 Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/hosp/labevents.csv.gz"), 
    " | head"
    ), 
  intern = TRUE
)
 [1] "labevent_id,subject_id,hadm_id,specimen_id,itemid,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments"
 [2] "670,10000048,,6448755,51484,2126-11-22 19:20:00,2126-11-22 20:07:00,150,150,mg/dL,,,,STAT,"                                                          
 [3] "673,10000048,,6448755,51491,2126-11-22 19:20:00,2126-11-22 20:07:00,6.5,6.5,units,5,8,,STAT,"                                                        
 [4] "675,10000048,,6448755,51498,2126-11-22 19:20:00,2126-11-22 20:07:00,1.029,1.029, ,1.001,1.035,,STAT,"                                                
 [5] "683,10000048,,82729055,50861,2126-11-22 20:45:00,2126-11-23 00:55:00,39,39,IU/L,0,40,,STAT,"                                                         
 [6] "684,10000048,,82729055,50862,2126-11-22 20:45:00,2126-11-23 00:55:00,4.7,4.7,g/dL,3.4,4.8,,STAT,"                                                    
 [7] "685,10000048,,82729055,50863,2126-11-22 20:45:00,2126-11-23 00:55:00,45,45,IU/L,39,117,,STAT,"                                                       
 [8] "686,10000048,,82729055,50868,2126-11-22 20:45:00,2126-11-22 21:32:00,13,13,mEq/L,8,20,,STAT,"                                                        
 [9] "687,10000048,,82729055,50878,2126-11-22 20:45:00,2126-11-23 00:55:00,28,28,IU/L,0,40,,STAT,"                                                         
[10] "688,10000048,,82729055,50882,2126-11-22 20:45:00,2126-11-22 21:32:00,26,26,mEq/L,22,32,,STAT,"                                                       

d_labitems.csv.gz is the dictionary of lab measurements.

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/hosp/d_labitems.csv.gz"), 
    " | head"
    ), 
  intern = TRUE
)
 [1] "itemid,label,fluid,category,loinc_code"           
 [2] "51905, ,Other Body Fluid,Chemistry,"              
 [3] "51532,11-Deoxycorticosterone,Blood,Chemistry,"    
 [4] "51957,17-Hydroxycorticosteroids,Urine,Chemistry," 
 [5] "51958,\"17-Ketosteroids, Urine\",Urine,Chemistry,"
 [6] "52068,24 Hr,Blood,Hematology,"                    
 [7] "51066,24 hr Calcium,Urine,Chemistry,"             
 [8] "51067,24 hr Creatinine,Urine,Chemistry,"          
 [9] "51068,24 hr Protein,Urine,Chemistry,"             
[10] "50853,25-OH Vitamin D,Blood,Chemistry,"           
  1. Find how many rows are in labevents.csv.gz.

  2. We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz only containing these items for the patients in icustays_tble as a tibble labevents_tble.

    Hint: labevents.csv.gz is a data file too big to be read in by the read_csv function in its default setting. Utilize the col_select option in the read_csv function to reduce the memory burden. It took my computer 5-10 minutes to ingest this file. If your computer really has trouble importing labevents.csv.gz, you can import from the reduced data file labevents_filtered_itemid.csv.gz.

  3. Further restrict labevents_tble to the first lab measurement during the ICU stay.

  4. Summarize the lab measurements by appropriate numerics and graphics.

6 Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/icu/chartevents.csv.gz"), 
    " | head"), 
  intern = TRUE
)
 [1] "subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning"
 [2] "10003700,28623837,30600691,2165-04-24 05:10:00,2165-04-24 05:11:00,228236,0,0,,0"     
 [3] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225067,0,0,,0"     
 [4] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225070,1,1,,0"     
 [5] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225076,1,1,,0"     
 [6] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225078,1,1,,0"     
 [7] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225086,1,1,,0"     
 [8] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225091,1,1,,0"     
 [9] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225103,1,1,,0"     
[10] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225106,1,1,,0"     

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

system(
  str_c(
    "zcat < ", 
    str_c(mimic_path, "/icu/d_items.csv.gz"), 
    " | head"), 
  intern = TRUE
)
 [1] "itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue"   
 [2] "220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,"                
 [3] "220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,"                              
 [4] "220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,"                 
 [5] "220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,"                   
 [6] "220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,"                        
 [7] "220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140"
 [8] "220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90"
 [9] "220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,"         
[10] "220056,Arterial Blood Pressure Alarm - Low,ABP Alarm - Low,chartevents,Alarms,mmHg,Numeric,,"    
  1. We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble as a tibble chartevents_tble.

    Hint: chartevents.csv.gz is a data file too big to be read in by the read_csv function in its default setting. Utilize the col_select option in the read_csv function to reduce the memory burden. It took my computer >15 minutes to ingest this file. If your computer really has trouble importing chartevents.csv.gz, you can import from the reduced data file chartevents_filtered_itemid.csv.gz.

  2. Further restrict chartevents_tble to the first vital measurement during the ICU stay.

  3. Summarize these vital measurements by appropriate numerics and graphics.

7 Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are the first ICU stay of each unique adult (age at admission > 18) and columns contain at least following variables

  • all variables in icustays.csv.gz
  • all variables in admission.csv.gz
  • all variables in patients.csv.gz
  • first lab measurements during ICU stay
  • first vital measurements during ICU stay
  • an indicator variable thirty_day_mort whether the patient died within 30 days of hospital admission (30 day mortality)

8 Q8. Exploratory data analysis (EDA)

Summarize following information using appropriate numerics or graphs.

  • thirty_day_mort vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission)

  • thirty_day_mort vs first lab measurements

  • thirty_day_mort vs first vital measurements

  • thirty_day_mort vs first ICU unit