sessionInfo()
Biostat 203B Homework 2
Due Feb 10 Feb 15 @ 11:59PM
Display machine information for reproducibility:
Load necessary libraries (you can add more as needed).
library(data.table)
library(lubridate)
library(R.utils)
library(tidyverse)
MIMIC data location
<- "~/mimic" mimic_path
In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Display the contents of MIMIC data folder.
system(str_c("ls -l ", mimic_path, "/"), intern = TRUE)
[1] "total 24"
[2] "-rw-r--r--@ 1 huazhou staff 2518 Mar 10 2021 LICENSE.txt"
[3] "drwxr-xr-x@ 7 huazhou staff 224 Jul 17 2022 MIMIC-reduce-chartevents"
[4] "-rw-r--r--@ 1 huazhou staff 2459 Mar 16 2021 SHA256SUMS.txt"
[5] "drwxr-xr-x 6 huazhou staff 192 Jan 4 2022 core"
[6] "drwxr-xr-x 21 huazhou staff 672 Feb 3 2022 hosp"
[7] "drwxr-xr-x 11 huazhou staff 352 Feb 3 2022 icu"
[8] "-rw-r--r--@ 1 huazhou staff 797 Jan 4 2022 index.html"
system(str_c("ls -l ", mimic_path, "/core"), intern = TRUE)
[1] "total 142944"
[2] "-rw-r--r--@ 1 huazhou staff 17208966 Mar 15 2021 admissions.csv.gz"
[3] "-rw-r--r--@ 1 huazhou staff 606 Jan 4 2022 index.html"
[4] "-rw-r--r--@ 1 huazhou staff 2955582 Mar 15 2021 patients.csv.gz"
[5] "-rw-r--r--@ 1 huazhou staff 53014503 Mar 15 2021 transfers.csv.gz"
system(str_c("ls -l ", mimic_path, "/hosp"), intern = TRUE)
[1] "total 9243184"
[2] "-rw-r--r--@ 1 huazhou staff 430049 Mar 15 2021 d_hcpcs.csv.gz"
[3] "-rw-r--r--@ 1 huazhou staff 863239 Mar 15 2021 d_icd_diagnoses.csv.gz"
[4] "-rw-r--r--@ 1 huazhou staff 579998 Mar 15 2021 d_icd_procedures.csv.gz"
[5] "-rw-r--r--@ 1 huazhou staff 14898 Mar 15 2021 d_labitems.csv.gz"
[6] "-rw-r--r--@ 1 huazhou staff 29531152 Mar 15 2021 diagnoses_icd.csv.gz"
[7] "-rw-r--r--@ 1 huazhou staff 11684062 Mar 15 2021 drgcodes.csv.gz"
[8] "-rw-r--r--@ 1 huazhou staff 515763427 Mar 15 2021 emar.csv.gz"
[9] "-rw-r--r--@ 1 huazhou staff 476252563 Mar 15 2021 emar_detail.csv.gz"
[10] "-rw-r--r--@ 1 huazhou staff 2098831 Mar 15 2021 hcpcsevents.csv.gz"
[11] "-rw-r--r--@ 1 huazhou staff 2325 Jan 4 2022 index.html"
[12] "-rw-r--r--@ 1 huazhou staff 2091865786 Mar 15 2021 labevents.csv.gz"
[13] "-rw-r--r-- 1 huazhou staff 171624288 Jan 27 2022 labevents_filtered_itemid.csv.gz"
[14] "-rw-r--r--@ 1 huazhou staff 99133381 Mar 15 2021 microbiologyevents.csv.gz"
[15] "-rw-r--r--@ 1 huazhou staff 422874088 Mar 15 2021 pharmacy.csv.gz"
[16] "-rw-r--r--@ 1 huazhou staff 501381155 Mar 15 2021 poe.csv.gz"
[17] "-rw-r--r--@ 1 huazhou staff 24020923 Mar 15 2021 poe_detail.csv.gz"
[18] "-rw-r--r--@ 1 huazhou staff 367041717 Mar 15 2021 prescriptions.csv.gz"
[19] "-rw-r--r--@ 1 huazhou staff 7750325 Mar 15 2021 procedures_icd.csv.gz"
[20] "-rw-r--r--@ 1 huazhou staff 9565293 Mar 15 2021 services.csv.gz"
system(str_c("ls -l ", mimic_path, "/icu"), intern = TRUE)
[1] "total 5698008"
[2] "-rw-r--r--@ 1 huazhou staff 2350783547 Mar 15 2021 chartevents.csv.gz"
[3] "-rw-r--r-- 1 huazhou staff 110272408 Jan 27 2022 chartevents_filtered_itemid.csv.gz"
[4] "-rw-r--r--@ 1 huazhou staff 55917 Mar 15 2021 d_items.csv.gz"
[5] "-rw-r--r--@ 1 huazhou staff 43296273 Mar 15 2021 datetimeevents.csv.gz"
[6] "-rw-r--r--@ 1 huazhou staff 2848628 Mar 15 2021 icustays.csv.gz"
[7] "-rw-r--r--@ 1 huazhou staff 1103 Jan 4 2022 index.html"
[8] "-rw-r--r--@ 1 huazhou staff 352443512 Mar 15 2021 inputevents.csv.gz"
[9] "-rw-r--r--@ 1 huazhou staff 37095672 Mar 15 2021 outputevents.csv.gz"
[10] "-rw-r--r--@ 1 huazhou staff 20567368 Mar 15 2021 procedureevents.csv.gz"
1 Q1. read.csv
(base R) vs read_csv
(tidyverse) vs fread
(data.table)
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz
, by three programs: read.csv
in base R, read_csv
in tidyverse, and fread
in the popular data.table package.
Which function is fastest? Is there difference in the (default) parsed data types? (Hint: R function system.time
measures run times.)
For later questions, we stick to the read_csv
in tidyverse.
2 Q2. ICU stays
icustays.csv.gz
(https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are
system(
str_c(
"zcat < ",
str_c(mimic_path, "/icu/icustays.csv.gz"),
" | head"
), intern = TRUE
)
[1] "subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los"
[2] "17867402,24528534,31793211,Trauma SICU (TSICU),Trauma SICU (TSICU),2154-03-03 04:11:00,2154-03-04 18:16:56,1.5874537037037035"
[3] "14435996,28960964,31983544,Trauma SICU (TSICU),Trauma SICU (TSICU),2150-06-19 17:57:00,2150-06-22 18:33:54,3.025625"
[4] "17609946,27385897,33183475,Trauma SICU (TSICU),Trauma SICU (TSICU),2138-02-05 18:54:00,2138-02-15 12:42:05,9.741724537037038"
[5] "18966770,23483021,34131444,Trauma SICU (TSICU),Trauma SICU (TSICU),2123-10-25 10:35:00,2123-10-25 18:59:47,0.35054398148148147"
[6] "12776735,20817525,34547665,Neuro Stepdown,Neuro Stepdown,2200-07-12 00:33:00,2200-07-13 16:44:40,1.6747685185185184"
[7] "10215159,24283593,34569476,Trauma SICU (TSICU),Trauma SICU (TSICU),2124-09-20 15:05:29,2124-09-21 22:06:58,1.2926967592592593"
[8] "14489052,26516390,35056286,Trauma SICU (TSICU),Trauma SICU (TSICU),2118-10-26 10:33:56,2118-10-26 20:28:10,0.4126620370370371"
[9] "15914763,28906020,36909804,Trauma SICU (TSICU),Trauma SICU (TSICU),2176-12-14 12:00:00,2176-12-17 11:47:01,2.9909837962962964"
[10] "16256226,20013290,39289362,Neuro Stepdown,Neuro Stepdown,2150-12-20 16:09:08,2150-12-21 14:58:40,0.9510648148148149"
Import
icustatys.csv.gz
as a tibbleicustays_tble
.How many unique
subject_id
? Can asubject_id
have multiple ICU stays?Summarize the number of ICU stays per
subject_id
by graphs.For each
subject_id
, let’s only keep the first ICU stay in the tibbleicustays_tble
. (Hint:slice_min
andslice_max
may take long. Think alternative ways to achieve the same function.)
3 Q3. admission
data
Information of the patients admitted into hospital is available in admissions.csv.gz
. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are
system(
str_c(
"zcat < ",
str_c(mimic_path, "/core/admissions.csv.gz"),
" | head"
), intern = TRUE
)
[1] "subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag"
[2] "14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0"
[3] "15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
[4] "11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"
[5] "17817079,24709883,2165-12-27 17:33:00,2165-12-31 21:18:00,,ELECTIVE,,HOME,Other,ENGLISH,,OTHER,,,0"
[6] "15078341,23272159,2122-08-28 08:48:00,2122-08-30 12:32:00,,ELECTIVE,,HOME,Other,ENGLISH,,BLACK/AFRICAN AMERICAN,,,0"
[7] "19124609,20517215,2169-03-14 12:44:00,2169-03-20 19:15:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"
[8] "17301855,29732723,2140-06-06 14:23:00,2140-06-08 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
[9] "17991012,24298836,2181-07-10 20:28:00,2181-07-12 15:49:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
[10] "16865435,23216961,2185-07-19 02:12:00,2185-07-21 11:50:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
Import
admissions.csv.gz
as a tibbleadmissions_tble
.Let’s only keep the admissions that have a match in
icustays_tble
according tosubject_id
andhadmi_id
.Summarize the following variables by graphics.
- admission year
- admission month
- admission month day
- admission week day
- admission hour (anything unusual?)
- admission minute (anything unusual?)
- length of hospital stay (anything unusual?)
- admission year
4 Q4. patients
data
Patient information is available in patients.csv.gz
. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are
system(
str_c(
"zcat < ",
str_c(mimic_path, "/core/patients.csv.gz"),
" | head"
), intern = TRUE
)
[1] "subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod"
[2] "10000048,F,23,2126,2008 - 2010,"
[3] "10002723,F,0,2128,2017 - 2019,"
[4] "10003939,M,0,2184,2008 - 2010,"
[5] "10004222,M,0,2161,2014 - 2016,"
[6] "10005325,F,0,2154,2011 - 2013,"
[7] "10007338,F,0,2153,2017 - 2019,"
[8] "10008101,M,0,2142,2008 - 2010,"
[9] "10009872,F,0,2168,2014 - 2016,"
[10] "10011333,F,0,2132,2014 - 2016,"
Import
patients.csv.gz
(https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibblepatients_tble
and only keep the patients who have a match inicustays_tble
(according tosubject_id
).Summarize variables
gender
andanchor_age
, and explain any patterns you see.
5 Q5. Lab results
labevents.csv.gz
(https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are
system(
str_c(
"zcat < ",
str_c(mimic_path, "/hosp/labevents.csv.gz"),
" | head"
), intern = TRUE
)
[1] "labevent_id,subject_id,hadm_id,specimen_id,itemid,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments"
[2] "670,10000048,,6448755,51484,2126-11-22 19:20:00,2126-11-22 20:07:00,150,150,mg/dL,,,,STAT,"
[3] "673,10000048,,6448755,51491,2126-11-22 19:20:00,2126-11-22 20:07:00,6.5,6.5,units,5,8,,STAT,"
[4] "675,10000048,,6448755,51498,2126-11-22 19:20:00,2126-11-22 20:07:00,1.029,1.029, ,1.001,1.035,,STAT,"
[5] "683,10000048,,82729055,50861,2126-11-22 20:45:00,2126-11-23 00:55:00,39,39,IU/L,0,40,,STAT,"
[6] "684,10000048,,82729055,50862,2126-11-22 20:45:00,2126-11-23 00:55:00,4.7,4.7,g/dL,3.4,4.8,,STAT,"
[7] "685,10000048,,82729055,50863,2126-11-22 20:45:00,2126-11-23 00:55:00,45,45,IU/L,39,117,,STAT,"
[8] "686,10000048,,82729055,50868,2126-11-22 20:45:00,2126-11-22 21:32:00,13,13,mEq/L,8,20,,STAT,"
[9] "687,10000048,,82729055,50878,2126-11-22 20:45:00,2126-11-23 00:55:00,28,28,IU/L,0,40,,STAT,"
[10] "688,10000048,,82729055,50882,2126-11-22 20:45:00,2126-11-22 21:32:00,26,26,mEq/L,22,32,,STAT,"
d_labitems.csv.gz
is the dictionary of lab measurements.
system(
str_c(
"zcat < ",
str_c(mimic_path, "/hosp/d_labitems.csv.gz"),
" | head"
), intern = TRUE
)
[1] "itemid,label,fluid,category,loinc_code"
[2] "51905, ,Other Body Fluid,Chemistry,"
[3] "51532,11-Deoxycorticosterone,Blood,Chemistry,"
[4] "51957,17-Hydroxycorticosteroids,Urine,Chemistry,"
[5] "51958,\"17-Ketosteroids, Urine\",Urine,Chemistry,"
[6] "52068,24 Hr,Blood,Hematology,"
[7] "51066,24 hr Calcium,Urine,Chemistry,"
[8] "51067,24 hr Creatinine,Urine,Chemistry,"
[9] "51068,24 hr Protein,Urine,Chemistry,"
[10] "50853,25-OH Vitamin D,Blood,Chemistry,"
Find how many rows are in
labevents.csv.gz
.We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of
labevents.csv.gz
only containing these items for the patients inicustays_tble
as a tibblelabevents_tble
.Hint:
labevents.csv.gz
is a data file too big to be read in by theread_csv
function in its default setting. Utilize thecol_select
option in theread_csv
function to reduce the memory burden. It took my computer 5-10 minutes to ingest this file. If your computer really has trouble importinglabevents.csv.gz
, you can import from the reduced data filelabevents_filtered_itemid.csv.gz
.Further restrict
labevents_tble
to the first lab measurement during the ICU stay.Summarize the lab measurements by appropriate numerics and graphics.
6 Q6. Vitals from charted events
chartevents.csv.gz
(https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid
variable indicates a single measurement type in the database. The value
variable is the value measured for itemid
. The first 10 lines of chartevents.csv.gz
are
system(
str_c(
"zcat < ",
str_c(mimic_path, "/icu/chartevents.csv.gz"),
" | head"),
intern = TRUE
)
[1] "subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning"
[2] "10003700,28623837,30600691,2165-04-24 05:10:00,2165-04-24 05:11:00,228236,0,0,,0"
[3] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225067,0,0,,0"
[4] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225070,1,1,,0"
[5] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225076,1,1,,0"
[6] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225078,1,1,,0"
[7] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225086,1,1,,0"
[8] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225091,1,1,,0"
[9] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225103,1,1,,0"
[10] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225106,1,1,,0"
d_items.csv.gz
(https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid
in chartevents.csv.gz
.
system(
str_c(
"zcat < ",
str_c(mimic_path, "/icu/d_items.csv.gz"),
" | head"),
intern = TRUE
)
[1] "itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue"
[2] "220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,"
[3] "220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,"
[4] "220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,"
[5] "220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,"
[6] "220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,"
[7] "220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140"
[8] "220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90"
[9] "220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,"
[10] "220056,Arterial Blood Pressure Alarm - Low,ABP Alarm - Low,chartevents,Alarms,mmHg,Numeric,,"
We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of
chartevents.csv.gz
only containing these items for the patients inicustays_tble
as a tibblechartevents_tble
.Hint:
chartevents.csv.gz
is a data file too big to be read in by theread_csv
function in its default setting. Utilize thecol_select
option in theread_csv
function to reduce the memory burden. It took my computer >15 minutes to ingest this file. If your computer really has trouble importingchartevents.csv.gz
, you can import from the reduced data filechartevents_filtered_itemid.csv.gz
.Further restrict
chartevents_tble
to the first vital measurement during the ICU stay.Summarize these vital measurements by appropriate numerics and graphics.
7 Q7. Putting things together
Let us create a tibble mimic_icu_cohort
for all ICU stays, where rows are the first ICU stay of each unique adult (age at admission > 18) and columns contain at least following variables
- all variables in
icustays.csv.gz
- all variables in
admission.csv.gz
- all variables in
patients.csv.gz
- first lab measurements during ICU stay
- first vital measurements during ICU stay
- an indicator variable
thirty_day_mort
whether the patient died within 30 days of hospital admission (30 day mortality)
8 Q8. Exploratory data analysis (EDA)
Summarize following information using appropriate numerics or graphs.
thirty_day_mort
vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission)thirty_day_mort
vs first lab measurementsthirty_day_mort
vs first vital measurementsthirty_day_mort
vs first ICU unit