Display machine information for reproducibility:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 rlang_0.4.12 stringi_1.7.6 jquerylib_0.1.4
## [9] bslib_0.3.1 rmarkdown_2.11 tools_3.6.0 stringr_1.4.0
## [13] xfun_0.29 yaml_2.2.1 fastmap_1.1.0 compiler_3.6.0
## [17] htmltools_0.5.2 knitr_1.37 sass_0.4.0
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE)
library(tidyverse)
library(data.table)
library(lubridate)
os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/mnt/mimiciv/1.0"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0"
}
In this exercise, we use tidyverse (ggplot2, dplyr, etc.) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
# tree -s -L 2 /Users/huazhou/Documents/Box\ Sync/MIMIC/mimic-iv-1.0
system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)
## [1] "/mnt/mimiciv/1.0"
## [2] "├── [ 4096] core"
## [3] "│ ├── [ 17208966] admissions.csv.gz"
## [4] "│ ├── [ 606] index.html"
## [5] "│ ├── [ 2955582] patients.csv.gz"
## [6] "│ └── [ 53014503] transfers.csv.gz"
## [7] "├── [ 4096] hosp"
## [8] "│ ├── [ 430049] d_hcpcs.csv.gz"
## [9] "│ ├── [ 29531152] diagnoses_icd.csv.gz"
## [10] "│ ├── [ 863239] d_icd_diagnoses.csv.gz"
## [11] "│ ├── [ 579998] d_icd_procedures.csv.gz"
## [12] "│ ├── [ 14898] d_labitems.csv.gz"
## [13] "│ ├── [ 11684062] drgcodes.csv.gz"
## [14] "│ ├── [ 515763427] emar.csv.gz"
## [15] "│ ├── [ 476252563] emar_detail.csv.gz"
## [16] "│ ├── [ 2098831] hcpcsevents.csv.gz"
## [17] "│ ├── [ 2325] index.html"
## [18] "│ ├── [ 2091865786] labevents.csv.gz"
## [19] "│ ├── [ 99133381] microbiologyevents.csv.gz"
## [20] "│ ├── [ 422874088] pharmacy.csv.gz"
## [21] "│ ├── [ 501381155] poe.csv.gz"
## [22] "│ ├── [ 24020923] poe_detail.csv.gz"
## [23] "│ ├── [ 367041717] prescriptions.csv.gz"
## [24] "│ ├── [ 7750325] procedures_icd.csv.gz"
## [25] "│ └── [ 9565293] services.csv.gz"
## [26] "├── [ 4096] icu"
## [27] "│ ├── [ 2350783547] chartevents.csv.gz"
## [28] "│ ├── [ 43296273] datetimeevents.csv.gz"
## [29] "│ ├── [ 55917] d_items.csv.gz"
## [30] "│ ├── [ 2848628] icustays.csv.gz"
## [31] "│ ├── [ 1103] index.html"
## [32] "│ ├── [ 352443512] inputevents.csv.gz"
## [33] "│ ├── [ 37095672] outputevents.csv.gz"
## [34] "│ └── [ 20567368] procedureevents.csv.gz"
## [35] "├── [ 797] index.html"
## [36] "├── [ 2518] LICENSE.txt"
## [37] "└── [ 2459] SHA256SUMS.txt"
## [38] ""
## [39] "3 directories, 33 files"
read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate-sized compressed csv file, admissions.csv.gz, by three programs: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package.
Which function is fastest? Is there any difference in the (default) parsed data types? (Hint: R function system.time measures run times.)
For later questions, we stick to the tidyverse.
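One way to set up the timing comparison is sketched here. It is only an illustration: the file is a small made-up stand-in for admissions.csv.gz (the names toy and toy_file are hypothetical), and it assumes the R.utils package is installed, which fread needs for .gz input.

```r
library(readr)
library(data.table)

# Made-up stand-in for admissions.csv.gz (hypothetical data, not MIMIC)
toy <- data.frame(
  subject_id = 1:1000,
  admittime  = format(
    as.POSIXct("2150-01-01 00:00:00", tz = "UTC") + 3600 * (1:1000),
    "%Y-%m-%d %H:%M:%S"
  ),
  insurance  = rep(c("Medicare", "Other"), 500)
)
toy_file <- tempfile(fileext = ".csv.gz")
con <- gzfile(toy_file, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Time each reader on the same compressed file
t_base  <- system.time(df_base  <- read.csv(toy_file))
t_readr <- system.time(df_readr <- read_csv(toy_file, show_col_types = FALSE))
t_dt    <- system.time(df_dt    <- fread(toy_file))  # assumes R.utils is installed

# Elapsed seconds per reader
timings <- rbind(read.csv = t_base, read_csv = t_readr, fread = t_dt)[, "elapsed"]
timings

# First class of every column, one column of the matrix per reader
types <- sapply(
  list(read.csv = df_base, read_csv = df_readr, fread = df_dt),
  function(d) sapply(d, function(col) class(col)[1])
)
types
```

On the real file, replace toy_file with the path built from mimic_path. Timings on a file this small are mostly noise, so the ranking only becomes meaningful at the size of admissions.csv.gz.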
icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Unit (ICU) stays. The first 10 lines are
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/icu/icustays.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los"
## [2] "17867402,24528534,31793211,Trauma SICU (TSICU),Trauma SICU (TSICU),2154-03-03 04:11:00,2154-03-04 18:16:56,1.5874537037037035"
## [3] "14435996,28960964,31983544,Trauma SICU (TSICU),Trauma SICU (TSICU),2150-06-19 17:57:00,2150-06-22 18:33:54,3.025625"
## [4] "17609946,27385897,33183475,Trauma SICU (TSICU),Trauma SICU (TSICU),2138-02-05 18:54:00,2138-02-15 12:42:05,9.741724537037038"
## [5] "18966770,23483021,34131444,Trauma SICU (TSICU),Trauma SICU (TSICU),2123-10-25 10:35:00,2123-10-25 18:59:47,0.35054398148148147"
## [6] "12776735,20817525,34547665,Neuro Stepdown,Neuro Stepdown,2200-07-12 00:33:00,2200-07-13 16:44:40,1.6747685185185184"
## [7] "10215159,24283593,34569476,Trauma SICU (TSICU),Trauma SICU (TSICU),2124-09-20 15:05:29,2124-09-21 22:06:58,1.2926967592592593"
## [8] "14489052,26516390,35056286,Trauma SICU (TSICU),Trauma SICU (TSICU),2118-10-26 10:33:56,2118-10-26 20:28:10,0.4126620370370371"
## [9] "15914763,28906020,36909804,Trauma SICU (TSICU),Trauma SICU (TSICU),2176-12-14 12:00:00,2176-12-17 11:47:01,2.9909837962962964"
## [10] "16256226,20013290,39289362,Neuro Stepdown,Neuro Stepdown,2150-12-20 16:09:08,2150-12-21 14:58:40,0.9510648148148149"
Import icustays.csv.gz as a tibble icustays_tble.
How many unique subject_id are there? Can a subject_id have multiple ICU stays?
For each subject_id, let’s only keep the first ICU stay in the tibble icustays_tble.
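The counting and first-stay steps could look roughly like the following. This is a sketch on a made-up miniature table (icustays_toy and first_stays are hypothetical names), and it assumes "first" means the earliest intime.

```r
library(dplyr)

# Hypothetical miniature of icustays.csv.gz; the real table would come from
# read_csv() on the icu/icustays.csv.gz file under mimic_path.
icustays_toy <- tibble(
  subject_id = c(1, 1, 2, 3, 3),
  stay_id    = c(11, 12, 21, 31, 32),
  intime     = as.POSIXct(
    c("2150-01-05 10:00:00", "2150-01-01 08:00:00", "2151-03-02 09:30:00",
      "2152-07-07 12:00:00", "2152-07-01 23:15:00"),
    tz = "UTC"
  )
)

# Number of unique subjects
n_subjects <- n_distinct(icustays_toy$subject_id)

# Subjects with more than one ICU stay
multi_stay <- icustays_toy %>%
  count(subject_id) %>%
  filter(n > 1)

# Keep the earliest stay (by intime) for each subject
first_stays <- icustays_toy %>%
  group_by(subject_id) %>%
  slice_min(intime, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Setting with_ties = FALSE guarantees exactly one row per subject even if two stays share the same intime.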
admission data
Information about the patients admitted to the hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/core/admissions/ for details of each field in this file. The first 10 lines are
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/core/admissions.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag"
## [2] "14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0"
## [3] "15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
## [4] "11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"
## [5] "17817079,24709883,2165-12-27 17:33:00,2165-12-31 21:18:00,,ELECTIVE,,HOME,Other,ENGLISH,,OTHER,,,0"
## [6] "15078341,23272159,2122-08-28 08:48:00,2122-08-30 12:32:00,,ELECTIVE,,HOME,Other,ENGLISH,,BLACK/AFRICAN AMERICAN,,,0"
## [7] "19124609,20517215,2169-03-14 12:44:00,2169-03-20 19:15:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0"
## [8] "17301855,29732723,2140-06-06 14:23:00,2140-06-08 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
## [9] "17991012,24298836,2181-07-10 20:28:00,2181-07-12 15:49:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
## [10] "16865435,23216961,2185-07-19 02:12:00,2185-07-21 11:50:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0"
Import admissions.csv.gz as a tibble admissions_tble.
Let’s only keep the admissions that have a match in icustays_tble according to subject_id and hadm_id.
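A filtering join is one natural way to keep only the matching admissions. The sketch below uses dplyr's semi_join on made-up tables (icustays_toy, admissions_toy, and admissions_kept are hypothetical names).

```r
library(dplyr)

# Hypothetical miniatures of the two tables
icustays_toy <- tibble(
  subject_id = c(1, 2),
  hadm_id    = c(101, 202)
)
admissions_toy <- tibble(
  subject_id     = c(1, 1, 2, 3),
  hadm_id        = c(101, 102, 202, 303),
  admission_type = c("ELECTIVE", "URGENT", "ELECTIVE", "URGENT")
)

# semi_join keeps rows of admissions_toy that have a match in icustays_toy,
# without adding any columns from icustays_toy
admissions_kept <- admissions_toy %>%
  semi_join(icustays_toy, by = c("subject_id", "hadm_id"))
```

Unlike inner_join, semi_join never duplicates rows of the left table, which keeps the admissions table at one row per admission.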
Summarize the following variables by graphics.
patients data
Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/core/patients/ for details of each field in this file. The first 10 lines are
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/core/patients.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod"
## [2] "10000048,F,23,2126,2008 - 2010,"
## [3] "10002723,F,0,2128,2017 - 2019,"
## [4] "10003939,M,0,2184,2008 - 2010,"
## [5] "10004222,M,0,2161,2014 - 2016,"
## [6] "10005325,F,0,2154,2011 - 2013,"
## [7] "10007338,F,0,2153,2017 - 2019,"
## [8] "10008101,M,0,2142,2008 - 2010,"
## [9] "10009872,F,0,2168,2014 - 2016,"
## [10] "10011333,F,0,2132,2014 - 2016,"
Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/core/patients/) as a tibble patients_tble and only keep the patients who have a match in icustays_tble (according to subject_id).
Summarize variables gender and anchor_age, and explain any patterns you see.
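A possible shape for the gender and anchor_age summary is sketched here with made-up values (patients_toy, icustays_toy, and the derived object names are hypothetical). The bin width of 10 years is an arbitrary choice for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-ins; the values are invented
icustays_toy <- tibble(subject_id = c(1, 2, 3, 4))
patients_toy <- tibble(
  subject_id = c(1, 2, 3, 4, 5, 6),
  gender     = c("F", "M", "F", "M", "F", "M"),
  anchor_age = c(23, 91, 67, 54, 0, 45)
)

# Keep only patients with an ICU stay
patients_kept <- patients_toy %>%
  semi_join(icustays_toy, by = "subject_id")

# Numeric summary of anchor_age by gender
gender_age_summary <- patients_kept %>%
  group_by(gender) %>%
  summarise(
    n          = n(),
    median_age = median(anchor_age),
    min_age    = min(anchor_age),
    max_age    = max(anchor_age)
  )

# Overlaid histograms of anchor_age by gender
p <- ggplot(patients_kept, aes(x = anchor_age, fill = gender)) +
  geom_histogram(binwidth = 10, position = "identity", alpha = 0.5) +
  labs(x = "anchor_age (years)", y = "Number of patients")
```

Printing p draws the plot; on the real data the minimum and maximum of anchor_age are worth inspecting for values that look like coding conventions rather than true ages.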
labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/hosp/labevents.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "labevent_id,subject_id,hadm_id,specimen_id,itemid,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments"
## [2] "670,10000048,,6448755,51484,2126-11-22 19:20:00,2126-11-22 20:07:00,150,150,mg/dL,,,,STAT,"
## [3] "673,10000048,,6448755,51491,2126-11-22 19:20:00,2126-11-22 20:07:00,6.5,6.5,units,5,8,,STAT,"
## [4] "675,10000048,,6448755,51498,2126-11-22 19:20:00,2126-11-22 20:07:00,1.029,1.029, ,1.001,1.035,,STAT,"
## [5] "683,10000048,,82729055,50861,2126-11-22 20:45:00,2126-11-23 00:55:00,39,39,IU/L,0,40,,STAT,"
## [6] "684,10000048,,82729055,50862,2126-11-22 20:45:00,2126-11-23 00:55:00,4.7,4.7,g/dL,3.4,4.8,,STAT,"
## [7] "685,10000048,,82729055,50863,2126-11-22 20:45:00,2126-11-23 00:55:00,45,45,IU/L,39,117,,STAT,"
## [8] "686,10000048,,82729055,50868,2126-11-22 20:45:00,2126-11-22 21:32:00,13,13,mEq/L,8,20,,STAT,"
## [9] "687,10000048,,82729055,50878,2126-11-22 20:45:00,2126-11-23 00:55:00,28,28,IU/L,0,40,,STAT,"
## [10] "688,10000048,,82729055,50882,2126-11-22 20:45:00,2126-11-22 21:32:00,26,26,mEq/L,22,32,,STAT,"
d_labitems.csv.gz is the dictionary of lab measurements.
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/hosp/d_labitems.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "itemid,label,fluid,category,loinc_code"
## [2] "51905, ,Other Body Fluid,Chemistry,"
## [3] "51532,11-Deoxycorticosterone,Blood,Chemistry,"
## [4] "51957,17-Hydroxycorticosteroids,Urine,Chemistry,"
## [5] "51958,\"17-Ketosteroids, Urine\",Urine,Chemistry,"
## [6] "52068,24 Hr,Blood,Hematology,"
## [7] "51066,24 hr Calcium,Urine,Chemistry,"
## [8] "51067,24 hr Creatinine,Urine,Chemistry,"
## [9] "51068,24 hr Protein,Urine,Chemistry,"
## [10] "50853,25-OH Vitamin D,Blood,Chemistry,"
Find how many rows are in labevents.csv.gz.
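One way to count rows without loading the whole file into memory is to stream it in chunks through a gzfile connection. The helper count_lines below is a hypothetical name introduced for this sketch, and the file it is run on is a tiny made-up stand-in.

```r
# Count lines of a gzip-compressed text file by reading fixed-size chunks.
# count_lines is a hypothetical helper, not part of any package.
count_lines <- function(path, chunk_size = 100000L) {
  con <- gzfile(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Tiny made-up file: one header line plus five data rows
toy_file <- tempfile(fileext = ".csv.gz")
con <- gzfile(toy_file, "w")
writeLines(
  c("itemid,valuenum", "50912,1.1", "50971,4.2", "50983,140",
    "50902,101", "50882,26"),
  con
)
close(con)

# Subtract 1 for the header line
n_rows <- count_lines(toy_file) - 1
```

On a Unix-like system, piping zcat into wc -l through system(), in the same style as the zcat and head calls in this document, is an alternative that avoids R-level looping.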
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), and calcium (50893). Retrieve a subset of labevents.csv.gz only containing these items for the patients in icustays_tble as a tibble labevents_tble.
Hint: labevents.csv.gz is a data file too big to be read in by the read_csv function in its default setting. Utilize the col_select and lazy options in the read_csv function to reduce the memory burden.
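The hint can be illustrated on a small made-up file. The sketch below assumes readr 2.0 or later (where read_csv has the col_select and lazy arguments); the file contents, icu_subjects, and labevents_toy are hypothetical, and the choice of four columns to keep is an assumption about what later steps need.

```r
library(readr)
library(dplyr)

# Tiny made-up stand-in for labevents.csv.gz
toy_file <- tempfile(fileext = ".csv.gz")
con <- gzfile(toy_file, "w")
writeLines(
  c("labevent_id,subject_id,itemid,charttime,valuenum,comments",
    "1,10,50912,2150-01-01 08:00:00,1.1,",
    "2,10,51484,2150-01-01 08:00:00,150,",
    "3,20,50971,2150-02-01 09:00:00,4.2,",
    "4,30,50912,2150-03-01 10:00:00,0.9,"),
  con
)
close(con)

# Item IDs listed in the text above
lab_items <- c(50912, 50971, 50983, 50902, 50882,
               51221, 51301, 50931, 50960, 50893)
# Hypothetical set of subjects with an ICU stay
icu_subjects <- c(10, 20)

labevents_toy <- read_csv(
  toy_file,
  col_select = c(subject_id, itemid, charttime, valuenum),  # only needed columns
  lazy = TRUE,                                              # defer materializing values
  show_col_types = FALSE
) %>%
  filter(itemid %in% lab_items, subject_id %in% icu_subjects)
```

Selecting columns at read time matters most here, since the free-text columns of the real file account for much of its size.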
Further restrict labevents_tble to the first lab measurement during the ICU stay.
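Restricting to the first measurement during the stay might be sketched as follows. The tables are made up (icustays_toy, labevents_toy, first_labs are hypothetical names), and "during the ICU stay" is assumed to mean charttime between intime and outtime inclusive.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniatures with one ICU stay per subject
icustays_toy <- tibble(
  subject_id = c(10, 20),
  intime  = as.POSIXct(c("2150-01-01 06:00:00", "2150-02-01 00:00:00"), tz = "UTC"),
  outtime = as.POSIXct(c("2150-01-03 06:00:00", "2150-02-02 00:00:00"), tz = "UTC")
)
labevents_toy <- tibble(
  subject_id = c(10, 10, 10, 20, 20),
  itemid     = c(50912, 50912, 50971, 50912, 50912),
  charttime  = as.POSIXct(
    c("2149-12-31 23:00:00", "2150-01-01 08:00:00", "2150-01-01 09:00:00",
      "2150-02-01 05:00:00", "2150-02-01 03:00:00"),
    tz = "UTC"
  ),
  valuenum   = c(2.0, 1.1, 4.2, 0.8, 0.9)
)

first_labs <- labevents_toy %>%
  inner_join(icustays_toy, by = "subject_id") %>%
  # keep measurements charted within the stay
  filter(charttime >= intime, charttime <= outtime) %>%
  # earliest measurement per subject and lab item
  group_by(subject_id, itemid) %>%
  slice_min(charttime, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(subject_id, itemid, valuenum) %>%
  # one row per subject, one column per lab item
  pivot_wider(names_from = itemid, values_from = valuenum, names_prefix = "lab_")
```

The wide layout, with one row per subject, is convenient for later merging into a cohort table; a lab item never measured during the stay shows up as NA.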
Summarize the lab measurements by appropriate numerics and graphics.
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/icu/chartevents.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning"
## [2] "10003700,28623837,30600691,2165-04-24 05:10:00,2165-04-24 05:11:00,228236,0,0,,0"
## [3] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225067,0,0,,0"
## [4] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225070,1,1,,0"
## [5] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225076,1,1,,0"
## [6] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225078,1,1,,0"
## [7] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225086,1,1,,0"
## [8] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225091,1,1,,0"
## [9] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225103,1,1,,0"
## [10] "10003700,28623837,30600691,2165-04-24 05:12:00,2165-04-24 05:14:00,225106,1,1,,0"
d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.
system(
  str_c(
    "zcat < ",
    shQuote(str_c(mimic_path, "/icu/d_items.csv.gz")),
    " | head"
  ),
  intern = TRUE
)
## [1] "itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue"
## [2] "220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,"
## [3] "220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,"
## [4] "220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,"
## [5] "220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,"
## [6] "220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,"
## [7] "220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140"
## [8] "220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90"
## [9] "220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,"
## [10] "220056,Arterial Blood Pressure Alarm - Low,ABP Alarm - Low,chartevents,Alarms,mmHg,Numeric,,"
We are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble as a tibble chartevents_tble.
Further restrict chartevents_tble to the first vital measurement during the ICU stay.
Summarize these vital measurements by appropriate numerics and graphics.
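For the numeric part of the summary, joining a small label dictionary onto the measurements makes the output readable. The sketch below uses made-up values; vital_labels is a hypothetical hand-written lookup standing in for the relevant rows of d_items.csv.gz, and first_vitals_toy is a hypothetical long-format table of first measurements.

```r
library(dplyr)

# Hypothetical lookup; on real data, filter the d_items table instead
vital_labels <- tibble(
  itemid = c(220045, 220210),
  label  = c("Heart Rate", "Respiratory Rate")
)

# Made-up first vital measurements in long format
first_vitals_toy <- tibble(
  subject_id = c(1, 2, 3, 1, 2, 3),
  itemid     = c(220045, 220045, 220045, 220210, 220210, 220210),
  valuenum   = c(80, 95, 110, 16, 20, 24)
)

vitals_summary <- first_vitals_toy %>%
  left_join(vital_labels, by = "itemid") %>%
  group_by(label) %>%
  summarise(
    n      = n(),
    median = median(valuenum),
    q25    = quantile(valuenum, 0.25),
    q75    = quantile(valuenum, 0.75)
  )
```

Medians and quartiles are used here rather than means because charted vitals in real data can contain extreme entries that distort averages.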
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are
and columns contain at least the following variables
icustays.csv.gz
admissions.csv.gz
patients.csv.gz
thirty_day_mort whether the patient died within 30 days of hospital admission (30 day mortality)
Summarize the following information using appropriate numerics or graphs.
thirty_day_mort vs demographic variables (ethnicity, language, insurance, marital_status, gender, age at hospital admission)
thirty_day_mort vs first lab measurements
thirty_day_mort vs first vital measurements
thirty_day_mort vs first ICU unit
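The indicator and one of the comparisons could be computed roughly as follows. This is a sketch on a made-up cohort (cohort_toy and mort_by_insurance are hypothetical names). It assumes death is taken from the deathtime column of the admissions data; whether dod from the patients data should be used as well is left open.

```r
library(dplyr)

# Made-up cohort rows; NA deathtime means no recorded death
cohort_toy <- tibble(
  insurance = c("Medicare", "Medicare", "Other", "Other"),
  admittime = as.POSIXct(rep("2150-01-01 00:00:00", 4), tz = "UTC"),
  deathtime = as.POSIXct(
    c("2150-01-10 00:00:00", NA, "2150-03-01 00:00:00", NA),
    tz = "UTC"
  )
) %>%
  mutate(
    # days from hospital admission to death (NA if no recorded death)
    days_to_death   = as.numeric(difftime(deathtime, admittime, units = "days")),
    # TRUE only when a death is recorded within 30 days of admission
    thirty_day_mort = !is.na(days_to_death) & days_to_death <= 30
  )

# 30-day mortality rate by insurance type
mort_by_insurance <- cohort_toy %>%
  group_by(insurance) %>%
  summarise(n = n(), mort_rate = mean(thirty_day_mort))
```

The same group_by and summarise pattern applies to the other categorical variables; for continuous variables such as the first lab or vital measurements, grouping by thirty_day_mort and comparing distributions is the analogous step.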