Code in this lecture assumes that the MIMIC-IV data is available at ~/mimic.
Display the contents of the MIMIC-IV data folder:
ls -l ~/mimic
total 24
-rw-r--r-- 1 504 rstudio 13332 Jan 5 2023 CHANGELOG.txt
drwxr-xr-x 25 504 rstudio 800 Jan 17 21:41 hosp
drwxr-xr-x 11 504 rstudio 352 Nov 24 19:34 icu
-rw-r--r-- 1 504 rstudio 2518 Jan 6 2023 LICENSE.txt
-rw-r--r-- 1 504 rstudio 2884 Jan 6 2023 SHA256SUMS.txt
4 Read CSVs and deposit to an SQLite database
Here, we import only one CSV file, icustays.csv.gz, for demonstration purposes. Motivated students can write a Bash script that loads all MIMIC-IV data files into an SQLite database and contribute it to https://github.com/MIT-LCP/mimic-code.
Create an empty database file mimiciv.sqlite:
# bash command
touch mimiciv.sqlite
Deposit the icu/icustays.csv.gz file:
# bash command
# delete icustays table if exists
sqlite3 mimiciv.sqlite 'DROP TABLE IF EXISTS icustays;'
Create an empty icustays table with explicit data types. Because SQLite does not have a native date-time data type (https://www.sqlite.org/datatype3.html), we store intime and outtime as TEXT.
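The sqlite3 CLI commands themselves are not reproduced above. As a rough equivalent sketch in R (assuming the DBI, RSQLite, and readr packages; the connection object con is hypothetical), one could create the table with explicit column types and append the CSV contents:

library(DBI)
library(readr)

con <- dbConnect(RSQLite::SQLite(), "mimiciv.sqlite")

# create the empty icustays table; intime and outtime are stored as TEXT
dbExecute(con, "
  CREATE TABLE icustays (
    subject_id     INTEGER,
    hadm_id        INTEGER,
    stay_id        INTEGER,
    first_careunit TEXT,
    last_careunit  TEXT,
    intime         TEXT,
    outtime        TEXT,
    los            REAL
  );
")

# read the compressed CSV, keeping the time stamps as character strings,
# then append the rows to the icustays table
icustays_df <- read_csv(
  "~/mimic/icu/icustays.csv.gz",
  col_types = cols(intime = col_character(), outtime = col_character())
)
dbWriteTable(con, "icustays", icustays_df, append = TRUE)

dbDisconnect(con)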
# Source: table<icustays> [?? x 8]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
subject_id hadm_id stay_id first_careunit
<int> <int> <int> <chr>
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU)
2 10000980 26913865 39765666 Medical Intensive Care Unit (MICU)
3 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)
4 10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)
5 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
6 10001884 26184834 37510196 Medical Intensive Care Unit (MICU)
7 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)
8 10002155 20345487 32358465 Medical Intensive Care Unit (MICU)
9 10002155 23822395 33685454 Coronary Care Unit (CCU)
10 10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
last_careunit intime
<chr> <chr>
1 Medical Intensive Care Unit (MICU) 2180-07-23 14:00:00
2 Medical Intensive Care Unit (MICU) 2189-06-27 08:42:00
3 Surgical Intensive Care Unit (SICU) 2157-11-20 19:18:02
4 Surgical Intensive Care Unit (SICU) 2157-12-19 15:42:24
5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
6 Medical Intensive Care Unit (MICU) 2131-01-11 04:20:05
7 Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 10:00:53
8 Medical Intensive Care Unit (MICU) 2131-03-09 21:33:00
9 Coronary Care Unit (CCU) 2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
outtime los
<chr> <dbl>
1 2180-07-23 23:50:47 0.410
2 2189-06-27 20:38:27 0.498
3 2157-11-21 22:08:00 1.12
4 2157-12-20 14:27:41 0.948
5 2110-04-12 23:59:56 1.34
6 2131-01-20 08:27:30 9.17
7 2160-05-19 17:33:33 1.31
8 2131-03-10 18:09:21 0.859
9 2129-08-10 17:02:38 6.18
10 2130-09-27 22:13:41 3.89
# ℹ more rows
<SQL>
SELECT COUNT(*) AS `n`
FROM `icustays`
# Source: SQL [1 x 1]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
n
<int>
1 73181
6 Use dplyr with SQLite
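The lazy table icustays_tble used in the queries above and below is assumed to come from a DBI connection to the SQLite file. A minimal sketch, assuming the DBI, RSQLite, dplyr, and dbplyr packages (the connection object con is hypothetical):

library(DBI)
library(dplyr)

# connect to the SQLite database file created earlier
con <- dbConnect(RSQLite::SQLite(), "mimiciv.sqlite")

# a lazy tbl: rows stay in the database until collect() is called
icustays_tble <- tbl(con, "icustays")

# the row count is computed by SQLite, not in R
icustays_tble |>
  summarise(n = n()) |>
  show_query()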
Keep the first ICU stay for each patient:
icustays_subset <- icustays_tble |>
  # first ICU stay of each unique `subject_id`
  group_by(subject_id) |>
  slice_min(intime) |>
  ungroup() |>
  # arrange(intime, .by_group = TRUE) |>
  # slice_head(n = 1) |>
  # left_join(icustays_tble, by = c("subject_id", "intime")) |>
  show_query() |>
  print(width = Inf)
<SQL>
SELECT
`subject_id`,
`hadm_id`,
`stay_id`,
`first_careunit`,
`last_careunit`,
`intime`,
`outtime`,
`los`
FROM (
SELECT
`icustays`.*,
RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
FROM `icustays`
) AS `q01`
WHERE (`col01` <= 1)
# Source: SQL [?? x 8]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
subject_id hadm_id stay_id first_careunit
<int> <int> <int> <chr>
1 10000032 29079034 39553978 Medical Intensive Care Unit (MICU)
2 10000980 26913865 39765666 Medical Intensive Care Unit (MICU)
3 10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)
4 10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
5 10001884 26184834 37510196 Medical Intensive Care Unit (MICU)
6 10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)
7 10002155 23822395 33685454 Coronary Care Unit (CCU)
8 10002348 22725460 32610785 Neuro Intermediate
9 10002428 28662225 33987268 Medical Intensive Care Unit (MICU)
10 10002430 26295318 38392119 Coronary Care Unit (CCU)
last_careunit intime
<chr> <chr>
1 Medical Intensive Care Unit (MICU) 2180-07-23 14:00:00
2 Medical Intensive Care Unit (MICU) 2189-06-27 08:42:00
3 Surgical Intensive Care Unit (SICU) 2157-11-20 19:18:02
4 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
5 Medical Intensive Care Unit (MICU) 2131-01-11 04:20:05
6 Cardiac Vascular Intensive Care Unit (CVICU) 2160-05-18 10:00:53
7 Coronary Care Unit (CCU) 2129-08-04 12:45:00
8 Neuro Intermediate 2112-11-30 23:24:00
9 Medical Intensive Care Unit (MICU) 2156-04-12 16:24:18
10 Coronary Care Unit (CCU) 2129-06-13 00:43:08
outtime los
<chr> <dbl>
1 2180-07-23 23:50:47 0.410
2 2189-06-27 20:38:27 0.498
3 2157-11-21 22:08:00 1.12
4 2110-04-12 23:59:56 1.34
5 2131-01-20 08:27:30 9.17
6 2160-05-19 17:33:33 1.31
7 2129-08-10 17:02:38 6.18
8 2112-12-10 18:25:13 9.79
9 2156-04-17 15:57:08 4.98
10 2129-06-15 22:51:40 2.92
# ℹ more rows
<SQL>
SELECT COUNT(*) AS `n`
FROM (
SELECT
`subject_id`,
`hadm_id`,
`stay_id`,
`first_careunit`,
`last_careunit`,
`intime`,
`outtime`,
`los`
FROM (
SELECT
`icustays`.*,
RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
FROM `icustays`
) AS `q01`
WHERE (`col01` <= 1)
) AS `q01`
# Source: SQL [1 x 1]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
n
<int>
1 50920
7 SQL query
show_query() displays the SQL query that a dplyr pipeline is translated into, for example:
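A sketch that reproduces the SQL below, reusing the icustays_subset lazy table defined in the previous section:

# print the translated SQL; nothing is executed in the database yet
icustays_subset |>
  show_query()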
<SQL>
SELECT
`subject_id`,
`hadm_id`,
`stay_id`,
`first_careunit`,
`last_careunit`,
`intime`,
`outtime`,
`los`
FROM (
SELECT
`icustays`.*,
RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
FROM `icustays`
) AS `q01`
WHERE (`col01` <= 1)
8 Transform in database, plot in R
ggplot2 computes the plot in R, but the data transformation is done in the database. Whenever possible, do the data wrangling (filter, select, group_by, summarise) in the database.
icustays_tble |>
  group_by(subject_id) |>
  summarise(n = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = n)) +
  labs(x = "# ICU stays of a patient")
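Only the per-patient counts (one row per subject_id) need to leave the database. If your ggplot2 version does not accept a lazy table directly, collect() the small summary explicitly before plotting; a sketch:

library(ggplot2)

icustays_tble |>
  group_by(subject_id) |>
  summarise(n = n()) |>
  # bring the small per-patient summary into R before plotting
  collect() |>
  ggplot() +
  geom_bar(mapping = aes(x = n)) +
  labs(x = "# ICU stays of a patient")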
9 SQL translation
The dbplyr package (the dplyr backend for databases) has a function, translate_sql(), that lets you experiment with how R functions are translated to SQL:
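A few illustrative calls (a sketch; simulate_sqlite() mimics an SQLite connection without opening one):

library(dbplyr)

# arithmetic translates almost verbatim
translate_sql(x + 1, con = simulate_sqlite())

# mean() is translated to AVG()
translate_sql(mean(x, na.rm = TRUE), con = simulate_sqlite())

# conditionals are translated too
translate_sql(if_else(x > 0L, "pos", "neg"), con = simulate_sqlite())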
BTW, as modern data scientists, we should all start using DuckDB (https://duckdb.org/) instead of SQLite for analytical work. DuckDB is a modern, embeddable SQL OLAP database management system. Its column-store architecture is designed for analytical (OLAP) workloads on read-only data: queries are fast, data compress well, and it takes advantage of modern hardware. This makes it a great alternative to SQLite for analytics.
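Since DuckDB also speaks the standard DBI/dbplyr interface, switching is essentially a one-line change. A minimal sketch, assuming the duckdb package is installed (the file name mimiciv.duckdb is hypothetical):

library(DBI)
library(dplyr)

# a DuckDB database is a single file, just like SQLite
con_duck <- dbConnect(duckdb::duckdb(), dbdir = "mimiciv.duckdb")

# the same dplyr verbs work unchanged on DuckDB-backed lazy tables,
# e.g. tbl(con_duck, "icustays") |> count(), once tables are loaded

dbDisconnect(con_duck, shutdown = TRUE)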
15 DuckDB vs BigQuery
Let’s now compare DuckDB with BigQuery. We will use the MIMIC-IV data for this comparison: calculate the average measurement for 5 vitals: 220045 (heart rate), 220179 (systolic blood pressure), 220180 (diastolic blood pressure), 220210 (respiratory rate), and 223761 (temperature in Fahrenheit).
open_dataset(sources = "chartevents_pq", format = "parquet") |>
  # create a virtual table in DuckDB
  to_duckdb() |>
  # filter rows
  filter(itemid %in% c(220045L, 220179L, 220180L, 223761L, 220210L)) |>
  # group by and compute the mean of each vital
  group_by(itemid) |>
  summarise(mean = mean(valuenum, na.rm = TRUE)) |>
  # force computation
  collect() |>
  # pull item info
  left_join(
    read_csv("~/mimic/icu/d_items.csv.gz", show_col_types = FALSE) |>
      select(itemid, label),
    by = "itemid"
  ) |>
  arrange(itemid) |>
  # print results
  print() |>
  system.time()
# A tibble: 5 × 3
itemid mean label
<dbl> <dbl> <chr>
1 220045 87.8 Heart Rate
2 220179 121. Non Invasive Blood Pressure systolic
3 220180 66.4 Non Invasive Blood Pressure diastolic
4 220210 20.5 Respiratory Rate
5 223761 98.7 Temperature Fahrenheit
user system elapsed
10.761 2.707 3.480
Achieving this task in under 4 seconds is impressive.
How about BigQuery? We use the bigrquery package to connect to BigQuery and run the same query. BigQuery completes the task in under 3 seconds. But the comparison may not be fair: BigQuery may have cached this particular query if it has been run before.
# BigQuery authentication using service account token
bq_auth(path = "biostat-203b-2024-winter-313290ce47a6.json")
# Connect to the BigQuery database `biostat-203b-2024-winter.mimic4_v2_2`
con_bq <- dbConnect(
  bigrquery::bigquery(),
  project = "biostat-203b-2024-winter",
  dataset = "mimic4_v2_2",
  billing = "biostat-203b-2024-winter"
)
dbListTables(con_bq)
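The same aggregation can then be issued through dplyr against BigQuery. A sketch, assuming the dataset contains a chartevents table with itemid and valuenum columns:

library(dplyr)

tbl(con_bq, "chartevents") |>
  # same five vitals as in the DuckDB run
  filter(itemid %in% c(220045L, 220179L, 220180L, 223761L, 220210L)) |>
  group_by(itemid) |>
  summarise(mean = mean(valuenum, na.rm = TRUE)) |>
  # force computation on BigQuery and pull the 5-row result into R
  collect() |>
  arrange(itemid) |>
  print() |>
  system.time()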