Introduction

In this lecture we will learn:

Machine information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   stringi_1.7.6   rlang_1.0.1     cli_3.1.0      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.11 
## [13] tools_3.6.0     stringr_1.4.0   xfun_0.29       yaml_2.2.1     
## [17] fastmap_1.1.0   compiler_3.6.0  htmltools_0.5.2 knitr_1.37     
## [21] sass_0.4.0

Load necessary R packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(DBI)
library(RSQLite)
library("dbplyr")
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
# display version of SQLite
sqlite3 --version
## 3.7.17 2013-05-20 00:56:22 118a3b35693b134d56ebd780123b7fd6f1497668

CSV file

The /mnt/mimiciv/1.0 folder on the teaching server contains the MIMIC-IV data. On my Mac, it’s at /Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0.

Display content of MIMIC-IV data files:

os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/mnt/mimiciv/1.0"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0"
}
mimic_path
## [1] "/mnt/mimiciv/1.0"
system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)
##  [1] "/mnt/mimiciv/1.0"                                                                                             
##  [2] "├── [       2356]  biostat-203b-2022winter-3fdc2392ac39.json"                                                 
##  [3] "├── [        439]  client_secret_34349235653-sidv2drmetcimi1fba4tj024aqth8qmr.apps.googleusercontent.com.json"
##  [4] "├── [       4096]  core"                                                                                      
##  [5] "│   ├── [   17208966]  admissions.csv.gz"                                                                     
##  [6] "│   ├── [        606]  index.html"                                                                            
##  [7] "│   ├── [    2955582]  patients.csv.gz"                                                                       
##  [8] "│   └── [   53014503]  transfers.csv.gz"                                                                      
##  [9] "├── [       4096]  hosp"                                                                                      
## [10] "│   ├── [     430049]  d_hcpcs.csv.gz"                                                                        
## [11] "│   ├── [   29531152]  diagnoses_icd.csv.gz"                                                                  
## [12] "│   ├── [     863239]  d_icd_diagnoses.csv.gz"                                                                
## [13] "│   ├── [     579998]  d_icd_procedures.csv.gz"                                                               
## [14] "│   ├── [      14898]  d_labitems.csv.gz"                                                                     
## [15] "│   ├── [   11684062]  drgcodes.csv.gz"                                                                       
## [16] "│   ├── [  515763427]  emar.csv.gz"                                                                           
## [17] "│   ├── [  476252563]  emar_detail.csv.gz"                                                                    
## [18] "│   ├── [    2098831]  hcpcsevents.csv.gz"                                                                    
## [19] "│   ├── [       2325]  index.html"                                                                            
## [20] "│   ├── [ 2091865786]  labevents.csv.gz"                                                                      
## [21] "│   ├── [  171624288]  labevents_filtered_itemid.csv.gz"                                                      
## [22] "│   ├── [   99133381]  microbiologyevents.csv.gz"                                                             
## [23] "│   ├── [  422874088]  pharmacy.csv.gz"                                                                       
## [24] "│   ├── [  501381155]  poe.csv.gz"                                                                            
## [25] "│   ├── [   24020923]  poe_detail.csv.gz"                                                                     
## [26] "│   ├── [  367041717]  prescriptions.csv.gz"                                                                  
## [27] "│   ├── [    7750325]  procedures_icd.csv.gz"                                                                 
## [28] "│   └── [    9565293]  services.csv.gz"                                                                       
## [29] "├── [       4096]  icu"                                                                                       
## [30] "│   ├── [ 2350783547]  chartevents.csv.gz"                                                                    
## [31] "│   ├── [  110272408]  chartevents_filtered_itemid.csv.gz"                                                    
## [32] "│   ├── [   43296273]  datetimeevents.csv.gz"                                                                 
## [33] "│   ├── [      55917]  d_items.csv.gz"                                                                        
## [34] "│   ├── [    2848628]  icustays.csv.gz"                                                                       
## [35] "│   ├── [       1103]  index.html"                                                                            
## [36] "│   ├── [  352443512]  inputevents.csv.gz"                                                                    
## [37] "│   ├── [   37095672]  outputevents.csv.gz"                                                                   
## [38] "│   └── [   20567368]  procedureevents.csv.gz"                                                                
## [39] "├── [        797]  index.html"                                                                                
## [40] "├── [       2518]  LICENSE.txt"                                                                               
## [41] "└── [       2459]  SHA256SUMS.txt"                                                                            
## [42] ""                                                                                                             
## [43] "3 directories, 37 files"

Read CSVs and deposit to an SQLite database

Here, we will import only one csv file icustays.csv.gz for demonstration purpose. Motivated students can write a bash script for loading all MIMIC-IV data files into a SQLite database and contribute to https://github.com/MIT-LCP/mimic-code.

Create an empty database file mimiciv.sqlite:

cmd <- 'touch mimiciv.sqlite'
system(cmd, intern = TRUE)
## character(0)

Deposit the icu/icustatys.csv.gz file:

# delete icustays table if exists
cmd <- "sqlite3 mimiciv.sqlite 'DROP TABLE IF EXISTS icustays;'"
system(cmd, intern = TRUE)
## character(0)

Create an empty icustays table with data types. Because SQLite does not support date-time data type (https://www.sqlite.org/datatype3.html), we store intime and outtime as TEXT.

cmd <- "sqlite3 mimiciv.sqlite 'CREATE TABLE icustays (subject_id INTEGER, hadm_id INTEGER, stay_id INTEGER, first_careunit TEXT, last_careunit TEXT, intime TEXT, outtime TEXT, los REAL)'"
system(cmd, intern = TRUE)
## character(0)
csvfile <- str_c(mimic_path, "/icu/icustays.csv.gz")
cmd <- str_c(
  "zcat < ",
  shQuote(csvfile),
  " | tail -n +2 | ",
  "sqlite3 mimiciv.sqlite -csv ",
  "'.import /dev/stdin icustays'"
)
cmd
## [1] "zcat < '/mnt/mimiciv/1.0/icu/icustays.csv.gz' | tail -n +2 | sqlite3 mimiciv.sqlite -csv '.import /dev/stdin icustays'"

Execute the bash command:

system(cmd, intern = TRUE)
## character(0)

Read data from database

Connect to the database mimiciii.sqlite and list the tables:

con <- dbConnect(RSQLite::SQLite(), 
                 dbname = "./mimiciv.sqlite"
                 )
dbListTables(con)
## [1] "icustays"

Read the table icustays:

icustays_tble <- tbl(con, "icustays") %>% 
  print(width = Inf)
## # Source:   table<icustays> [?? x 8]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##    subject_id  hadm_id  stay_id first_careunit          
##         <int>    <int>    <int> <chr>                   
##  1   17867402 24528534 31793211 Trauma SICU (TSICU)     
##  2   14435996 28960964 31983544 Trauma SICU (TSICU)     
##  3   17609946 27385897 33183475 Trauma SICU (TSICU)     
##  4   18966770 23483021 34131444 Trauma SICU (TSICU)     
##  5   12776735 20817525 34547665 Neuro Stepdown          
##  6   10215159 24283593 34569476 Trauma SICU (TSICU)     
##  7   14489052 26516390 35056286 Trauma SICU (TSICU)     
##  8   15914763 28906020 36909804 Trauma SICU (TSICU)     
##  9   16256226 20013290 39289362 Neuro Stepdown          
## 10   19194449 21641999 39387567 Coronary Care Unit (CCU)
##    last_careunit            intime              outtime               los
##    <chr>                    <chr>               <chr>               <dbl>
##  1 Trauma SICU (TSICU)      2154-03-03 04:11:00 2154-03-04 18:16:56 1.59 
##  2 Trauma SICU (TSICU)      2150-06-19 17:57:00 2150-06-22 18:33:54 3.03 
##  3 Trauma SICU (TSICU)      2138-02-05 18:54:00 2138-02-15 12:42:05 9.74 
##  4 Trauma SICU (TSICU)      2123-10-25 10:35:00 2123-10-25 18:59:47 0.351
##  5 Neuro Stepdown           2200-07-12 00:33:00 2200-07-13 16:44:40 1.67 
##  6 Trauma SICU (TSICU)      2124-09-20 15:05:29 2124-09-21 22:06:58 1.29 
##  7 Trauma SICU (TSICU)      2118-10-26 10:33:56 2118-10-26 20:28:10 0.413
##  8 Trauma SICU (TSICU)      2176-12-14 12:00:00 2176-12-17 11:47:01 2.99 
##  9 Neuro Stepdown           2150-12-20 16:09:08 2150-12-21 14:58:40 0.951
## 10 Coronary Care Unit (CCU) 2123-11-12 02:53:35 2123-11-12 13:52:03 0.457
## # … with more rows

How many rows?

icustays_tble %>% 
  show_query() %>%
  summarise(n = n())
## <SQL>
## SELECT *
## FROM `icustays`
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##       n
##   <int>
## 1 76540

Use dplyr with SQLite

Keep the first ICU stay for each patient:

icustays_subset <- icustays_tble %>%
  # first ICU stay of each unique `subject_id`
  group_by(subject_id) %>%
  slice_min(intime) %>%
  ungroup() %>%
  # arrange(intime, .by_group = TRUE) %>%
  # slice_head(n = 1) %>%
  # left_join(icustays_tble, by = c("subject_id", "intime")) %>%
  show_query() %>%
  print(width = Inf)
## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)
## # Source:     lazy query [?? x 8]
## # Database:   sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
## # Ordered by: intime
##    subject_id  hadm_id  stay_id first_careunit                                  
##         <int>    <int>    <int> <chr>                                           
##  1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
##  2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
##  3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
##  4   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
##  5   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
##  6   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
##  7   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
##  8   10002223 22494570 39638202 Trauma SICU (TSICU)                             
##  9   10002348 22725460 32610785 Neuro Intermediate                              
## 10   10002428 28662225 33987268 Medical Intensive Care Unit (MICU)              
##    last_careunit                                    intime             
##    <chr>                                            <chr>              
##  1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
##  2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
##  3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
##  4 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
##  5 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
##  6 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
##  7 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
##  8 Trauma SICU (TSICU)                              2158-01-15 08:01:49
##  9 Neuro Intermediate                               2112-11-30 23:24:00
## 10 Medical Intensive Care Unit (MICU)               2156-04-12 16:24:18
##    outtime               los
##    <chr>               <dbl>
##  1 2180-07-23 23:50:47 0.410
##  2 2189-06-27 20:38:27 0.498
##  3 2157-11-21 22:08:00 1.12 
##  4 2110-04-12 23:59:56 1.34 
##  5 2131-01-20 08:27:30 9.17 
##  6 2160-05-19 17:33:33 1.31 
##  7 2129-08-10 17:02:38 6.18 
##  8 2158-01-16 15:19:24 1.30 
##  9 2112-12-10 18:25:13 9.79 
## 10 2156-04-17 15:57:08 4.98 
## # … with more rows

How many rows in icustays_subset?

icustays_subset %>%
  show_query() %>%
  summarise(n = n())
## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##       n
##   <int>
## 1 53150

SQL query

show_query usefully shows the SQL query translated from dplyr query.

class(icustays_subset)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
show_query(icustays_subset)
## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)

Transform in database, plot in R

icustays_tble %>%
  group_by(subject_id) %>%
  summarise(n = n()) %>%
  ggplot() +
  geom_bar(mapping = aes(x = n)) + 
  labs(x = "# ICU stays of a patient")

SQL translation

dbplyr package (a dplyr backend for databases) has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:

translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER(`x`, 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> `x` % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> CONCAT_WS(' ', `x`, `y`)
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> AVG(`x`) OVER ()

Timings

Let’s compare the timings of dplyr (in-memory) and dbplyr (on disk database).

csvfile <- str_c(mimic_path, "/icu/icustays.csv.gz")
icustays_tibble <- read_csv(csvfile)
## Rows: 76540 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): first_careunit, last_careunit
## dbl  (4): subject_id, hadm_id, stay_id, los
## dttm (2): intime, outtime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
timing_tibble <- system.time(
  icustays_tibble %>%
    group_by(subject_id) %>%
    summarize(n = n())
)
timing_tibble
##    user  system elapsed 
##   1.442   0.001   1.444
icustays_sql <- tbl(con, "icustays")
timing_sql <- system.time(
  icustays_sql %>%
    group_by(subject_id) %>%
    summarize(n = n())
)
timing_sql
##    user  system elapsed 
##   0.002   0.000   0.002

SQLite (0.002 seconds) was much faster than tibble (1.444 seconds). But SQLite is disk-based, while the tibble is in memory. Why is the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

icustays_sql %>%
  group_by(subject_id) %>%
  summarize(n = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##    subject_id     n
##         <int> <int>
##  1   10000032     1
##  2   10000980     1
##  3   10001217     2
##  4   10001725     1
##  5   10001884     1
##  6   10002013     1
##  7   10002155     3
##  8   10002223     1
##  9   10002348     1
## 10   10002428     4
## # … with more rows

Full query

To force a full query and return a complete table it is necessary to use the collect function.

system.time(
  icustays_sql %>%
    group_by(subject_id) %>%
    summarize(n = n()) %>%
    collect()
)
##    user  system elapsed 
##   0.124   0.028   0.153

Old ggplot2 doesn’t do ``Transform in database, plot in R”

Earlier we see dbplyr connects with ggplot2 (v3.3.5) seamlessly. Remember ggplot2 needs to collect the query results for plotting.

icustays_sql %>%
  count(subject_id) %>%
  ggplot() + 
  geom_boxplot(mapping = aes(y = n))

Older version of ggplot2, e.g., v2.2.1, will output error. This is because ggplot2 needed to compute the count per bin by going through all the rows. But here icustays_sql is just a pointer to the SQLite table. We had to use the transform in database, plot in R strategy.

icustays_sql %>%
  count(subject_id) %>%
  collect() %>%
  ggplot() + 
  geom_boxplot(mapping = aes(y = n))

Close connection to database

dbDisconnect(con)