Working with Databases

Introduction
Machine information
CSV file
Read CSVs and deposit to an SQLite database
Read data from database
Use dplyr with SQLite
SQL query
Transform in database, plot in R
SQL translation
Timings
Laziness
Full query
Old ggplot2 doesn’t do ``Transform in database, plot in R”
Close connection to database

Introduction

In this lecture we will learn:

Import data from bigish csv files (MIMIC-IV).
Deposit data into an SQLite database.
Query SQLite database.
Transform in database and plot in R.

Machine information

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   stringi_1.7.6   rlang_1.0.1     cli_3.1.0      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.11 
## [13] tools_3.6.0     stringr_1.4.0   xfun_0.29       yaml_2.2.1     
## [17] fastmap_1.1.0   compiler_3.6.0  htmltools_0.5.2 knitr_1.37     
## [21] sass_0.4.0

Load necessary R packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(DBI)
library(RSQLite)
library("dbplyr")

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

# display version of SQLite
sqlite3 --version

## 3.7.17 2013-05-20 00:56:22 118a3b35693b134d56ebd780123b7fd6f1497668

CSV file

The /mnt/mimiciv/1.0 folder on the teaching server contains the MIMIC-IV data. On my Mac, it’s at /Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0.

Display content of MIMIC-IV data files:

os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/mnt/mimiciv/1.0"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0"
}
mimic_path

## [1] "/mnt/mimiciv/1.0"

system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)

##  [1] "/mnt/mimiciv/1.0"                                                                                             
##  [2] "├── [       2356]  biostat-203b-2022winter-3fdc2392ac39.json"                                                 
##  [3] "├── [        439]  client_secret_34349235653-sidv2drmetcimi1fba4tj024aqth8qmr.apps.googleusercontent.com.json"
##  [4] "├── [       4096]  core"                                                                                      
##  [5] "│   ├── [   17208966]  admissions.csv.gz"                                                                     
##  [6] "│   ├── [        606]  index.html"                                                                            
##  [7] "│   ├── [    2955582]  patients.csv.gz"                                                                       
##  [8] "│   └── [   53014503]  transfers.csv.gz"                                                                      
##  [9] "├── [       4096]  hosp"                                                                                      
## [10] "│   ├── [     430049]  d_hcpcs.csv.gz"                                                                        
## [11] "│   ├── [   29531152]  diagnoses_icd.csv.gz"                                                                  
## [12] "│   ├── [     863239]  d_icd_diagnoses.csv.gz"                                                                
## [13] "│   ├── [     579998]  d_icd_procedures.csv.gz"                                                               
## [14] "│   ├── [      14898]  d_labitems.csv.gz"                                                                     
## [15] "│   ├── [   11684062]  drgcodes.csv.gz"                                                                       
## [16] "│   ├── [  515763427]  emar.csv.gz"                                                                           
## [17] "│   ├── [  476252563]  emar_detail.csv.gz"                                                                    
## [18] "│   ├── [    2098831]  hcpcsevents.csv.gz"                                                                    
## [19] "│   ├── [       2325]  index.html"                                                                            
## [20] "│   ├── [ 2091865786]  labevents.csv.gz"                                                                      
## [21] "│   ├── [  171624288]  labevents_filtered_itemid.csv.gz"                                                      
## [22] "│   ├── [   99133381]  microbiologyevents.csv.gz"                                                             
## [23] "│   ├── [  422874088]  pharmacy.csv.gz"                                                                       
## [24] "│   ├── [  501381155]  poe.csv.gz"                                                                            
## [25] "│   ├── [   24020923]  poe_detail.csv.gz"                                                                     
## [26] "│   ├── [  367041717]  prescriptions.csv.gz"                                                                  
## [27] "│   ├── [    7750325]  procedures_icd.csv.gz"                                                                 
## [28] "│   └── [    9565293]  services.csv.gz"                                                                       
## [29] "├── [       4096]  icu"                                                                                       
## [30] "│   ├── [ 2350783547]  chartevents.csv.gz"                                                                    
## [31] "│   ├── [  110272408]  chartevents_filtered_itemid.csv.gz"                                                    
## [32] "│   ├── [   43296273]  datetimeevents.csv.gz"                                                                 
## [33] "│   ├── [      55917]  d_items.csv.gz"                                                                        
## [34] "│   ├── [    2848628]  icustays.csv.gz"                                                                       
## [35] "│   ├── [       1103]  index.html"                                                                            
## [36] "│   ├── [  352443512]  inputevents.csv.gz"                                                                    
## [37] "│   ├── [   37095672]  outputevents.csv.gz"                                                                   
## [38] "│   └── [   20567368]  procedureevents.csv.gz"                                                                
## [39] "├── [        797]  index.html"                                                                                
## [40] "├── [       2518]  LICENSE.txt"                                                                               
## [41] "└── [       2459]  SHA256SUMS.txt"                                                                            
## [42] ""                                                                                                             
## [43] "3 directories, 37 files"

Read CSVs and deposit to an SQLite database

Here, we will import only one csv file icustays.csv.gz for demonstration purpose. Motivated students can write a bash script for loading all MIMIC-IV data files into a SQLite database and contribute to https://github.com/MIT-LCP/mimic-code.

Create an empty database file mimiciv.sqlite:

cmd <- 'touch mimiciv.sqlite'
system(cmd, intern = TRUE)

## character(0)

Deposit the icu/icustatys.csv.gz file:

# delete icustays table if exists
cmd <- "sqlite3 mimiciv.sqlite 'DROP TABLE IF EXISTS icustays;'"
system(cmd, intern = TRUE)

## character(0)

Create an empty icustays table with data types. Because SQLite does not support date-time data type (https://www.sqlite.org/datatype3.html), we store intime and outtime as TEXT.

cmd <- "sqlite3 mimiciv.sqlite 'CREATE TABLE icustays (subject_id INTEGER, hadm_id INTEGER, stay_id INTEGER, first_careunit TEXT, last_careunit TEXT, intime TEXT, outtime TEXT, los REAL)'"
system(cmd, intern = TRUE)

## character(0)

csvfile <- str_c(mimic_path, "/icu/icustays.csv.gz")
cmd <- str_c(
  "zcat < ",
  shQuote(csvfile),
  " | tail -n +2 | ",
  "sqlite3 mimiciv.sqlite -csv ",
  "'.import /dev/stdin icustays'"
)
cmd

## [1] "zcat < '/mnt/mimiciv/1.0/icu/icustays.csv.gz' | tail -n +2 | sqlite3 mimiciv.sqlite -csv '.import /dev/stdin icustays'"

Execute the bash command:

system(cmd, intern = TRUE)

## character(0)

Read data from database

Connect to the database mimiciii.sqlite and list the tables:

con <- dbConnect(RSQLite::SQLite(), 
                 dbname = "./mimiciv.sqlite"
                 )
dbListTables(con)

## [1] "icustays"

Read the table icustays:

icustays_tble <- tbl(con, "icustays") %>% 
  print(width = Inf)

## # Source:   table<icustays> [?? x 8]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##    subject_id  hadm_id  stay_id first_careunit          
##         <int>    <int>    <int> <chr>                   
##  1   17867402 24528534 31793211 Trauma SICU (TSICU)     
##  2   14435996 28960964 31983544 Trauma SICU (TSICU)     
##  3   17609946 27385897 33183475 Trauma SICU (TSICU)     
##  4   18966770 23483021 34131444 Trauma SICU (TSICU)     
##  5   12776735 20817525 34547665 Neuro Stepdown          
##  6   10215159 24283593 34569476 Trauma SICU (TSICU)     
##  7   14489052 26516390 35056286 Trauma SICU (TSICU)     
##  8   15914763 28906020 36909804 Trauma SICU (TSICU)     
##  9   16256226 20013290 39289362 Neuro Stepdown          
## 10   19194449 21641999 39387567 Coronary Care Unit (CCU)
##    last_careunit            intime              outtime               los
##    <chr>                    <chr>               <chr>               <dbl>
##  1 Trauma SICU (TSICU)      2154-03-03 04:11:00 2154-03-04 18:16:56 1.59 
##  2 Trauma SICU (TSICU)      2150-06-19 17:57:00 2150-06-22 18:33:54 3.03 
##  3 Trauma SICU (TSICU)      2138-02-05 18:54:00 2138-02-15 12:42:05 9.74 
##  4 Trauma SICU (TSICU)      2123-10-25 10:35:00 2123-10-25 18:59:47 0.351
##  5 Neuro Stepdown           2200-07-12 00:33:00 2200-07-13 16:44:40 1.67 
##  6 Trauma SICU (TSICU)      2124-09-20 15:05:29 2124-09-21 22:06:58 1.29 
##  7 Trauma SICU (TSICU)      2118-10-26 10:33:56 2118-10-26 20:28:10 0.413
##  8 Trauma SICU (TSICU)      2176-12-14 12:00:00 2176-12-17 11:47:01 2.99 
##  9 Neuro Stepdown           2150-12-20 16:09:08 2150-12-21 14:58:40 0.951
## 10 Coronary Care Unit (CCU) 2123-11-12 02:53:35 2123-11-12 13:52:03 0.457
## # … with more rows

How many rows?

icustays_tble %>% 
  show_query() %>%
  summarise(n = n())

## <SQL>
## SELECT *
## FROM `icustays`

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##       n
##   <int>
## 1 76540

Use dplyr with SQLite

Keep the first ICU stay for each patient:

icustays_subset <- icustays_tble %>%
  # first ICU stay of each unique `subject_id`
  group_by(subject_id) %>%
  slice_min(intime) %>%
  ungroup() %>%
  # arrange(intime, .by_group = TRUE) %>%
  # slice_head(n = 1) %>%
  # left_join(icustays_tble, by = c("subject_id", "intime")) %>%
  show_query() %>%
  print(width = Inf)

## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)
## # Source:     lazy query [?? x 8]
## # Database:   sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
## # Ordered by: intime
##    subject_id  hadm_id  stay_id first_careunit                                  
##         <int>    <int>    <int> <chr>                                           
##  1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
##  2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
##  3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
##  4   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
##  5   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
##  6   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
##  7   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
##  8   10002223 22494570 39638202 Trauma SICU (TSICU)                             
##  9   10002348 22725460 32610785 Neuro Intermediate                              
## 10   10002428 28662225 33987268 Medical Intensive Care Unit (MICU)              
##    last_careunit                                    intime             
##    <chr>                                            <chr>              
##  1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
##  2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
##  3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
##  4 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
##  5 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
##  6 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
##  7 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
##  8 Trauma SICU (TSICU)                              2158-01-15 08:01:49
##  9 Neuro Intermediate                               2112-11-30 23:24:00
## 10 Medical Intensive Care Unit (MICU)               2156-04-12 16:24:18
##    outtime               los
##    <chr>               <dbl>
##  1 2180-07-23 23:50:47 0.410
##  2 2189-06-27 20:38:27 0.498
##  3 2157-11-21 22:08:00 1.12 
##  4 2110-04-12 23:59:56 1.34 
##  5 2131-01-20 08:27:30 9.17 
##  6 2160-05-19 17:33:33 1.31 
##  7 2129-08-10 17:02:38 6.18 
##  8 2158-01-16 15:19:24 1.30 
##  9 2112-12-10 18:25:13 9.79 
## 10 2156-04-17 15:57:08 4.98 
## # … with more rows

How many rows in icustays_subset?

icustays_subset %>%
  show_query() %>%
  summarise(n = n())

## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##       n
##   <int>
## 1 53150

SQL query

show_query usefully shows the SQL query translated from dplyr query.

class(icustays_subset)

## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"

show_query(icustays_subset)

## <SQL>
## SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`
## FROM (SELECT `subject_id`, `hadm_id`, `stay_id`, `first_careunit`, `last_careunit`, `intime`, `outtime`, `los`, RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `q01`
## FROM `icustays`)
## WHERE (`q01` <= 1)

Transform in database, plot in R

icustays_tble %>%
  group_by(subject_id) %>%
  summarise(n = n()) %>%
  ggplot() +
  geom_bar(mapping = aes(x = n)) + 
  labs(x = "# ICU stays of a patient")

SQL translation

dbplyr package (a dplyr backend for databases) has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:

translate_sql(x == 1 & (y < 2 | z > 3))

## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)

translate_sql(x ^ 2 < 10)

## <SQL> POWER(`x`, 2.0) < 10.0

translate_sql(x %% 2 == 10)

## <SQL> `x` % 2.0 = 10.0

translate_sql(paste(x, y))

## <SQL> CONCAT_WS(' ', `x`, `y`)

translate_sql(mean(x))

## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.

## <SQL> AVG(`x`) OVER ()

translate_sql(mean(x, na.rm = TRUE))

## <SQL> AVG(`x`) OVER ()

Timings

Let’s compare the timings of dplyr (in-memory) and dbplyr (on disk database).

dplyr using tibble:

csvfile <- str_c(mimic_path, "/icu/icustays.csv.gz")
icustays_tibble <- read_csv(csvfile)

## Rows: 76540 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): first_careunit, last_careunit
## dbl  (4): subject_id, hadm_id, stay_id, los
## dttm (2): intime, outtime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

timing_tibble <- system.time(
  icustays_tibble %>%
    group_by(subject_id) %>%
    summarize(n = n())
)
timing_tibble

##    user  system elapsed 
##   1.442   0.001   1.444

dbplyr using SQLite:

icustays_sql <- tbl(con, "icustays")
timing_sql <- system.time(
  icustays_sql %>%
    group_by(subject_id) %>%
    summarize(n = n())
)
timing_sql

##    user  system elapsed 
##   0.002   0.000   0.002

SQLite (0.002 seconds) was much faster than tibble (1.444 seconds). But SQLite is disk-based, while the tibble is in memory. Why is the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query, often we don’t want the entire table. We want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dbplyr
- won’t connect and query the database until absolutely necessary (e.g. show output),
- and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like

icustays_sql %>%
  group_by(subject_id) %>%
  summarize(n = n())

## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.0
## #   [/home/huazhou/203b-2022winter/slides/12-dbplyr/mimiciv.sqlite]
##    subject_id     n
##         <int> <int>
##  1   10000032     1
##  2   10000980     1
##  3   10001217     2
##  4   10001725     1
##  5   10001884     1
##  6   10002013     1
##  7   10002155     3
##  8   10002223     1
##  9   10002348     1
## 10   10002428     4
## # … with more rows

Full query

To force a full query and return a complete table it is necessary to use the collect function.

system.time(
  icustays_sql %>%
    group_by(subject_id) %>%
    summarize(n = n()) %>%
    collect()
)

##    user  system elapsed 
##   0.124   0.028   0.153

Old ggplot2 doesn’t do ``Transform in database, plot in R”

Earlier we see dbplyr connects with ggplot2 (v3.3.5) seamlessly. Remember ggplot2 needs to collect the query results for plotting.

icustays_sql %>%
  count(subject_id) %>%
  ggplot() + 
  geom_boxplot(mapping = aes(y = n))

Older version of ggplot2, e.g., v2.2.1, will output error. This is because ggplot2 needed to compute the count per bin by going through all the rows. But here icustays_sql is just a pointer to the SQLite table. We had to use the transform in database, plot in R strategy.

icustays_sql %>%
  count(subject_id) %>%
  collect() %>%
  ggplot() + 
  geom_boxplot(mapping = aes(y = n))

Close connection to database

dbDisconnect(con)

Working with Databases - Part II

Biostat 203B

Dr. Hua Zhou @ UCLA

Feb 10, 2022