Working with Databases - Part II

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 22, 2024

1 Introduction

In this lecture we will demonstrate:

  • Import data from biggish CSV files (MIMIC-IV).

  • Deposit data into an SQLite database.

  • Query SQLite database.

  • Transform in database and plot in R.

  • Compare SQLite, DuckDB, and BigQuery.

2 Machine information

Display machine information for reproducibility.

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.42        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load the necessary R packages:

library(bigrquery)
library(tidyverse)
library(DBI)
library(RSQLite)
library(dbplyr)
library(duckdb)
library(arrow)
# bash command
# display version of SQLite
sqlite3 --version
3.37.2 2022-01-06 13:25:41 872ba256cbf61d9290b571c0e6d82a20c224ca3ad82971edc46b29818d5dalt1

3 CSV file

Code in this lecture assumes that the MIMIC-IV data is available at ~/mimic.

Display content of MIMIC-IV data folder:

ls -l ~/mimic
total 24
-rw-r--r--  1 504 rstudio 13332 Jan  5  2023 CHANGELOG.txt
drwxr-xr-x 25 504 rstudio   800 Jan 17 21:41 hosp
drwxr-xr-x 11 504 rstudio   352 Nov 24 19:34 icu
-rw-r--r--  1 504 rstudio  2518 Jan  6  2023 LICENSE.txt
-rw-r--r--  1 504 rstudio  2884 Jan  6  2023 SHA256SUMS.txt

4 Read CSVs and deposit to an SQLite database

Here, we will import only one CSV file, icustays.csv.gz, for demonstration purposes. Motivated students can write a Bash script for loading all MIMIC-IV data files into an SQLite database and contribute it to https://github.com/MIT-LCP/mimic-code.

Create an empty database file mimiciv.sqlite:

# bash command
touch mimiciv.sqlite

Deposit the icu/icustays.csv.gz file:

# bash command
# delete icustays table if exists
sqlite3 mimiciv.sqlite 'DROP TABLE IF EXISTS icustays;'

Create an empty icustays table with column data types. Because SQLite does not have a dedicated date-time data type (https://www.sqlite.org/datatype3.html), we store intime and outtime as TEXT.

# bash command
sqlite3 mimiciv.sqlite 'CREATE TABLE icustays (
  subject_id INTEGER,
  hadm_id INTEGER,
  stay_id INTEGER,
  first_careunit TEXT,
  last_careunit TEXT,
  intime TEXT,
  outtime TEXT,
  los REAL
  )'
# bash command
zcat < ~/mimic/icu/icustays.csv.gz | \
  tail -n +2 | \
  sqlite3 mimiciv.sqlite -csv '.import /dev/stdin icustays'
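
As an alternative to the command line, roughly the same import can be done entirely from R. The sketch below (object names are illustrative) reads the gzipped CSV with readr and deposits it with DBI's dbWriteTable(); unlike the streaming Bash approach, it loads the whole file into memory first, and it keeps intime and outtime as character so they are stored as TEXT.

# R alternative (a sketch): read the CSV into memory, then write it to SQLite
con_tmp <- dbConnect(RSQLite::SQLite(), dbname = "./mimiciv.sqlite")
icustays_df <- read_csv(
  "~/mimic/icu/icustays.csv.gz",
  col_types = cols(intime = col_character(), outtime = col_character())
)
dbWriteTable(con_tmp, "icustays", icustays_df, overwrite = TRUE)
dbDisconnect(con_tmp)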

5 Read data from database

Connect to the database mimiciv.sqlite and list the tables:

con <- dbConnect(
  RSQLite::SQLite(), 
  dbname = "./mimiciv.sqlite"
)
dbListTables(con)
[1] "icustays"

Read the table icustays:

icustays_tble <- tbl(con, "icustays") |> 
  print(width = Inf)
# Source:   table<icustays> [?? x 8]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
   subject_id  hadm_id  stay_id first_careunit                                  
        <int>    <int>    <int> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <chr>              
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <chr>               <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ more rows

How many rows?

icustays_tble |> 
  summarise(n = n()) |>
  show_query() |>
  print()
<SQL>
SELECT COUNT(*) AS `n`
FROM `icustays`
# Source:   SQL [1 x 1]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
      n
  <int>
1 73181
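
For comparison, the same count can be obtained by sending raw SQL through DBI, bypassing dbplyr entirely (a minimal sketch):

dbGetQuery(con, "SELECT COUNT(*) AS n FROM icustays;")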

6 Use dplyr with SQLite

Keep the first ICU stay for each patient:

icustays_subset <- icustays_tble |>
  # first ICU stay of each unique `subject_id`
  group_by(subject_id) |>
  slice_min(intime) |>
  ungroup() |>
  # arrange(intime, .by_group = TRUE) |>
  # slice_head(n = 1) |>
  # left_join(icustays_tble, by = c("subject_id", "intime")) |>
  show_query() |>
  print(width = Inf)
<SQL>
SELECT
  `subject_id`,
  `hadm_id`,
  `stay_id`,
  `first_careunit`,
  `last_careunit`,
  `intime`,
  `outtime`,
  `los`
FROM (
  SELECT
    `icustays`.*,
    RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
  FROM `icustays`
) AS `q01`
WHERE (`col01` <= 1)
# Source:   SQL [?? x 8]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
   subject_id  hadm_id  stay_id first_careunit                                  
        <int>    <int>    <int> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 5   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 6   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 7   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
 8   10002348 22725460 32610785 Neuro Intermediate                              
 9   10002428 28662225 33987268 Medical Intensive Care Unit (MICU)              
10   10002430 26295318 38392119 Coronary Care Unit (CCU)                        
   last_careunit                                    intime             
   <chr>                                            <chr>              
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 5 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 6 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 7 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
 8 Neuro Intermediate                               2112-11-30 23:24:00
 9 Medical Intensive Care Unit (MICU)               2156-04-12 16:24:18
10 Coronary Care Unit (CCU)                         2129-06-13 00:43:08
   outtime               los
   <chr>               <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2110-04-12 23:59:56 1.34 
 5 2131-01-20 08:27:30 9.17 
 6 2160-05-19 17:33:33 1.31 
 7 2129-08-10 17:02:38 6.18 
 8 2112-12-10 18:25:13 9.79 
 9 2156-04-17 15:57:08 4.98 
10 2129-06-15 22:51:40 2.92 
# ℹ more rows

How many rows in icustays_subset?

icustays_subset |>
  summarise(n = n()) |>
  show_query() |>
  print()
<SQL>
SELECT COUNT(*) AS `n`
FROM (
  SELECT
    `subject_id`,
    `hadm_id`,
    `stay_id`,
    `first_careunit`,
    `last_careunit`,
    `intime`,
    `outtime`,
    `los`
  FROM (
    SELECT
      `icustays`.*,
      RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
    FROM `icustays`
  ) AS `q01`
  WHERE (`col01` <= 1)
) AS `q01`
# Source:   SQL [1 x 1]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
      n
  <int>
1 50920
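
An alternative way to express the same subsetting (a sketch) is to filter on the group-wise minimum intime; dbplyr translates the aggregate inside filter() into a window function, and ties are kept just as with slice_min():

icustays_tble |>
  group_by(subject_id) |>
  filter(intime == min(intime, na.rm = TRUE)) |>
  ungroup() |>
  show_query()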

7 SQL query

show_query() shows the SQL query that the dplyr pipeline is translated into.

class(icustays_subset)
[1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
[4] "tbl_lazy"             "tbl"                 
show_query(icustays_subset)
<SQL>
SELECT
  `subject_id`,
  `hadm_id`,
  `stay_id`,
  `first_careunit`,
  `last_careunit`,
  `intime`,
  `outtime`,
  `los`
FROM (
  SELECT
    `icustays`.*,
    RANK() OVER (PARTITION BY `subject_id` ORDER BY `intime`) AS `col01`
  FROM `icustays`
) AS `q01`
WHERE (`col01` <= 1)
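
If you want the generated SQL itself, for example to save it or run it elsewhere, dbplyr::remote_query() returns the same query that show_query() prints (a small sketch; the object name is illustrative):

# capture the SQL behind the lazy table
sql_text <- remote_query(icustays_subset)
sql_text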

8 Transform in database, plot in R

ggplot2 renders the plot in R, but the data transformation is done in the database. Whenever possible, do the data wrangling (filter, select, group_by, summarise) in the database and pull only the (much smaller) result into R for plotting.

icustays_tble |>
  group_by(subject_id) |>
  summarise(n = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = n)) + 
  labs(x = "# ICU stays of a patient")

9 SQL translation

The dbplyr package (the dplyr backend for databases) has a function, translate_sql(), that lets you experiment with how R expressions are translated to SQL:

translate_sql(x == 1 & (y < 2 | z > 3))
translate_sql(x ^ 2 < 10)
translate_sql(x %% 2 == 10)
translate_sql(paste(x, y))
translate_sql(mean(x))
translate_sql(mean(x, na.rm = TRUE))
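
Depending on your dbplyr version, translate_sql() may require an explicit con argument so it knows which SQL dialect to target; a simulated connection such as simulate_sqlite() works without a live database (a sketch, assuming a recent dbplyr):

translate_sql(mean(x, na.rm = TRUE), con = simulate_sqlite())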

10 Timings

Let’s compare the timings of dplyr (in-memory tibble) and dbplyr (on-disk SQLite database).

  • dplyr using tibble:
csvfile <- "~/mimic/icu/icustays.csv.gz"
icustays_tibble <- read_csv(csvfile)
timing_tibble <- icustays_tibble |>
    group_by(subject_id) |>
    summarize(n = n()) |>
    system.time()
timing_tibble
   user  system elapsed 
  0.153   0.005   0.158 
  • dbplyr using SQLite:
icustays_sql <- tbl(con, "icustays")
timing_sql <- icustays_sql |>
  group_by(subject_id) |>
  summarize(n = n()) |>
  system.time()
timing_sql
   user  system elapsed 
  0.011   0.000   0.011 

SQLite (0.011 seconds) appears to be much faster than the tibble (0.158 seconds). But SQLite is disk-based, while the tibble is in memory. What explains the discrepancy?

11 Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

  • When building a query, often we don’t want the entire table. We want just enough to check if our query is working.

  • Since we would rather run one complex query than many simple queries, laziness allows verbs to be strung together and translated into a single query.

  • Therefore, by default dbplyr

    • won’t connect to and query the database until absolutely necessary (e.g., to show output),

    • and, unless explicitly told otherwise, will only fetch a handful of rows to give a sense of what the result will look like.

icustays_sql |>
  group_by(subject_id) |>
  summarize(n = n())
# Source:   SQL [?? x 2]
# Database: sqlite 3.44.2 [/home/rstudio/203b-lecture/slides/12-dbplyr/mimiciv.sqlite]
   subject_id     n
        <int> <int>
 1   10000032     1
 2   10000980     1
 3   10001217     2
 4   10001725     1
 5   10001884     1
 6   10002013     1
 7   10002155     3
 8   10002348     1
 9   10002428     4
10   10002430     1
# ℹ more rows
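
To make this preview behavior explicit (a sketch), you can request a fixed number of rows yourself with head(), which dbplyr turns into a LIMIT clause:

icustays_sql |>
  group_by(subject_id) |>
  summarize(n = n()) |>
  head(5) |>
  show_query()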

12 Full query

To force a full query and return the complete result, use the collect() function.

icustays_sql |>
  group_by(subject_id) |>
  summarize(n = n()) |>
  collect() |>
  system.time()
   user  system elapsed 
  0.035   0.001   0.036 
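
After collect(), the result is an ordinary in-memory tibble that no longer depends on the database connection (a sketch; the object name is illustrative):

icustays_counts <- icustays_sql |>
  group_by(subject_id) |>
  summarize(n = n()) |>
  collect()
class(icustays_counts)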

13 Close connection to database

dbDisconnect(con)

14 DuckDB vs SQLite

As modern data scientists, we should all start using DuckDB (https://duckdb.org/) instead of SQLite for analytical work. DuckDB is a modern, embeddable SQL OLAP database management system designed for analytical workloads. It uses a column-store architecture, compresses data well, and is very fast to query on modern hardware, making it a great alternative to SQLite for analytics.
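
The DBI/dbplyr workflow for DuckDB mirrors the SQLite one. Below is a minimal sketch (the file name mimiciv.duckdb is illustrative); DuckDB's read_csv_auto() can ingest the gzipped CSV directly, without a separate decompression step.

# connect to (or create) an on-disk DuckDB database
con_duck <- dbConnect(duckdb::duckdb(), dbdir = "mimiciv.duckdb")
# load icustays straight from the gzipped CSV
dbExecute(con_duck, sprintf(
  "CREATE OR REPLACE TABLE icustays AS SELECT * FROM read_csv_auto('%s')",
  path.expand("~/mimic/icu/icustays.csv.gz")
))
# query lazily with dplyr verbs, exactly as with SQLite
tbl(con_duck, "icustays") |>
  count() |>
  show_query()
dbDisconnect(con_duck, shutdown = TRUE)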

15 DuckDB vs BigQuery

Let’s now compare DuckDB with BigQuery. We will use the MIMIC-IV data for this comparison: calculate the average measurements of 5 vitals: 220045 (heart rate), 220179 (systolic blood pressure), 220180 (diastolic blood pressure), 220210 (respiratory rate), and 223761 (temperature in Fahrenheit).

open_dataset(
  sources = "chartevents_pq", 
  format = "parquet"
  ) |>
  # create a virtual table in DuckDB
  to_duckdb() |>
  # filter rows
  filter(itemid %in% c(220045L, 220179L, 220180L, 223761L, 220210L)) |>
  # group by and count
  group_by(itemid) |>
  summarise(mean = mean(valuenum, na.rm = TRUE)) |>
  # force computation
  collect() |>
  # pull item info
  left_join(
    read_csv("~/mimic/icu/d_items.csv.gz", show_col_types = FALSE) |> 
      select(itemid, label),
    by = "itemid"
  ) |>
  arrange(itemid) |>
  # print results
  print() |>
  system.time()
# A tibble: 5 × 3
  itemid  mean label                                
   <dbl> <dbl> <chr>                                
1 220045  87.8 Heart Rate                           
2 220179 121.  Non Invasive Blood Pressure systolic 
3 220180  66.4 Non Invasive Blood Pressure diastolic
4 220210  20.5 Respiratory Rate                     
5 223761  98.7 Temperature Fahrenheit               
   user  system elapsed 
 10.761   2.707   3.480 

Achieving this task in under 4 seconds is impressive.

How about BigQuery? We use the bigrquery package to connect to BigQuery and run the same query. The elapsed time is about 4 seconds, comparable to DuckDB, though the comparison may not be entirely fair: BigQuery may cache this particular query if it has been run before.

# BigQuery authentication using service account token
bq_auth(path = "biostat-203b-2024-winter-313290ce47a6.json")
# Connect to the BigQuery database `biostat-203b-2024-winter.mimic4_v2_2`
con_bq <- dbConnect(
    bigrquery::bigquery(),
    project = "biostat-203b-2024-winter",
    dataset = "mimic4_v2_2",
    billing = "biostat-203b-2024-winter"
)

dbListTables(con_bq)
 [1] "admissions"         "caregiver"          "chartevents"       
 [4] "d_hcpcs"            "d_icd_diagnoses"    "d_icd_procedures"  
 [7] "d_items"            "d_labitems"         "datetimeevents"    
[10] "diagnoses_icd"      "drgcodes"           "emar"              
[13] "emar_detail"        "hcpcsevents"        "icustays"          
[16] "ingredientevents"   "inputevents"        "labevents"         
[19] "microbiologyevents" "omr"                "outputevents"      
[22] "patients"           "pharmacy"           "poe"               
[25] "poe_detail"         "prescriptions"      "procedureevents"   
[28] "procedures_icd"     "provider"           "services"          
[31] "transfers"         
tbl(con_bq, "chartevents") |>
  # filter rows
  filter(itemid %in% c(220045L, 220179L, 220180L, 223761L, 220210L)) |>
  # group by and count
  group_by(itemid) |>
  summarise(mean = mean(valuenum, na.rm = TRUE)) |>
  # force computation
  collect() |>
  # pull item info
  left_join(
    read_csv("~/mimic/icu/d_items.csv.gz", show_col_types = FALSE) |> 
      select(itemid, label),
    by = "itemid"
  ) |>
  arrange(itemid) |>
  # print results
  print() |>
  system.time()
# A tibble: 5 × 3
  itemid  mean label                                
   <dbl> <dbl> <chr>                                
1 220045  87.8 Heart Rate                           
2 220179 121.  Non Invasive Blood Pressure systolic 
3 220180  66.4 Non Invasive Blood Pressure diastolic
4 220210  20.5 Respiratory Rate                     
5 223761  98.7 Temperature Fahrenheit               
   user  system elapsed 
  0.305   0.010   4.005 
if (exists("con_bq")) {
  dbDisconnect(con_bq)
}