Hua Zhou | 周华 | Lecture 9

Announcements

Survey result: 58% (Zoom) vs 42% (in-person). We’ll be back to in-person class from Feb 15.
Today’s office hours: 6:30pm-7:30pm.

Import big datasets. On my laptop (2.9 GHz Quad-Core Intel Core i7, 16GB RAM, SSD hard drive), import labevents.csv.gz (2.09GB).

readr package in R took 205 seconds.

  read_csv(str_c(mimic_path, "/hosp/labevents.csv.gz"),
      col_select = c(subject_id, itemid, charttime, valuenum),
      col_types = cols_only(subject_id = col_double(), 
                            itemid = col_double(), 
                            charttime = col_datetime(), 
                            valuenum = col_double()),
      lazy = TRUE) %>%
  semi_join(icustays_tble, by = c("subject_id")) %>%
  filter(itemid %in% dlabitems_tble$itemid) %>%
  print(width = Inf)

data.table package in R took 195 seconds.

  fread(str_c(mimic_path, "/hosp/labevents.csv.gz"),
   select = c(
     subject_id = "numeric", 
     itemid = "numeric", 
     charttime = "POSIXct",
     valuenum = "numeric")
   ) %>%
  as_tibble() %>%
  semi_join(icustays_tble, by = c("subject_id")) %>%
  filter(itemid %in% dlabitems_tble$itemid) %>%
  print(width = Inf)

CSV.jl and DataFrames.jl packages in Juia took 190 seconds.

  fn = "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0/hosp/labevents.csv.gz"
  item_list = [50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931, 50960, 50893]
  @time labevents_df = 
  open(GzipDecompressorStream, fn) do stream
 @pipe CSV.File(
     stream; 
     select = ["subject_id", "itemid", "charttime", "valuenum"],
     types = Dict(
         "subject_id" => Int,
         "itemid" => Int,
         "charttime" => DateTime,
         "valuenum" => Float64),
     dateformat = "yyyy-mm-dd HH:MM:SS"
     ) |> 
 DataFrame |>
 semijoin(_, icustays_df, on = :subject_id) |> 
 filter(row -> row.itemid ∈ item_list, _)
  end

Use bash commands awk to obtain filtered data files. It took about 900 seconds. Very memory efficient.

  zcat < labevents.csv.gz | 
  awk -F, '{OFS = ","} {if ($6 == 220045 || $6 == 220181 || $6 == 220179 || $6 == 223761 || $6 == 220210 || $6 == "itemid") print $1,$2,$3,$4,$6,$8}' | 
  gzip > labevents_filtered_itemid.csv.gz

Today

stringr
Web scraping
HW2

Lecture 9 01 Feb 2022

Announcements

Today