Announcements

  • Survey result: 58% (Zoom) vs 42% (in-person). We’ll be back to in-person class from Feb 15.

  • Today’s office hours: 6:30pm-7:30pm.

  • Import big datasets. On my laptop (2.9 GHz Quad-Core Intel Core i7, 16GB RAM, SSD hard drive), import labevents.csv.gz (2.09GB).

    1. readr package in R took 205 seconds.
        read_csv(str_c(mimic_path, "/hosp/labevents.csv.gz"),
            col_select = c(subject_id, itemid, charttime, valuenum),
            col_types = cols_only(subject_id = col_double(), 
                                  itemid = col_double(), 
                                  charttime = col_datetime(), 
                                  valuenum = col_double()),
            lazy = TRUE) %>%
        semi_join(icustays_tble, by = c("subject_id")) %>%
        filter(itemid %in% dlabitems_tble$itemid) %>%
        print(width = Inf)
      
    2. data.table package in R took 195 seconds.
        fread(str_c(mimic_path, "/hosp/labevents.csv.gz"),
         select = c(
           subject_id = "numeric", 
           itemid = "numeric", 
           charttime = "POSIXct",
           valuenum = "numeric")
         ) %>%
        as_tibble() %>%
        semi_join(icustays_tble, by = c("subject_id")) %>%
        filter(itemid %in% dlabitems_tble$itemid) %>%
        print(width = Inf)
      
    3. CSV.jl and DataFrames.jl packages in Juia took 190 seconds.
        fn = "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0/hosp/labevents.csv.gz"
        item_list = [50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931, 50960, 50893]
        @time labevents_df = 
        open(GzipDecompressorStream, fn) do stream
       @pipe CSV.File(
           stream; 
           select = ["subject_id", "itemid", "charttime", "valuenum"],
           types = Dict(
               "subject_id" => Int,
               "itemid" => Int,
               "charttime" => DateTime,
               "valuenum" => Float64),
           dateformat = "yyyy-mm-dd HH:MM:SS"
           ) |> 
       DataFrame |>
       semijoin(_, icustays_df, on = :subject_id) |> 
       filter(row -> row.itemid  item_list, _)
        end
      
    4. Use bash commands awk to obtain filtered data files. It took about 900 seconds. Very memory efficient.
        zcat < labevents.csv.gz | 
        awk -F, '{OFS = ","} {if ($6 == 220045 || $6 == 220181 || $6 == 220179 || $6 == 223761 || $6 == 220210 || $6 == "itemid") print $1,$2,$3,$4,$6,$8}' | 
        gzip > labevents_filtered_itemid.csv.gz
      

Today

  • stringr

  • Web scraping

  • HW2