Announcements
-
Survey result: 58% (Zoom) vs 42% (in-person). We’ll be back to in-person class from Feb 15.
-
Today’s office hours: 6:30pm-7:30pm.
-
Import big datasets. On my laptop (2.9 GHz Quad-Core Intel Core i7, 16GB RAM, SSD hard drive), importing labevents.csv.gz (2.09GB):
- readr package in R took 205 seconds.
    read_csv(
      str_c(mimic_path, "/hosp/labevents.csv.gz"),
      col_select = c(subject_id, itemid, charttime, valuenum),
      col_types = cols_only(
        subject_id = col_double(),
        itemid = col_double(),
        charttime = col_datetime(),
        valuenum = col_double()
      ),
      lazy = TRUE
    ) %>%
      semi_join(icustays_tble, by = c("subject_id")) %>%
      filter(itemid %in% dlabitems_tble$itemid) %>%
      print(width = Inf)
- data.table package in R took 195 seconds.
    fread(
      str_c(mimic_path, "/hosp/labevents.csv.gz"),
      select = c(
        subject_id = "numeric",
        itemid = "numeric",
        charttime = "POSIXct",
        valuenum = "numeric"
      )
    ) %>%
      as_tibble() %>%
      semi_join(icustays_tble, by = c("subject_id")) %>%
      filter(itemid %in% dlabitems_tble$itemid) %>%
      print(width = Inf)
- CSV.jl and DataFrames.jl packages in Julia took 190 seconds.
    fn = "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-1.0/hosp/labevents.csv.gz"
    item_list = [50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931, 50960, 50893]
    @time labevents_df = open(GzipDecompressorStream, fn) do stream
        @pipe CSV.File(
            stream;
            select = ["subject_id", "itemid", "charttime", "valuenum"],
            types = Dict(
                "subject_id" => Int,
                "itemid" => Int,
                "charttime" => DateTime,
                "valuenum" => Float64),
            dateformat = "yyyy-mm-dd HH:MM:SS"
        ) |>
        DataFrame |>
        semijoin(_, icustays_df, on = :subject_id) |>
        filter(row -> row.itemid ∈ item_list, _)
    end
- Use the bash command awk to obtain filtered data files. It took about 900 seconds but is very memory efficient, since the file is streamed line by line rather than loaded into memory.
    zcat < labevents.csv.gz |
      awk -F, '{OFS = ","} {if ($6 == 220045 || $6 == 220181 || $6 == 220179 || $6 == 223761 || $6 == 220210 || $6 == "itemid") print $1,$2,$3,$4,$6,$8}' |
      gzip > labevents_filtered_itemid.csv.gz
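To see the awk approach on something small, here is a minimal sketch on a made-up three-row file. The file names, column layout, and values are hypothetical toy data, not MIMIC. It assumes `gzip`, `awk`, and `mktemp` are available, and it uses `gzip -dc` in place of `zcat <` only for portability.

```shell
# Build a tiny gzipped CSV with a header (hypothetical toy data, not MIMIC).
tmpdir=$(mktemp -d)
printf 'subject_id,itemid,valuenum\n1,50912,0.9\n2,99999,5.0\n3,50971,4.1\n' |
  gzip > "$tmpdir/toy.csv.gz"

# Decompress as a stream, keep the header row plus rows whose 2nd field
# (itemid) matches, print only columns 1 and 3, and re-compress.
# Each stage handles one line at a time, so memory use stays small.
gzip -dc "$tmpdir/toy.csv.gz" |
  awk -F, 'BEGIN {OFS = ","} NR == 1 || $2 == 50912 || $2 == 50971 {print $1, $3}' |
  gzip > "$tmpdir/toy_filtered.csv.gz"

# Inspect the result.
gzip -dc "$tmpdir/toy_filtered.csv.gz"
```

On the toy data this prints the header line followed by the rows for subjects 1 and 3. Matching the header with `NR == 1` is an alternative to comparing a field against the literal column name, as the command above does with `$6 == "itemid"`.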
Today
-
stringr
-
Web scraping
-
HW2