Biostat 203B Homework 1

Due Jan 27 @ 11:59PM

Author

Your Name and UID

Display machine information for reproducibility:

sessionInfo()

1 Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).

  2. Create a private repository biostat-203b-2023-winter and add Hua-Zhou and tomokiokuno0528 as your collaborators with write permission.

  3. Top directories of the repository should be hw1, hw2, … Maintain two branches, master and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The master branch will be your presentation area. Submit your homework files (the Quarto file qmd, the html file converted by Quarto, and all code and extra data sets needed to reproduce results) in the master branch.

  4. After each homework due date, the course reader and instructor will check out your master branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.

  5. After this course, you can make this repository public and use it to demonstrate your skill set on the job market.
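The branch-and-tag workflow described above can be rehearsed in a throwaway local repository before touching the real one. The following is only a sketch under stated assumptions: the temporary directory, the file names, the commit messages, and the placeholder identity are all hypothetical, and merging develop into master with --no-ff is one reasonable way to move work across, not a course requirement.

```shell
# Rehearse the workflow in a scratch repository (hypothetical location).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is master
git config user.name "Your Name"           # placeholder identity for this demo
git config user.email "you@example.com"

mkdir hw1
echo "# Homework 1" > hw1/README.md        # hypothetical placeholder file
git add hw1
git commit -q -m "Set up hw1 directory"

git checkout -q -b develop                 # day-to-day work happens on develop
echo "draft" > hw1/hw1.qmd                 # hypothetical report file
git add hw1/hw1.qmd
git commit -q -m "Draft hw1 report"

# When the homework is ready, bring it into master and tag the submission.
git checkout -q master
git merge -q --no-ff -m "Merge hw1 solution from develop" develop
git tag hw1
git tag --list

# With a GitHub remote configured, the branches and the tag would be published with:
# git push origin master develop hw1
```

Because the tagging time counts as the submission time, it makes sense to create the tag only after the final merge into master.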

2 Q2. Data ethics training

This exercise (and later ones in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple of hours and the PhysioNet credentialing takes a couple of days; do not leave it to the last minute.)

3 Q3. Linux Shell Commands

  1. The ~/mimic folder within the Docker container contains data sets from MIMIC-IV. Refer to the documentation at https://mimic.mit.edu/docs/iv/ for details of the data files.
ls -l ~/mimic

Please do not put these data files into Git; they are big. Do not copy them into your directory, and do not decompress the gz data files. Doing so creates unnecessarily large files on storage and is not a big-data-friendly practice. Just read from the data folder ~/mimic directly in the following exercises.
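As an illustration of reading a compressed file in place, the sketch below builds a tiny stand-in file and streams it with zcat, so no decompressed copy ever lands on disk. The temporary directory, file name, columns, and values are made up for the demonstration and have nothing to do with the MIMIC-IV files; it assumes a Linux environment such as the Docker container, where zcat ships with gzip.

```shell
# Toy stand-in for a compressed data file (hypothetical columns and values).
tmpdir=$(mktemp -d)
printf 'id,group\n1,a\n2,b\n3,a\n' | gzip > "$tmpdir/toy.csv.gz"

# Stream the decompressed text to stdout; nothing is written to disk.
zcat "$tmpdir/toy.csv.gz" | head -n 2

# Count lines (header included) without ever creating toy.csv.
n_lines=$(zcat "$tmpdir/toy.csv.gz" | wc -l)
echo "lines: $n_lines"

# The directory still holds only the compressed file.
ls "$tmpdir"
```

The same pattern applies to any file under ~/mimic: pipe the output of zcat into the next command instead of saving it.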

Use Bash commands to answer the following questions.

  1. Display the contents of the folders core, hosp, and icu. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.

  2. Briefly describe what bash commands zcat, zless, zmore, and zgrep do.

  3. What’s the output of the following bash script?

for datafile in ~/mimic/core/*.gz
do
  ls -l "$datafile"
done

Display the number of lines in each data file using a similar loop.

  4. Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? (Hint: combine Linux commands zcat, head/tail, awk, sort, uniq, wc, and so on.)

  5. What are the possible values taken by each of the variables admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on.)
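The command combinations hinted at above can be tried first on a toy file. The sketch below is only an illustration under assumptions: the file, its two columns (id and color), and its values are invented, and the column positions passed to awk must be adapted to whatever file you actually work with.

```shell
# Toy compressed CSV with hypothetical columns; this is not MIMIC data.
toy=$(mktemp -d)/toy.csv.gz
printf 'id,color\n1,red\n2,blue\n1,red\n3,red\n' | gzip > "$toy"

# Number of distinct ids: drop the header, keep column 1, deduplicate, count.
zcat "$toy" | tail -n +2 | awk -F, '{ print $1 }' | sort | uniq | wc -l

# Frequency of each color: uniq -c only merges adjacent lines, so sort first.
zcat "$toy" | tail -n +2 | awk -F, '{ print $2 }' | sort | uniq -c
```

On this toy input the first pipeline reports three distinct ids, and the second reports blue once and red three times. Splitting on commas with awk -F, is only safe when no field contains a quoted comma, which is worth checking before trusting the counts on real data.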

5 Q5. More fun with Linux

Try the following commands in Bash and interpret the results: cal, cal 2021, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
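For the brace-expansion command in the list, a smaller example may make the mechanism easier to see. This is a sketch with made-up strings and a hypothetical scratch directory, and it assumes the shell is Bash (other shells such as dash do not perform brace expansion).

```shell
# Each brace group contributes one choice; Bash emits every combination,
# varying the rightmost group fastest.
echo {a,b}{1,2}
# prints: a1 a2 b1 b2

# The same mechanism can create several directories at once (hypothetical names).
demo=$(mktemp -d)
mkdir "$demo"/hw{1,2,3}
ls "$demo"
```

Brace expansion is purely textual and happens before the command runs, so echo and mkdir simply receive the already expanded list of words.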