Biostat 203B Homework 1
Due Jan 26, 2024 @ 11:59PM
Display machine information for reproducibility:
sessionInfo()
Q1. Git/GitHub
No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
Create a private repository biostat-203b-2024-winter and add Hua-Zhou and the TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.
Top directories of the repository should be hw1, hw2, … Maintain two branches main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.
After each homework due date, the course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
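The branch-and-tag workflow described above can be sketched locally as follows. This is only an illustration in a throwaway directory: the file contents, commit messages, and user identity are placeholders, and the use of an annotated tag (git tag -a) is an assumption, since the assignment does not say which kind of tag to use. Pushing to GitHub is left out because the sketch has no remote.

```shell
# Sketch in a temporary directory; all names and messages are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Student Name"
git config user.email "student@example.com"

# First commit, then make sure the presentation branch is called main.
mkdir hw1
echo "# Homework 1" > hw1/hw1.qmd
git add hw1
git commit -q -m "Start hw1 report"
git branch -M main

# Do the day-to-day work on develop.
git checkout -q -b develop
echo "Draft answer" >> hw1/hw1.qmd
git commit -q -am "Draft Q1 answer"

# When ready to submit, bring main up to date and tag the submission.
git checkout -q main
git merge -q develop
git tag -a hw1 -m "Homework 1 submission"   # annotated tag: an assumption
git --no-pager log --oneline --decorate
```

With a real GitHub remote configured, the branches and the tag would still need to be pushed before they are visible to graders.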
Q2. Data ethics training
This exercise, like later ones in this course, uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple of days; do not leave it to the last minute.)
Q3. Linux Shell Commands
- Make the MIMIC-IV v2.2 data available at location ~/mimic.
ls -l ~/mimic/
Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of the data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. Doing so creates unnecessarily big files and is not a big-data-friendly practice. Read from the data folder ~/mimic directly in the following exercises.
Use Bash commands to answer the following questions.
Display the contents of the folders hosp and icu using the Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Briefly describe what the Bash commands zcat, zless, zmore, and zgrep do.
(Looping in Bash) What’s the output of the following Bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done
Display the number of lines in each data file using a similar loop. (Hint: combine Linux commands zcat < and wc -l.)
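The hint can be illustrated on toy data. The sketch below builds two small gzipped files in a temporary directory (the file names and contents are made up, not part of MIMIC-IV) and counts their lines without writing any decompressed copy to disk.

```shell
# Toy illustration only: the directory and file names below are made up.
tmpdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' | gzip > "$tmpdir/toy_a.csv.gz"
printf 'id,value\n1,x\n' | gzip > "$tmpdir/toy_b.csv.gz"

for datafile in "$tmpdir"/toy_*.csv.gz
do
  # Decompress to stdout and count lines; the header line is included.
  nlines=$(zcat < "$datafile" | wc -l)
  echo "$(basename "$datafile"): $nlines"
done
```

Reading through zcat < FILE rather than zcat FILE follows the hint's form; on some systems plain zcat expects a different file suffix, so the redirect is the more portable spelling.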
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)
What are the possible values taken by each of the variables admission_type, admission_location, insurance, and ethnicity (the column is named race in MIMIC-IV v2.2)? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)
To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare the compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/hosp/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade-off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
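The hints about chaining zcat, tail, awk, sort, and uniq -c can be sketched on a toy table. The file, its column names, and its values below are hypothetical stand-ins, not MIMIC-IV data, and the comma-based field splitting assumes no field contains a quoted comma.

```shell
# Toy gzipped CSV standing in for a real table; column names are hypothetical.
toy=$(mktemp)
printf 'subject_id,kind\n10,URGENT\n10,ELECTIVE\n11,URGENT\n12,URGENT\n' | gzip > "$toy"

# Number of distinct values in column 1, skipping the header line.
n_subjects=$(zcat < "$toy" | tail -n +2 | awk -F, '{print $1}' | sort -u | wc -l)
echo "distinct subject_id: $n_subjects"

# Count per distinct value in column 2. Sort first, because uniq only
# collapses duplicates that are adjacent.
kind_counts=$(zcat < "$toy" | tail -n +2 | awk -F, '{print $2}' | sort | uniq -c)
echo "$kind_counts"
```

For real tables whose fields may contain commas inside quotes, a plain awk -F, split can miscount columns, so the result is worth spot-checking against the header.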
Q4. Who’s popular in Pride and Prejudice
- You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save it to your local folder.
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
Explain what wget -nc does. Do not put this text file pg42671.txt in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
  echo $char:
  # some bash commands here
done
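One possible way to count occurrences of a word, shown here on a made-up two-line text rather than the novel, is to combine grep -o with wc -l. The names Anna and Ben and the sample sentences are placeholders; this is a sketch of one approach, not the required solution.

```shell
# Toy text, not the novel; the names are placeholders.
toytxt=$(mktemp)
printf 'Anna met Ben. Anna and Anna left.\nBen stayed.\n' > "$toytxt"

for name in Anna Ben
do
  # grep -o prints each match on its own line, so wc -l counts
  # occurrences rather than matching lines.
  count=$(grep -o "$name" "$toytxt" | wc -l)
  echo "$name: $count"
done
```

Note the difference from grep -c, which counts matching lines: on this toy text it would report 1 for Anna, whereas the pipeline above counts all three mentions.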
- What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
- Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Use chmod to make the file executable by the owner, and run
./middle.sh pg42671.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Q5. More fun with Linux
Try the following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
Q6. Book
Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine.
Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)
The point of this exercise is (1) to get the book for free and (2) to see an example of how a complicated project such as a book can be organized in a reproducible way.
For grading purposes, include a screenshot of Section 4.1.5 of the book here.