Linux Basics

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

January 10, 2023

Display machine information for reproducibility:

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.31     lifecycle_1.0.3   jsonlite_1.8.4    magrittr_2.0.3   
 [5] evaluate_0.19     rlang_1.0.6       stringi_1.7.8     cli_3.5.0        
 [9] rstudioapi_0.14   vctrs_0.5.1       rmarkdown_2.19    tools_4.2.2      
[13] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[17] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[21] knitr_1.41       

1 Preface

  • This html is rendered from linux.qmd on Linux Ubuntu 22.04 (jammy).
    • Mac users can render linux.qmd directly. Some tools such as tree and locate need to be installed (follow the error messages).

    • Windows users need to install WSL (Windows Subsystem for Linux) to render linux.qmd using Ubuntu. Some tools such as tree and locate need to be installed (follow the error messages).

    • Both Mac and Windows users can also use Docker to render linux.qmd within a Ubuntu container.

  • In this lecture, most code chunks are bash commands instead of R code.

2 Why Linux

Linux is the most common platform for scientific computing and deployment of data science tools.

  • Open source and community support.

  • Things break; when they break using Linux, it’s easy to fix.

  • Scalability: portable devices (Android), laptops, servers, clusters, and supercomputers.

    • E.g., the UCLA Hoffman2 cluster runs on Linux; most machines in the cloud (AWS, Azure, GCP) run on Linux.
  • Cost: it’s free!

3 Distributions of Linux

  • Debian/Ubuntu is a popular choice for personal computers.

  • RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)

  • UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).

  • MacOS is built on Darwin, a Unix-like system with BSD roots; it is not derived from Linux. It is POSIX compliant. Most shell commands we review here apply to MacOS terminal as well. Windows/DOS, unfortunately, is a totally different breed.

  • Show operating system (OS) type:

echo $OSTYPE
linux-gnu
  • Show distribution/version on Linux:
# only on Linux terminal
cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Show distribution/version on MacOS:
# only on Mac terminal
sw_vers -productVersion

or

# only on Mac terminal
system_profiler SPSoftwareDataType
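
Since the version commands differ between Linux and MacOS, a script that must run on both can branch on the kernel name. The following is a minimal sketch, assuming uname is available; the variable name os_kind is made up for illustration.

```bash
#!/usr/bin/env bash
# Pick the right version command based on the kernel name reported by uname.
case "$(uname -s)" in
  Linux)
    os_kind="linux"
    # /etc/os-release is present on most modern distributions
    [ -r /etc/os-release ] && grep '^PRETTY_NAME=' /etc/os-release
    ;;
  Darwin)
    os_kind="mac"
    sw_vers -productVersion
    ;;
  *)
    os_kind="other"
    ;;
esac
echo "Detected OS kind: $os_kind"
```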

4 Linux shells

4.1 Shells

  • A shell translates commands to OS instructions.

  • Most commonly used shells include bash, csh, tcsh, zsh, etc.

  • The default shell in MacOS has been zsh (previously bash) since MacOS v10.15.

  • Sometimes a command or a script does not run simply because it’s written for another shell.

  • We mostly use bash shell commands in this class.

  • Determine the current shell:

echo $SHELL
/bin/bash
  • List available shells:
cat /etc/shells
# /etc/shells: valid login shells
/bin/sh
/bin/bash
/usr/bin/bash
/bin/rbash
/usr/bin/rbash
/usr/bin/sh
/bin/dash
/usr/bin/dash
  • Change to another shell:
```{bash}
#| eval: false
exec bash -l
```

The -l option indicates it should be a login shell.

  • Change your login shell permanently:
chsh -s /bin/bash [USERNAME]

Then log out and log in.

4.2 Command history and bash completion

We can navigate to previous/next commands with the up and down arrow keys, or list the command history with the history command. (pushd and popd maintain a directory stack, not a command history.)

Bash provides the following standard completion for the Linux users by default. Far fewer typing errors and much less time!

  • Pathname completion.

  • Filename completion.

  • Variablename completion: echo $[TAB][TAB].

  • Username completion: cd ~[TAB][TAB].

  • Hostname completion: ssh huazhou@[TAB][TAB].

  • It can also be customized to auto-complete other stuff such as options and command’s arguments. Google bash completion for more information.

4.3 man is man’s best friend

Online help for shell commands: man [COMMANDNAME].

# display the first 30 lines of documentation for the ls command
man ls | head -30
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, including manpages, you can run the 'unminimize'
command. You will still need to ensure the 'man-db' package is installed.

6 Work with text files

6.1 View/peek text files

  • cat prints the contents of a file:
cat runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • head prints the first 10 lines of a file:
head runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }

head -l prints the first \(l\) lines of a file:

head -15 runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}
  • tail prints the last 10 lines of a file:
tail runSim.R
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)

tail -l prints the last \(l\) lines of a file:

tail -15 runSim.R
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • Questions:
    • How to see the 11th line of the file and nothing else?
    • What about the 11th to the last line?

6.2 Piping and redirection

  • | sends output from one command as input of another command.
ls -l | head -5
total 2436
-rw-r--r-- 1 rstudio rstudio    258 Jan 16  2020 autoSim.R
-rw-r--r-- 1 rstudio rstudio 110345 Jan 11  2015 Emacs_Reference_Card.pdf
-rw-r--r-- 1 rstudio rstudio 157353 Jan  4  2019 IDRE_Winter_2019_Workshops.pdf
-rw-r--r-- 1 rstudio rstudio 321281 Jan 10  2018 key_authentication_1.png
  • > directs output from one command to a file.

  • >> appends output from one command to a file.

  • < reads input from a file.

  • Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.

  • See HW1.
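
The operators above can be tried out safely in a scratch directory. This is a small sketch; the file names fruits.txt and sorted.txt are made up for illustration.

```bash
#!/usr/bin/env bash
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# '>' creates (or overwrites) a file with the command's output
printf 'banana\napple\n' > fruits.txt
# '>>' appends to the end of the file
printf 'cherry\n' >> fruits.txt
# '<' feeds the file to the command's standard input
sort < fruits.txt > sorted.txt
# '|' chains commands: count the lines that contain the letter "a"
n_with_a=$(grep 'a' fruits.txt | wc -l)

cat sorted.txt
echo "Lines containing 'a': $n_with_a"
```

Note that > silently overwrites an existing file, so >> is the safer choice when accumulating output across several commands.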

6.3 less is more; more is less

  • more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.

  • less is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.

  • less doesn’t need to read the whole file, i.e., it loads files faster than more.

6.4 grep

grep prints lines that match an expression:

  • Show lines that contain string CentOS:
# quotes not necessary if not a regular expression
grep 'CentOS' linux.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Search multiple text files:
grep 'CentOS' *.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Show matching line numbers:
grep -n 'CentOS' linux.qmd
50:- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
56:- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
352:- Show lines that contain string `CentOS`:
355:grep 'CentOS' linux.qmd
360:grep 'CentOS' *.qmd
365:grep -n 'CentOS' linux.qmd
382:- Replace `CentOS` by `RHEL` in a text file:
384:sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Find all files in current directory with .png extension:
# escape the dot so that it matches a literal period
ls | grep '\.png$'
key_authentication_1.png
key_authentication_2.png
linux_directory_structure.png
linux_filepermission_oct.png
linux_filepermission.png
redhat_kills_centos.png
Richard_Stallman_2013.png
screenshot_top.png
  • Find all directories in the current directory:
ls -al | grep '^d'
drwxr-xr-x 23 rstudio rstudio    736 Jan 10 08:13 .
drwxr-xr-x 22 rstudio rstudio    704 Mar  9  2022 ..
drwxr-xr-x  3 rstudio rstudio     96 Jan 10 08:12 linux_files
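
In a regular expression an unescaped dot matches any single character, which can produce surprising matches. The sketch below illustrates the difference on a made-up list of file names (names.txt is hypothetical), using the standard grep options -c (count matching lines) and -i (ignore case).

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf 'plot.png\nnotes_png\nsummary.PNG\ndata.csv\n' > "$workdir/names.txt"

# unescaped dot matches any single character, so notes_png also matches
loose=$(grep -c '.png$' "$workdir/names.txt")
# escaped dot matches a literal period only
strict=$(grep -c '\.png$' "$workdir/names.txt")
# -i ignores case, so summary.PNG is matched as well
nocase=$(grep -ci '\.png$' "$workdir/names.txt")
echo "loose=$loose strict=$strict nocase=$nocase"
```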

6.5 sed

  • sed is a stream editor.

  • Replace CentOS by RHEL in a text file:

sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
- RHEL/RHEL is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs RHEL 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `RHEL`:
grep 'RHEL' linux.qmd
grep 'RHEL' *.qmd
grep -n 'RHEL' linux.qmd
- Replace `RHEL` by `RHEL` in a text file:
sed 's/RHEL/RHEL/' linux.qmd | grep RHEL
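
Notice in the first output line above that only the first CentOS on the line was replaced. By default the s command substitutes the first match per line; the g flag substitutes all of them. A small sketch on a made-up sample line:

```bash
#!/usr/bin/env bash
line='CentOS one, CentOS two'

# without a flag, only the first match on each line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')
# with the g (global) flag, every match on the line is replaced
all=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')

echo "$first_only"
echo "$all"
```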

6.6 awk

  • awk is a filter and report writer.

  • First let’s display the content of the file /etc/passwd:

cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/run/ircd:/usr/sbin/nologin
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
_apt:x:100:65534::/nonexistent:/usr/sbin/nologin
rstudio-server:x:999:999::/home/rstudio-server:/bin/sh
rstudio:x:1000:1000::/home/rstudio:/bin/bash
systemd-network:x:101:102:systemd Network Management,,,:/run/systemd:/usr/sbin/nologin
systemd-resolve:x:102:103:systemd Resolver,,,:/run/systemd:/usr/sbin/nologin
messagebus:x:103:104::/nonexistent:/usr/sbin/nologin

Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user ID info, (6) home directory, and (7) command shell, separated by :.

  • Print sorted list of login names:
awk -F: '{ print $1 }' /etc/passwd | sort | head -10
_apt
backup
bin
daemon
games
gnats
irc
list
lp
mail
  • Print number of lines in a file, as NR stands for Number of Records:
awk 'END { print NR }' /etc/passwd
24

or

wc -l /etc/passwd
24 /etc/passwd

or (not displaying file name)

wc -l < /etc/passwd
24
  • Print entries with UID in range 1000-1047:
awk -F: '{if ($3 >= 1000 && $3 <= 1047) print}' /etc/passwd
rstudio:x:1000:1000::/home/rstudio:/bin/bash
  • Print login names and log-in shells in comma-separated format:
awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
root,/bin/bash
daemon,/usr/sbin/nologin
bin,/usr/sbin/nologin
sys,/usr/sbin/nologin
sync,/bin/sync
games,/usr/sbin/nologin
man,/usr/sbin/nologin
lp,/usr/sbin/nologin
mail,/usr/sbin/nologin
news,/usr/sbin/nologin
uucp,/usr/sbin/nologin
proxy,/usr/sbin/nologin
www-data,/usr/sbin/nologin
backup,/usr/sbin/nologin
list,/usr/sbin/nologin
irc,/usr/sbin/nologin
gnats,/usr/sbin/nologin
nobody,/usr/sbin/nologin
_apt,/usr/sbin/nologin
rstudio-server,/bin/sh
rstudio,/bin/bash
systemd-network,/usr/sbin/nologin
systemd-resolve,/usr/sbin/nologin
messagebus,/usr/sbin/nologin
  • Print login names and indicate those with UID>=1000 as vip:
awk -F: -v status="" '{OFS = ","} 
{if ($3 >= 1000) status="vip"; else status="regular"} 
{print $1, status}' /etc/passwd
root,regular
daemon,regular
bin,regular
sys,regular
sync,regular
games,regular
man,regular
lp,regular
mail,regular
news,regular
uucp,regular
proxy,regular
www-data,regular
backup,regular
list,regular
irc,regular
gnats,regular
nobody,vip
_apt,regular
rstudio-server,regular
rstudio,vip
systemd-network,regular
systemd-resolve,regular
messagebus,regular
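
awk also supports BEGIN and END blocks and associative arrays, which makes simple reports easy. The sketch below tallies users per login shell. To stay self-contained it runs on a four-line sample in /etc/passwd format (copied from the listing above) instead of the real file.

```bash
#!/usr/bin/env bash
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
rstudio:x:1000:1000::/home/rstudio:/bin/bash
EOF

# BEGIN runs once before any input is read: set the separators there.
# count[$7]++ tallies users per login shell; END prints the tallies.
shell_counts=$(awk 'BEGIN { FS = ":"; OFS = "," }
  { count[$7]++ }
  END { for (s in count) print s, count[s] }' "$passwd_sample" | sort)
echo "$shell_counts"
```

The order in which awk visits array keys is unspecified, hence the final sort.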

6.7 Text editors

Source: Editor War on Wikipedia.

6.7.1 Emacs

  • Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it’s not installed by default on many Linux distributions.

  • Basic survival commands:

    • emacs filename to open a file with emacs.
    • CTRL-x CTRL-f to open an existing or new file.
    • CTRL-x CTRL-s to save.
    • CTRL-x CTRL-w to save as.
    • CTRL-x CTRL-c to quit.
  • Google emacs cheatsheet

C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.

6.7.2 Vi

  • Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.

  • Basic survival commands:

    • vi filename to start editing a file.
    • vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
    • :x<Return> quits vi and saves changes.
    • :q!<Return> quits vi without saving latest changes.
    • :w<Return> saves changes.
    • :wq<Return> quits vi and saves changes.
  • Google vi cheatsheet

7 IDE (Integrated Development Environment)

  • Statisticians/data scientists write a lot of code. It is critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within the editor, debugging, profiling, version control, etc.

  • RStudio, Eclipse, Emacs, Matlab, Visual Studio, VS Code, etc.

8 Processes

8.1 Cancel a non-responding program

  • Press Ctrl+C to cancel a non-responding or long-running program.

8.2 Processes

  • OS runs processes on behalf of user.

  • Each process has a process ID (PID), an owner (UID), a parent process ID (PPID), the time and date the process started (STIME), the CPU time used (TIME), etc.

ps
  PID TTY          TIME CMD
  297 ?        00:00:16 rsession
  565 ?        00:00:00 quarto
  572 ?        00:00:04 deno
  936 ?        00:00:01 R
 1049 ?        00:00:00 sh
 1050 ?        00:00:00 ps
  • All current running processes:
ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 07:53 ?        00:00:00 s6-svscan -t0 /var/run/s6/services
root        38     1  0 07:53 ?        00:00:00 s6-supervise s6-fdholderd
root       252     1  0 07:53 ?        00:00:00 s6-supervise rstudio
rstudio+   258   252  0 07:53 ?        00:00:07 /usr/lib/rstudio-server/bin/rserver --server-daemonize 0
rstudio    297   258  1 07:53 ?        00:00:16 /usr/lib/rstudio-server/bin/rsession -u rstudio --session-use-secure-cookies 0 --session-root-path / --session-same-site 0 --session-use-file-storage 1 --launcher-token 18108BC5 --r-restore-workspace 2 --r-run-rprofile 2
rstudio    466   297  0 07:54 pts/0    00:00:00 bash -l
rstudio    565   297  0 08:12 ?        00:00:00 /bin/bash /usr/lib/rstudio-server/bin/quarto/bin/quarto preview linux.qmd --to html --no-watch-inputs --no-browse
rstudio    572   565  5 08:12 ?        00:00:04 /usr/lib/rstudio-server/bin/quarto/bin/tools/deno-x86_64-unknown-linux-gnu/deno run --unstable --no-config --cached-only --allow-read --allow-write --allow-run --allow-env --allow-net --allow-ffi --no-check --importmap=/usr/lib/rstudio-server/bin/quarto/bin/vendor/import_map.json /usr/lib/rstudio-server/bin/quarto/bin/quarto.js preview linux.qmd --to html --no-watch-inputs --no-browse
rstudio    936   572 78 08:13 ?        00:00:01 /usr/local/lib/R/bin/exec/R --no-save --no-restore --no-echo --no-restore --file=/usr/lib/rstudio-server/bin/quarto/share/rmd/rmd.R
rstudio   1051   936  0 08:13 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf' 2>&1
rstudio   1052  1051  0 08:13 ?        00:00:00 ps -eaf
  • All Python processes:
ps -eaf | grep python
rstudio   1053   936  0 08:13 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf | grep python' 2>&1
rstudio   1054  1053  0 08:13 ?        00:00:00 bash -c ps -eaf | grep python
rstudio   1056  1054  0 08:13 ?        00:00:00 grep python
  • Process with PID=1:
ps -fp 1
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 07:53 ?        00:00:00 s6-svscan -t0 /var/run/s6/services
  • All processes owned by a user:
ps -fu $USER
UID        PID  PPID  C STIME TTY          TIME CMD
rstudio    297   258  1 07:53 ?        00:00:16 /usr/lib/rstudio-server/bin/rsession -u rstudio --session-use-secure-cookies 0 --session-root-path / --session-same-site 0 --session-use-file-storage 1 --launcher-token 18108BC5 --r-restore-workspace 2 --r-run-rprofile 2
rstudio    466   297  0 07:54 pts/0    00:00:00 bash -l
rstudio    565   297  0 08:12 ?        00:00:00 /bin/bash /usr/lib/rstudio-server/bin/quarto/bin/quarto preview linux.qmd --to html --no-watch-inputs --no-browse
rstudio    572   565  5 08:12 ?        00:00:04 /usr/lib/rstudio-server/bin/quarto/bin/tools/deno-x86_64-unknown-linux-gnu/deno run --unstable --no-config --cached-only --allow-read --allow-write --allow-run --allow-env --allow-net --allow-ffi --no-check --importmap=/usr/lib/rstudio-server/bin/quarto/bin/vendor/import_map.json /usr/lib/rstudio-server/bin/quarto/bin/quarto.js preview linux.qmd --to html --no-watch-inputs --no-browse
rstudio    936   572 80 08:13 ?        00:00:01 /usr/local/lib/R/bin/exec/R --no-save --no-restore --no-echo --no-restore --file=/usr/lib/rstudio-server/bin/quarto/share/rmd/rmd.R
rstudio   1059   936  0 08:13 ?        00:00:00 sh -c 'bash'  -c 'ps -fu $USER' 2>&1
rstudio   1060  1059  0 08:13 ?        00:00:00 ps -fu rstudio

8.3 Kill processes

  • Kill process with PID=1001:
```{bash}
#| eval: false
kill 1001
```
  • Kill all R processes.
```{bash}
#| eval: false
killall -r R
```
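
To practice without risking real work, the full cycle (start, inspect, kill) can be rehearsed on a harmless sleep process. This is a sketch, assuming a POSIX-style ps; the variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# start a harmless long-running process in the background
sleep 300 &
pid=$!                          # $! holds the PID of the most recent background job
ps -p "$pid" -o pid,ppid,comm   # inspect just this process

kill "$pid"                     # sends SIGTERM (signal 15) by default
wait "$pid" 2>/dev/null         # reap it; exit status is 128 + signal number
status=$?
echo "sleep exited with status $status"

# kill -0 sends no signal; it only tests whether the PID still exists
if kill -0 "$pid" 2>/dev/null; then
  still_running=yes
else
  still_running=no
fi
echo "still running: $still_running"
```

If a process ignores the default SIGTERM, kill -9 sends SIGKILL, which cannot be caught; use it as a last resort since the process gets no chance to clean up.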

8.4 top

  • top prints realtime process information (very useful).
```{bash}
#| eval: false
top
```

  • Exit the top program by pressing the q key.

9 Secure shell (SSH)

9.1 SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.

  • On Linux or Mac Terminal, access a Linux machine by
```{bash}
#| eval: false
ssh [USERNAME]@[IP_ADDRESS]
```

Replace above [USERNAME] by your account user name on the Linux machine and [IP_ADDRESS] by the machine’s ip address. For example, to connect to the Hoffman2 cluster at UCLA

```{bash}
#| eval: false
ssh huazhou@hoffman2.idre.ucla.edu
```
  • For Windows users, there are at least three ways: (1) (recommended) Git Bash which is included in Git for Windows, (2) (not recommended) PuTTY program (free), or (3) (highly recommended) use WSL for Windows to install a full fledged Linux system within Windows.

9.2 Advantages of keys over password

  • Key authentication is more secure than password. Most passwords are weak.

  • A script or program may need to systematically SSH into other machines.

  • Log into multiple machines using the same key.

  • Seamless use of many services: Git/GitHub, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.

  • Many servers only allow key authentication and do not accept password authentication.

9.3 Key authentication

  • Public key. Put on the machine(s) you want to log in.

  • Private key. Put on your own computer. Consider this as the actual key in your pocket; never give private keys to others.

  • Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.

  • Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).

9.4 Steps to generate keys

  • On Linux, Mac, or Windows Git Bash, to generate a key pair:
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    • [KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of id_rsa generates a private key file named id_rsa and a public key file named id_rsa.pub.

    • [USERNAME] is the user for whom you will apply this SSH key.

    • Use an (optional) passphrase different from your password.

  • Set correct permissions on the .ssh folder and key files.
    • The permission for the ~/.ssh folder should be 700 (drwx------).
    • The permission of the private key ~/.ssh/id_rsa should be 600 (-rw-------).
    • The permission of the public key ~/.ssh/id_rsa.pub should be 644 (-rw-r--r--).
chmod 700 ~/.ssh
chmod 600 ~/.ssh/[KEY_FILENAME]
chmod 644 ~/.ssh/[KEY_FILENAME].pub
Note: Windows is different; it doesn't allow changing permissions in this way.
  • Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,
ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@[IP_ADDRESS]

Make sure the permission of the authorized_keys file is 600 (-rw-------).

  • Test your new key.
ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@[IP_ADDRESS]
  • From now on, you don’t need password each time you connect from your machine to the teaching server.

  • If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.

  • Same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.
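
The permission settings described above can be rehearsed on a scratch directory before touching the real ~/.ssh folder. In this sketch the files id_demo and id_demo.pub are empty placeholders (no real key material), and the directory is created by mktemp.

```bash
#!/usr/bin/env bash
# rehearse on a scratch directory; the real target would be ~/.ssh
scratch=$(mktemp -d)
mkdir "$scratch/.ssh"
# placeholder files standing in for a key pair (no real key material)
touch "$scratch/.ssh/id_demo" "$scratch/.ssh/id_demo.pub"

chmod 700 "$scratch/.ssh"              # folder: owner may read, write, enter
chmod 600 "$scratch/.ssh/id_demo"      # private key: owner read/write only
chmod 644 "$scratch/.ssh/id_demo.pub"  # public key: readable by everyone

# the first 10 characters of ls -l output are the file type and permission bits
dir_mode=$(ls -ld "$scratch/.ssh" | cut -c1-10)
key_mode=$(ls -l "$scratch/.ssh/id_demo" | cut -c1-10)
pub_mode=$(ls -l "$scratch/.ssh/id_demo.pub" | cut -c1-10)
echo "$dir_mode  $key_mode  $pub_mode"
```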

9.5 Transfer files between machines

  • scp securely transfers files between machines using SSH.
## copy file from local to remote
scp [LOCALFILE] [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FOLDER]
## copy file from remote to local
scp [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
  • sftp is FTP via SSH.

  • Globus is a GUI program for securely transferring files between machines. To use Globus, go to https://www.globus.org/ and log in through UCLA by selecting UCLA as your existing organizational login. Then download the Globus Connect Personal software and set your laptop as an endpoint. Very detailed instructions can be found at https://www.hoffman2.idre.ucla.edu/file-transfer/globus/.

  • GUIs for Windows (WinSCP) or Mac (Cyberduck).

  • You can even use RStudio to upload files to a remote machine with RStudio Server installed.

  • (Preferred way) Use a version control system (git, svn, cvs, …) to sync project files between different machines and systems.

9.6 Line breaks in text files

  • Windows uses a pair of CR and LF for line breaks.

  • Linux/Unix uses an LF character only.

  • MacOS X also uses a single LF character. But old Mac OS used a single CR character for line breaks.

  • If transferred in binary mode (bit by bit) between OSs, a text file can look garbled.

  • Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
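
When a file arrives with the wrong line breaks, the conversion can also be done by hand. The sketch below writes a small file with Windows-style CR LF endings and strips the CR characters with tr; the file names are made up for illustration.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
# write two lines with Windows-style CR LF endings
printf 'alpha\r\nbeta\r\n' > "$workdir/win.txt"
# strip every carriage return to get Unix-style LF endings
tr -d '\r' < "$workdir/win.txt" > "$workdir/unix.txt"

# each CR is one byte, so the converted file is two bytes shorter
win_bytes=$(( $(wc -c < "$workdir/win.txt") ))
unix_bytes=$(( $(wc -c < "$workdir/unix.txt") ))
echo "win.txt: $win_bytes bytes, unix.txt: $unix_bytes bytes"
```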

10 Run R in Linux

10.1 Interactive mode

  • Start R in the interactive mode by typing R in shell.

  • Then run R script by

source("script.R")

10.2 Batch mode

  • Demo script meanEst.R implements a (terrible) estimator of mean \[ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}. \]
cat meanEst.R
## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

print(estMeanPrimes(rnorm(100000)))
  • To run your R code non-interactively aka in batch mode, we have at least two options:
# default output to meanEst.Rout
R CMD BATCH meanEst.R

or

# output to stdout
Rscript meanEst.R
  • Batch calls are typically automated using a scripting language, e.g., Python, Perl, or shell script.

10.3 Pass arguments to R scripts

  • Specify arguments in R CMD BATCH:
R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
  • Specify arguments in Rscript:
Rscript script.R mu=1 sig=2 kap=3
  • Parse command line arguments using magic formula
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

in R script. After calling the above code, all command line arguments will be available in the global namespace.

  • To understand the magic formula commandArgs, run R by:
R '--args mu=1 sig=2 kap=3'

and then issue commands in R

commandArgs()
commandArgs(TRUE)
  • Understand the magic formula parse and eval:
rm(list = ls())
print(x)
parse(text = "x=3")
eval(parse(text = "x=3"))
print(x)
  • runSim.R has components: (1) command argument parser, (2) method implementation, (3) data generator with unspecified parameter n, and (4) estimation based on generated data.
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • Call runSim.R with sample size n=100:
R CMD BATCH '--args n=100' runSim.R

or

Rscript runSim.R n=100
[1] 0.2459393

10.4 Run long jobs

  • Many statistical computing tasks take a long time: simulation, MCMC, etc. If we log out of Linux before a job finishes, the job is killed.

  • nohup command in Linux runs program(s) immune to hangups and writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.

  • nohup is POSIX standard thus available on Linux and MacOS.

  • Run runSim.R in background and write output to nohup.out:

nohup Rscript runSim.R n=100 &
[1] 0.384066

The & at the end of the command instructs Linux to run this command in background, so we gain control of the terminal immediately.
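
In practice it is often handy to send the output to a log file of our own choosing and to record the PID of the job. The sketch below uses a one-second sh command as a stand-in for a long Rscript call; the log file name job.log is made up for illustration.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
# a short stand-in for a long-running job such as an Rscript call;
# stdout and stderr both go to job.log instead of nohup.out
nohup sh -c 'sleep 1; echo "job finished"' > "$workdir/job.log" 2>&1 &
job_pid=$!
echo "started job with PID $job_pid"

# in a real session you could log out here; in this demo we simply wait
wait "$job_pid"
job_status=$?
cat "$workdir/job.log"
```

After logging back in, ps -fp with the recorded PID shows whether the job is still running.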

10.5 screen

  • screen is another popular utility, but not installed by default.

  • Typical workflow using screen.

    1. Access remote server using ssh.

    2. Start jobs in batch mode.

    3. Detach jobs.

    4. Exit from server, wait for jobs to finish.

    5. Access remote server using ssh.

    6. Re-attach jobs, check on progress, get results, etc.

10.6 Use R to call R

R in conjunction with nohup (or screen) can be used to orchestrate a large simulation study.

  • It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.

  • We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.

  • Python in many ways makes a better glue.

  • Suppose we have

    • runSim.R which runs a simulation based on command line argument n.
    • A large collection of n values that we want to use in our simulation study.
    • Access to a server with 128 cores.
      How to parallelize the job?
  • Option 1: manually call runSim.R for each setting.

  • Option 2 (smarter): automate calls using R and nohup.

  • Let’s demonstrate using the script autoSim.R

cat autoSim.R
# autoSim.R

nVals <- seq(100, 1000, by=100)
for (n in nVals) {
  oFile <- paste("n", n, ".txt", sep="")
  sysCall <- paste("nohup Rscript runSim.R n=", n, " > ", oFile, sep="")
  system(sysCall, wait = FALSE)
  print(paste("sysCall=", sysCall, sep=""))
}

Note that when we call a bash command using the system function in R, we set the optional argument wait=FALSE so that jobs can run in parallel.

Rscript autoSim.R
[1] "sysCall=nohup Rscript runSim.R n=100 > n100.txt"
[1] "sysCall=nohup Rscript runSim.R n=200 > n200.txt"
[1] "sysCall=nohup Rscript runSim.R n=300 > n300.txt"
[1] "sysCall=nohup Rscript runSim.R n=400 > n400.txt"
[1] "sysCall=nohup Rscript runSim.R n=500 > n500.txt"
[1] "sysCall=nohup Rscript runSim.R n=600 > n600.txt"
[1] "sysCall=nohup Rscript runSim.R n=700 > n700.txt"
[1] "sysCall=nohup Rscript runSim.R n=800 > n800.txt"
[1] "sysCall=nohup Rscript runSim.R n=900 > n900.txt"
[1] "sysCall=nohup Rscript runSim.R n=1000 > n1000.txt"
  • Now we just need to write a script to collect results from the output files.

  • Later we will learn how to coordinate large scale computation on UCLA Hoffman2 cluster, using Linux and R scripting.
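
The collection step mentioned above could be sketched in bash as follows. This is only an illustration under stated assumptions: each output file is assumed to hold a single line of the form "[1] value", and the three mock files below contain made-up placeholder numbers, not real simulation results.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# mock output files with made-up placeholder values, for illustration only
echo '[1] 0.10' > n100.txt
echo '[1] 0.20' > n200.txt
echo '[1] 0.30' > n1000.txt

# build a CSV with one row per output file
echo 'n,estimate' > results.csv
for f in n*.txt; do
  n=${f#n}; n=${n%.txt}           # strip the "n" prefix and ".txt" suffix
  est=$(awk '{ print $2 }' "$f")  # second field of a line like "[1] 0.2459393"
  echo "$n,$est" >> results.csv
done

# sort data rows numerically by n, keeping the header first
{ head -n 1 results.csv; tail -n +2 results.csv | sort -t, -k1,1n; } > results_sorted.csv
cat results_sorted.csv
```

The numeric sort matters because the shell expands n*.txt in lexicographic order, which places n1000.txt before n200.txt.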

11 Some other Linux commands

  • Log out Linux: exit or logout or ctrl+d.

  • Clear screen: clear.