R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.41
[13] digest_0.6.33 rlang_1.1.2 evaluate_0.23
1 Preface
This html is rendered from linux.qmd on Linux Ubuntu 22.04 (jammy).
Mac users can render linux.qmd directly. Some tools such as tree and locate need to be installed (follow the error messages).
Windows users need to install Git for Windows to render linux.qmd using Git Bash or install WSL (Windows Subsystem for Linux) to render linux.qmd using Ubuntu. Some tools such as tree and locate need to be installed (follow the error messages).
Both Mac and Windows users can also use Docker to render linux.qmd within an Ubuntu container. Details in Lab 3.
Set Bash engine in Windows
Windows Git Bash users need to set bash engine to Git Bash:
# only on Windows Git Bash
knitr::opts_chunk$set(engine.path = list(bash = "C:\\Program Files\\Git\\bin\\bash.exe"))
or similarly to WSL.
In this lecture, most code chunks are bash commands instead of R code.
2 Why Linux
Linux is the most common platform for scientific computing and deployment of data science tools.
Open source and community support.
Things break; when they break using Linux, it’s easy to fix.
Scalability: portable devices (Android, iOS), laptops, servers, clusters, and super-computers.
E.g. UCLA Hoffman2 cluster runs on Linux; most machines in cloud (AWS, Azure, GCP) run on Linux.
Debian/Ubuntu is a popular choice for personal computers.
RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2024-01-01).
MacOS is derived from Unix (its Darwin core builds on BSD), not from Linux. It is POSIX compliant. Most shell commands we review here apply to the MacOS terminal as well.
Windows/DOS, unfortunately, is a totally different breed.
The -l option indicates it should be a login shell.
Change your login shell permanently:
chsh -s /bin/bash [USERNAME]
Then log out and log in.
4.2 Command history and bash completion
We can navigate to previous/next commands with the up and down arrow keys; the history command lists the commands entered so far. (pushd and popd maintain a stack of directories, not of commands.)
Bash provides the following standard completions by default. Far fewer typing errors and much less time!
Pathname completion.
Filename completion.
Variablename completion: echo $[TAB][TAB].
Username completion: cd ~[TAB][TAB].
Hostname completion: ssh huazhou@[TAB][TAB].
It can also be customized to auto-complete other stuff such as options and command’s arguments. Google bash completion for more information.
4.3 man is man’s best friend
Online help for shell commands: man [COMMANDNAME].
# display the first 30 lines of documentation for the ls command
man ls | head -30
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
fied.
Mandatory arguments to long options are mandatory for short options
too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
--author
with -l, print the author of each file
-b, --escape
print C-style escapes for nongraphic characters
--block-size=SIZE
with -l, scale sizes by SIZE when printing them; e.g.,
5 Navigate file system
5.1 Linux directory structure
Upon login, the user is in their home directory.
tree command (if installed) displays the directory structure. tree -L level descends only level directories deep.
# display only directories, 1 level deep from the root directory
tree -d -L 1 /
touch creates an empty file; if the file already exists, its contents are left unchanged (only its timestamps are updated).
rm deletes a file.
mkdir creates a new directory.
rmdir deletes an empty directory.
rm -rf deletes a directory and all contents in that directory (be cautious using the -f option …).
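The commands above are safest to try in a throwaway directory. The following is a minimal sketch, assuming mktemp is available (it is on most Linux and MacOS systems); the names demo.txt and sub are made up for illustration.

```shell
# work in a scratch directory so nothing important can be deleted
scratch=$(mktemp -d)
cd "$scratch"

touch demo.txt          # create an empty file
mkdir sub               # create a directory
touch sub/inner.txt     # put a file inside it

# rmdir only removes empty directories, so this attempt fails
rmdir sub 2>/dev/null || echo "rmdir refuses: sub is not empty"

rm demo.txt             # delete a single file
rm -rf sub              # delete the directory and everything in it

remaining=$(ls -A | wc -l)   # count what is left in the scratch directory
echo "entries left: $remaining"

cd - >/dev/null
rmdir "$scratch"        # works now because the directory is empty
```

Note that rm has no undo; there is no trash can on the command line.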
5.5 Find files
locate locates a file by name (need mlocate program installed, not POSIX standard):
locate linux.qmd
find is similar to locate but has more functionality, e.g., selecting files by age, size, or permissions, and is ubiquitous.
# search within current folder
find linux.qmd
linux.qmd
# search within the parent folder
find .. -name linux.qmd
../02-linux/linux.qmd
which locates a program (executable file):
which R
/usr/local/bin/R
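To see how find selects files by name and by type, here is a small sketch on a made-up directory tree (the file and folder names are hypothetical, and mktemp is assumed to be available).

```shell
# build a small directory tree to search
scratch=$(mktemp -d)
mkdir -p "$scratch/a/b"
touch "$scratch/notes.qmd" "$scratch/a/linux.qmd" "$scratch/a/b/plot.png"

# find by exact name, anywhere below the starting directory
by_name=$(find "$scratch" -name linux.qmd)
echo "$by_name"

# find by pattern; quote the pattern so the shell does not expand it first
n_qmd=$(find "$scratch" -type f -name '*.qmd' | wc -l)
echo "number of .qmd files: $n_qmd"

# list directories only
find "$scratch" -type d

rm -rf "$scratch"
```

Unlike locate, find walks the file system each time, so it needs no pre-built database but can be slow on large trees.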
5.6 Wildcard characters
Wildcard    Matches
?           any single character
*           any character 0 or more times
+           one or more preceding pattern
^           beginning of the line
$           end of the line
[set]       any character in set
[!set]      any character not in set
[a-z]       any lowercase letter
[0-9]       any number (same as [0123456789])
Example:
# all png files in current folder
ls -l *.png
-rw-r--r-- 1 504 rstudio 321281 Mar 11 2023 key_authentication_1.png
-rw-r--r-- 1 504 rstudio 96119 Mar 11 2023 key_authentication_2.png
-rw-r--r-- 1 504 rstudio 11662 Mar 11 2023 linux_directory_structure.png
-rw-r--r-- 1 504 rstudio 42472 Mar 11 2023 linux_filepermission_oct.png
-rw-r--r-- 1 504 rstudio 102188 Mar 11 2023 linux_filepermission.png
-rw-r--r-- 1 504 rstudio 437112 Mar 11 2023 redhat_kills_centos.png
-rw-r--r-- 1 504 rstudio 141962 Mar 11 2023 Richard_Stallman_2013.png
-rw-r--r-- 1 504 rstudio 685657 Mar 11 2023 screenshot_top.png
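A quick way to get a feel for the shell wildcards ?, *, [set], and [!set] is to echo them against a few files. This is a sketch with made-up file names in a scratch directory; the comments show what each pattern expands to for these particular files.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch a1.png a2.png b1.png ab.txt notes.txt

echo *.png        # any name ending in .png: a1.png a2.png b1.png
echo a?.png       # a, exactly one character, then .png: a1.png a2.png
echo [ab]1.png    # a or b, then 1.png: a1.png b1.png
echo [!a]*        # names not starting with a: b1.png notes.txt
echo *[0-9].png   # a digit right before .png: a1.png a2.png b1.png

png_files=$(echo *.png)
not_a=$(echo [!a]*)

cd - >/dev/null
rm -rf "$scratch"
```

The shell expands the pattern before the command runs, so echo (or ls, rm, ...) only ever sees the resulting list of file names.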
5.7 Regular expression
Wildcards are a simple pattern language closely related to regular expressions.
Regular expressions (regex) are a powerful tool to efficiently sift through large amounts of text: record linking, data cleaning, scraping data from websites or other data feeds.
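As a small illustration of regex-based data cleaning, the sketch below uses grep -E (extended regular expressions) on a made-up five-line file to pick out rows by the anchors ^ and $ and the quantifiers + and {3}.

```shell
# a tiny sample file; the contents are invented for illustration
sample=$(mktemp)
printf '%s\n' 'id,name' '101,alice' '102,bob' 'x03,carol' '1040,dave' > "$sample"

# lines that start (^) with one or more (+) digits followed by a comma
grep -E '^[0-9]+,' "$sample"

# lines that end ($) with a comma and exactly three lowercase letters
grep -E ',[a-z]{3}$' "$sample"

n_numeric=$(grep -c -E '^[0-9]+,' "$sample")   # -c counts matching lines
three=$(grep -E ',[a-z]{3}$' "$sample")
rm "$sample"
```

The first pattern rejects the header and the malformed id x03; the second keeps only the row whose name has three letters.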
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
Be cautious about running cat on large text files.
head prints the first 10 lines of a file:
head runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
head -l prints the first \(l\) lines of a file:
head -15 runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
tail prints the last 10 lines of a file:
tail runSim.R
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
tail -l prints the last \(l\) lines of a file:
tail -15 runSim.R
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
tail -n +NUM outputs starting with line NUM:
tail -n +20 runSim.R
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
Questions:
How to see the 11th line of the file and nothing else?
What about the 11th to the last line?
6.2 Piping and redirection
| sends output from one command as input of another command.
ls -l | head -5
total 7520
-rw-r--r-- 1 504 rstudio 258 Mar 11 2023 autoSim.R
-rw-r--r-- 1 504 rstudio 110345 Mar 11 2023 Emacs_Reference_Card.pdf
-rw-r--r-- 1 504 rstudio 157353 Mar 11 2023 IDRE_Winter_2019_Workshops.pdf
-rw-r--r-- 1 504 rstudio 321281 Mar 11 2023 key_authentication_1.png
> directs output from one command to a file.
>> appends output from one command to a file.
< reads input from a file.
Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently. See HW1.
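The four operators can be combined freely. A minimal sketch on a made-up scratch file (the fruit names are placeholders):

```shell
out=$(mktemp)

printf '%s\n' banana apple cherry > "$out"   # > creates or overwrites the file
printf '%s\n' apple date >> "$out"           # >> appends to it

# < feeds the file to sort; | passes the sorted lines on to uniq -c,
# which prints each distinct line with its count
sort < "$out" | uniq -c

# a three-stage pipeline: sort, drop duplicates, count lines
n_unique=$(sort < "$out" | uniq | wc -l)
echo "distinct lines: $n_unique"

rm "$out"
```

Each command in a pipeline starts working as soon as input arrives, so no intermediate file is ever written to disk.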
6.3 less is more; more is less
more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.
less is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.
less doesn’t need to read the whole file, i.e., it loads files faster than more.
6.4 grep
grep prints lines that match an expression:
Show lines that contain string CentOS:
# quotes not necessary if not a regular expression
grep 'CentOS' linux.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2024-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
Search multiple text files:
grep 'CentOS' *.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2024-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
Show matching line numbers:
grep -n 'CentOS' linux.qmd
65:- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
71:- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2024-01-01).
378:- Show lines that contain string `CentOS`:
381:grep 'CentOS' linux.qmd
386:grep 'CentOS' *.qmd
391:grep -n 'CentOS' linux.qmd
408:- Replace `CentOS` by `RHEL` in a text file:
410:sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
Find all files in current directory with .png extension:
drwxr-xr-x 20 504 rstudio 640 Jan 3 03:31 .
drwxr-xr-x 21 504 rstudio 672 Jan 2 03:27 ..
6.5 sed
sed is a stream editor.
Replace CentOS by RHEL in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
- RHEL/RHEL is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs RHEL 7.9.2009 (as of 2024-01-01).
- Show lines that contain string `RHEL`:
grep 'RHEL' linux.qmd
grep 'RHEL' *.qmd
grep -n 'RHEL' linux.qmd
- Replace `RHEL` by `RHEL` in a text file:
sed 's/RHEL/RHEL/' linux.qmd | grep RHEL
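Notice in the output above that the second CentOS on the first line survived: s/old/new/ replaces only the first match on each line. Adding the g flag replaces every match. A small sketch on a made-up sentence:

```shell
line='RHEL/CentOS is popular on servers; CentOS 7 is common.'

first_only=$(echo "$line" | sed 's/CentOS/RHEL/')   # first match on each line only
all=$(echo "$line" | sed 's/CentOS/RHEL/g')         # g flag: every match on each line

echo "$first_only"
echo "$all"
```

In both cases sed writes the edited text to standard output and leaves its input untouched.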
6.6 awk
awk is a filter and report writer.
First let’s display the content of the file /etc/passwd (this file only exists in Linux and MacOS):
Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user ID info, (6) home directory, and (7) command shell, separated by :.
Print sorted list of login names:
awk -F: '{ print $1 }' /etc/passwd | sort | head -10
_apt
backup
bin
daemon
games
gnats
irc
list
lp
mail
Print number of lines in a file, as NR stands for Number of Records (by default each line is one record):
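One way to do this is sketched below on a made-up three-line file in the /etc/passwd format (the user alice and the file contents are invented for illustration); the same awk programs can be pointed at /etc/passwd itself.

```shell
sample=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
              'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
              'alice:x:1000:1000:Alice:/home/alice:/bin/bash' > "$sample"

# NR counts the records read so far; inside END it is the total
n_lines=$(awk 'END { print NR }' "$sample")
echo "lines: $n_lines"

# NR can also number each record; -F: splits fields on colons
awk -F: '{ print NR ": " $1 " uses " $7 }' "$sample"

# count users whose command shell (field 7) is bash
n_bash=$(awk -F: '$7 == "/bin/bash" { n++ } END { print n + 0 }' "$sample")
echo "bash users: $n_bash"

rm "$sample"
```

The n + 0 in the last program forces a numeric 0 to be printed when no line matches.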
Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.
Basic survival commands:
vi filename to start editing a file.
vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
:x<Return> quits vi and saves changes.
:q!<Return> quits vi without saving latest changes.
:w<Return> saves changes.
:wq<Return> quits vi and saves changes.
Google vi cheatsheet:
6.7.2 Emacs
Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it’s not installed by default on many Linux distributions.
Basic survival commands:
emacs filename to open a file with emacs.
CTRL-x CTRL-f to open an existing or new file.
CTRL-x CTRL-s to save.
CTRL-x CTRL-w to save as.
CTRL-x CTRL-c to quit.
Google emacs cheatsheet
C-<key> means hold the control key, and press <key>. M-<key> means press the Esc key once, and press <key>.
7 IDE (Integrated Development Environment)
Statisticians/data scientists write a lot of code. It is critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.
RStudio, JupyterLab, Eclipse, Emacs, Matlab, Visual Studio, VS Code, etc.
8 Processes
8.1 Cancel a non-responding program
Press Ctrl+C to cancel a non-responding or long-running program.
8.2 Processes
OS runs processes on behalf of user.
Each process has a process ID (PID), user name (UID), parent process ID (PPID), time and date the process started (STIME), time running (TIME), etc.
ps
PID TTY TIME CMD
401 ? 00:01:54 rsession
27052 ? 00:00:00 quarto
27060 ? 00:00:02 deno
27620 ? 00:00:01 R
27760 ? 00:00:00 sh
27761 ? 00:00:00 ps
Replace [USERNAME] above by your account user name on the Linux machine and [IP_ADDRESS] by the machine’s IP address. For example, to connect to the Hoffman2 cluster at UCLA
For Windows users, there are at least three ways: (1) (recommended) Git Bash, which is included in Git for Windows, (2) (not recommended) PuTTY program (free), or (3) (highly recommended) use WSL for Windows to install a full-fledged Linux system within Windows.
9.2 Advantages of keys over password
Key authentication is more secure than password. Most passwords are weak.
Script or a program may need to systematically SSH into other machines.
Log into multiple machines using the same key.
Seamless use of many services: Git/GitHub, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.
Many servers only allow key authentication and do not accept password authentication.
9.3 Key authentication
Public key. Put on the machine(s) you want to log in.
Private key. Put on your own computer. Consider this as the actual key in your pocket; never give private keys to others.
Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.
Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
9.4 Steps to generate keys (on terminal)
On Linux, Mac, Windows Git Bash, or Windows WSL, to generate a key pair:
[KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of id_rsa generates a private key file named id_rsa and a public key file named id_rsa.pub.
[USERNAME] is the user for whom you will apply this SSH key.
Use an (optional) passphrase different from your password.
Set correct permissions on the .ssh folder and key files.
The permission for the ~/.ssh folder should be 700 (drwx------).
The permission of the private key ~/.ssh/id_rsa should be 600 (-rw-------).
The permission of the public key ~/.ssh/id_rsa.pub should be 644 (-rw-r--r--).
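The three permission settings can be practiced without touching real keys. The sketch below uses a scratch directory standing in for ~/.ssh and empty placeholder files instead of actual keys (both are assumptions for the demo); on a real machine the same chmod commands would be applied to ~/.ssh and the key files in it.

```shell
# a scratch directory that stands in for ~/.ssh
sshdir=$(mktemp -d)/.ssh
mkdir "$sshdir"
touch "$sshdir/id_rsa" "$sshdir/id_rsa.pub"   # empty placeholders, not real keys

chmod 700 "$sshdir"              # owner may read, write, enter: drwx------
chmod 600 "$sshdir/id_rsa"       # owner may read and write:     -rw-------
chmod 644 "$sshdir/id_rsa.pub"   # everyone may read:            -rw-r--r--

# the first column of ls -l shows the resulting permission strings
ls -ld "$sshdir" "$sshdir/id_rsa" "$sshdir/id_rsa.pub"

dir_mode=$(ls -ld "$sshdir" | cut -c1-10)
key_mode=$(ls -l "$sshdir/id_rsa" | cut -c1-10)
pub_mode=$(ls -l "$sshdir/id_rsa.pub" | cut -c1-10)

rm -rf "$(dirname "$sshdir")"
```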
From now on, you don’t need password each time you connect from your machine to the teaching server.
If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.
Same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.
9.5 Transfer files between machines
scp securely transfers files between machines using SSH.
## copy file from local to remote
scp [LOCALFILE] [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FOLDER]
## copy file from remote to local
scp [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
sftp is FTP via SSH.
Globus is a GUI program for securely transferring files between machines. To use Globus, go to https://www.globus.org/ and log in through UCLA by selecting UCLA as your existing organizational login. Then download the Globus Connect Personal software and set your laptop as an endpoint. Very detailed instructions can be found at https://www.hoffman2.idre.ucla.edu/file-transfer/globus/.
GUIs for Windows (WinSCP) or Mac (Cyberduck).
You can even use RStudio to upload files to a remote machine with RStudio Server installed.
(Preferred way) Use a version control system (git, svn, cvs, …) to sync project files between different machines and systems.
9.6 Line breaks in text files
Windows uses a pair of CR and LF for line breaks.
Linux/Unix uses an LF character only.
MacOS X also uses a single LF character. But old Mac OS used a single CR character for line breaks.
If transferred in binary mode (bit by bit) between OSs, a text file could look like a mess.
Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
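If a file does arrive with Windows line breaks, they can be inspected and removed on the command line. A minimal sketch on a made-up two-line file, using od to show the bytes and tr to delete the CR characters (this assumes the file contains no CR that should be kept):

```shell
f=$(mktemp)
printf 'line one\r\nline two\r\n' > "$f"    # two lines with Windows (CR LF) line breaks

od -c "$f"                                  # the \r \n pairs are visible in the byte dump

unix=$(mktemp)
tr -d '\r' < "$f" > "$unix"                 # drop every CR, leaving LF-only line breaks

bytes_before=$(wc -c < "$f")
bytes_after=$(wc -c < "$unix")
echo "bytes: $bytes_before -> $bytes_after" # the difference is the two CR characters removed

rm "$f" "$unix"
```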
10 Run R in Linux
10.1 Interactive mode
Start R in the interactive mode by typing R in shell.
Then run R script by
source("script.R")
10.2 Batch mode
Demo script meanEst.R implements a (terrible) estimator of the mean \[
{\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}.
\]
cat meanEst.R
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
print(estMeanPrimes(rnorm(100000)))
To run your R code non-interactively aka in batch mode, we have at least two options:
# default output to meanEst.Rout
R CMD BATCH meanEst.R
or
# output to stdout
Rscript meanEst.R
Typically we automate batch calls using a scripting language, e.g., Python, Perl, and shell script.
10.3 Pass arguments to R scripts
Specify arguments in R CMD BATCH:
R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
Specify arguments in Rscript:
Rscript script.R mu=1 sig=2 kap=3
Parse command line arguments using magic formula
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}
in R script. After calling the above code, all command line arguments will be available in the global namespace.
To understand the magic formula commandArgs, run R by:
runSim.R has components: (1) command argument parser, (2) method implementation, (3) data generator with unspecified parameter n, and (4) estimation based on generated data.
## parsing command arguments
for (arg in commandArgs(TRUE)) {
eval(parse(text=arg))
}
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
n = length(x)
ind = sapply(1:n, isPrime)
return (mean(x[ind]))
}
# simulate data
x = rnorm(n)
# estimate mean
estMeanPrimes(x)
Call runSim.R with sample size n=100:
R CMD BATCH '--args n=100' runSim.R
or
Rscript runSim.R n=100
[1] 0.07827219
10.4 Run long jobs
Many statistical computing tasks take long: simulation, MCMC, etc. If we exit Linux when the job is unfinished, the job is killed.
nohup command in Linux runs program(s) immune to hangups and writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.
nohup is POSIX standard thus available on Linux and MacOS.
Run runSim.R in the background and write output to nohup.out:
nohup Rscript runSim.R n=100 &
[1] -0.3518655
The & at the end of the command instructs Linux to run this command in background, so we gain control of the terminal immediately.
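The same pattern can be tried without R. In the sketch below a one-second sleep stands in for a long job (an assumption for the demo), output is redirected to a log file of our choosing instead of the default nohup.out, and $! captures the process ID of the background job so we can check on it later.

```shell
log=$(mktemp)

# stand-in for a long-running job; in practice this would be e.g. Rscript runSim.R n=100
nohup sh -c 'sleep 1; echo "job finished"' > "$log" 2>&1 < /dev/null &
pid=$!                      # process ID of the most recent background job
echo "started job with PID $pid"

# the terminal is free at this point; for the demo we simply wait for the job to end
wait "$pid"

result=$(cat "$log")
echo "$result"
rm "$log"
```

Redirecting output explicitly is handy when several jobs run at once, since they would otherwise all append to the same nohup.out.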
10.5 screen
screen is another popular utility, but not installed by default.
Typical workflow using screen.
Access remote server using ssh.
Start jobs in batch mode.
Detach jobs.
Exit from server, wait for jobs to finish.
Access remote server using ssh.
Re-attach jobs, check on progress, get results, etc.
10.6 Use R to call R
R in conjunction with nohup (or screen) can be used to orchestrate a large simulation study.
It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.
We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.
Python in many ways makes a better glue.
Suppose we have
runSim.R which runs a simulation based on command line argument n.
A large collection of n values that we want to use in our simulation study.
Access to a server with 128 cores.
How to parallelize the job?
Option 1: manually call runSim.R for each setting.
Option 2 (smarter): automate calls using R and nohup.
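To make the idea of automated calls concrete, here is a dry-run sketch that uses a shell loop in place of R as the glue: it only prints one nohup command per sample size rather than running anything. The function name make_cmds, the n values, and the log-file naming scheme are all made up for illustration.

```shell
# print one background nohup command per sample size given as an argument
make_cmds() {
  for n in "$@"; do
    echo "nohup Rscript runSim.R n=$n > n_$n.log 2>&1 &"
  done
}

# dry run: inspect the commands before launching anything
make_cmds 100 200 300 400 500

first_cmd=$(make_cmds 100 200 300 | head -1)
n_cmds=$(make_cmds 100 200 300 400 500 | wc -l)
echo "commands generated: $n_cmds"
```

Once the printed commands look right, they could be executed by piping them to a shell (make_cmds 100 200 | sh); with many settings it is worth launching no more jobs at a time than the server has cores.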