R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.42
[13] digest_0.6.34 rlang_1.1.3 evaluate_0.23
2 Advanced R
To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must-read. Reading it now will save you many hours later.
We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.
To identify performance issues, we need to measure runtime accurately.
4.1 system.time
set.seed(203)
x <- runif(1e6)
system.time({sqrt(x)})
user system elapsed
0.002 0.000 0.002
system.time({x^0.5})
user system elapsed
0.006 0.000 0.007
system.time({exp(log(x) / 2)})
user system elapsed
0.005 0.000 0.005
Check ?proc.time for the explanation:
The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.
From William Dunlap:
“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.
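The distinction between the three columns is easier to see when CPU-bound work is timed next to a call that only waits. The following is a minimal sketch in base R; the repetition count of 20, the sleep length of 0.5 seconds, and the variable names t_cpu, t_sleep, and elapsed are arbitrary choices for illustration, not part of the original notes.

```r
set.seed(203)
x <- runif(1e6)

# CPU-bound work: most of the elapsed time is typically charged as user time.
t_cpu <- system.time(for (k in 1:20) sqrt(x))

# Sleeping: the process waits without computing, so elapsed time grows
# while user and system time stay close to zero.
t_sleep <- system.time(Sys.sleep(0.5))

print(t_cpu)
print(t_sleep)

# A single timing is noisy; repeating the measurement and summarizing
# with the median gives a steadier estimate.
elapsed <- replicate(10, system.time(sqrt(x))[["elapsed"]])
print(median(elapsed))
```

The exact numbers depend on the machine and its load, so no expected output is shown.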
Original code for centering columns of a data frame:
library(profvis)
profvis({
  # Store in another variable for this run
  data1 <- data
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})
Profile apply vs colMeans vs lapply vs vapply:
profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})
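The four calls compute the same means but differ in what they return and in how they treat the data frame: apply() and colMeans() first coerce it to a matrix, while lapply() and vapply() iterate over the columns directly. The sketch below checks this on a small, made-up data frame; its size and the column names a, b, and c are assumptions for illustration, since the notes use a much larger data object.

```r
set.seed(203)
# Hypothetical small stand-in for the `data` object used in the notes.
data <- data.frame(id = paste0("g", 1:5), a = rnorm(5), b = rnorm(5), c = rnorm(5))
num <- data[, names(data) != "id"]

m_apply    <- apply(num, 2, mean)            # named numeric vector
m_colmeans <- colMeans(num)                  # named numeric vector
m_lapply   <- lapply(num, mean)              # list with one element per column
m_vapply   <- vapply(num, mean, numeric(1))  # named numeric vector, type-checked

print(is.list(m_lapply))
print(all.equal(m_apply, m_colmeans))
print(all.equal(m_vapply, unlist(m_lapply)))
```

vapply() also stops with an error if a column's result is not a single numeric value, which lapply() would silently accept.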
We decide to use vapply:
profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})
Calculate mean and center in one pass:
profvis({
  data1 <- data
  # Given a column, normalize values and return them
  col_norm <- function(col) {
    return(col - mean(col))
  }
  # Apply the normalizer function over all columns except id
  data1[, names(data1) != "id"] <- lapply(data1[, names(data1) != "id"], col_norm)
})
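Before trusting a faster version, it is worth confirming that it returns the same result as the one it replaces. The sketch below compares the loop-based centering with the one-pass lapply() version on a small, made-up data frame; the data, the helper vector keep, and the names data_loop and data_lapply are assumptions for illustration.

```r
set.seed(203)
# Hypothetical small stand-in for the `data` object used in the notes.
data <- data.frame(id = paste0("g", 1:6),
                   a = rnorm(6, mean = 5),
                   b = rnorm(6, mean = 5))
keep <- names(data) != "id"

# Version 1: compute the means first, then subtract column by column.
data_loop <- data
means <- vapply(data_loop[, keep], mean, numeric(1))
for (i in seq_along(means)) {
  data_loop[, keep][, i] <- data_loop[, keep][, i] - means[i]
}

# Version 2: center each column in a single pass.
col_norm <- function(col) col - mean(col)
data_lapply <- data
data_lapply[, keep] <- lapply(data_lapply[, keep], col_norm)

# Both versions should agree, and the centered columns should average to about 0.
print(all.equal(data_loop, data_lapply))
print(colMeans(data_lapply[, keep]))
```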
5.2 Example: profiling memory
Original code for cumulative sums:
profvis({
  data <- data.frame(value = runif(1e5))
  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i - 1] + data$value[i]
  }
})
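The loop computes a running total, so its output can be checked against base R's cumsum(). The sketch below runs the same loop without profvis on a smaller data frame; the size of 1000 rows is an arbitrary choice to keep the check fast.

```r
set.seed(203)
data <- data.frame(value = runif(1000))

# Same loop as in the profiled code, on fewer rows.
data$sum[1] <- data$value[1]
for (i in seq(2, nrow(data))) {
  data$sum[i] <- data$sum[i - 1] + data$value[i]
}

# The loop result should match the vectorized cumulative sum.
print(all.equal(data$sum, cumsum(data$value)))
```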
Write a function to avoid expensive indexing by $: