Select Topics in R Programming - Part I

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

March 13, 2024

1 Preamble

Display machine information for reproducibility.

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.42        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

2 Advanced R

To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must read. Read now to save numerous hours you might waste in future.

We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.

3 Style

4 Benchmark

Sources:

In order to identify performance issue, we need to measure runtime accurately.

4.1 system.time

set.seed(203)
x <- runif(1e6)

system.time({sqrt(x)})
   user  system elapsed 
  0.002   0.000   0.002 
system.time({x ^ 0.5})
   user  system elapsed 
  0.006   0.000   0.007 
system.time({exp(log(x) / 2)})
   user  system elapsed 
  0.005   0.000   0.005 

Check ?proc.time for the explanation:

The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.

From William Dunlap:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

start <- Sys.time()
y <- exp(log(x) / 2)
sum_y <- sum(y)
end <- Sys.time()

end - start
Time difference of 0.0170536 secs

4.2 microbenchmark

library("microbenchmark")
library("ggplot2")

mbm <- microbenchmark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2),
  times = 100
)
mbm
Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval cld
       sqrt(x) 1.610167 1.693230 1.981390 1.816563 1.985584 3.191334   100 a  
         x^0.5 6.323918 6.580500 6.902627 6.733667 7.072250 8.297625   100  b 
 exp(log(x)/2) 5.116167 5.263605 5.647292 5.395001 6.167064 7.244501   100   c

Results from microbenchmark can be nicely plotted in base R or ggplot2.

boxplot(mbm)

autoplot(mbm)

4.3 bench

The bench package is another tool for microbenchmarking. The output is a tibble:

library(bench)

(lb <- bench::mark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2)
))
# A tibble: 3 × 6
  expression         min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 sqrt(x)         1.43ms   1.64ms      609.    7.63MB    304. 
2 x^0.5            6.3ms   6.55ms      153.    7.63MB     29.5
3 exp(log(x)/2)   5.19ms   5.35ms      187.    7.63MB     37.4

To visualize the result

plot(lb)
Loading required namespace: tidyr

It is colored according to GC (garbage collection) levels.

5 Profiling

Premature optimization is the root of all evil (or at least most of it) in programming.
-Don Knuth

Sources:

5.1 Example: profiling time

First generate test data:

times <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
head(data)
  id       V1       V2       V3       V4       V5       V6       V7       V8
1 g1 4.507591 4.948608 3.817944 4.596002 4.301750 6.419789 4.620631 5.963684
2 g2 4.269550 4.674749 4.613203 2.122983 3.768296 4.292365 4.955401 5.014855
3 g3 5.113355 4.614745 4.500319 4.259277 4.812196 4.307466 5.257527 3.879463
4 g4 5.650807 5.407914 5.275643 5.861464 2.324446 3.264727 5.408597 4.507168
5 g5 4.599654 4.759382 4.884349 5.579460 5.800788 4.795471 6.024505 5.909373
6 g6 6.876879 5.517926 5.643089 3.483676 3.612655 4.515598 4.930901 5.819108
        V9      V10      V11      V12      V13      V14      V15      V16
1 5.046695 4.754169 5.095201 5.149990 3.726476 5.696773 4.263394 4.862385
2 4.962743 6.074990 4.558015 4.786401 5.553230 6.275602 6.177083 5.102095
3 6.012066 4.361054 5.537735 4.073976 1.792642 5.416899 5.141389 5.989319
4 4.074255 5.523890 4.639627 5.407497 6.207404 3.856961 5.949720 6.686026
5 4.138026 3.682854 4.622768 3.132228 6.382126 5.062196 5.736041 6.642561
6 4.692801 2.922151 4.097171 4.064394 5.789938 5.727504 5.285034 5.377427
       V17      V18      V19      V20      V21      V22      V23      V24
1 4.991008 4.459578 6.048417 4.266364 3.807513 5.283473 6.032422 5.368753
2 5.532549 5.979006 6.303026 5.401933 4.531284 5.001930 4.138563 6.121215
3 4.788855 5.405929 5.676899 4.358891 2.472142 6.022590 4.901663 5.316409
4 4.530390 4.593472 5.600979 3.622897 5.861879 4.144052 4.558184 4.006792
5 6.363592 7.301809 5.304116 4.497104 3.357192 5.276580 4.201096 5.479395
6 5.889729 4.746235 6.116987 3.572043 4.499701 4.066766 5.082184 5.079333
       V25      V26      V27      V28      V29      V30      V31      V32
1 6.178755 5.469035 5.133563 6.013357 5.823264 4.703017 4.825605 6.548529
2 5.738477 6.094853 5.750052 5.633534 3.962843 3.845614 4.905880 6.542503
3 4.594617 4.406782 4.139983 4.110194 4.510571 3.835070 2.324135 4.640877
4 5.000650 5.333070 3.641680 3.776053 4.320409 3.795953 4.699010 4.661010
5 4.518155 4.133883 6.555049 4.690443 5.408843 5.908734 3.115737 4.767881
6 5.709747 4.972326 4.623785 6.024156 4.165088 3.814495 3.754400 5.362111
       V33      V34      V35      V36      V37      V38      V39      V40
1 6.074201 4.637895 5.963162 5.149934 6.145046 5.162786 4.370661 4.547202
2 5.346729 2.993135 5.959419 5.121172 4.359308 4.338535 3.384428 4.862964
3 5.619098 6.544011 5.967610 4.690218 4.534544 6.418835 5.272990 6.066150
4 5.144772 4.505196 4.279054 6.429831 2.782665 4.754179 6.096041 4.545527
5 6.812343 6.028331 6.019221 3.853566 5.844935 5.311485 5.450077 4.353076
6 5.501419 4.667838 6.482780 5.861671 3.817286 4.780757 5.605666 6.190851
       V41      V42      V43      V44      V45      V46      V47      V48
1 5.720116 6.211222 4.815094 3.921329 4.901294 4.546117 3.447361 4.078494
2 5.185166 3.793884 4.909923 4.864893 5.409808 6.253952 7.121168 4.464691
3 5.790955 4.376616 4.829560 4.438646 4.773821 5.812942 5.883623 3.733891
4 6.124777 6.263156 5.297797 5.520871 5.556417 5.619737 4.395538 6.496174
5 5.092493 5.189021 5.727976 6.155475 4.059136 3.806145 6.594611 3.969410
6 2.883992 4.750415 5.655113 4.370001 5.314489 6.293899 5.402266 5.593141
       V49      V50      V51      V52      V53      V54      V55      V56
1 4.551983 5.819125 5.762279 3.168550 3.505140 5.285660 4.724664 4.494607
2 4.577767 3.380670 4.107821 5.567447 5.250366 4.226882 3.690066 4.982820
3 4.868485 5.359673 3.528811 3.155938 4.580974 6.374070 6.116970 3.327392
4 5.513688 6.131716 5.181182 4.861411 4.193219 5.258645 5.735494 5.230569
5 5.437288 5.773874 3.303768 5.027765 5.018502 5.694614 6.791881 4.111746
6 6.713024 4.526375 5.043547 7.137926 5.425495 5.781473 5.541498 5.429820
       V57      V58      V59      V60      V61      V62      V63      V64
1 4.296691 4.377400 4.376483 5.321198 6.388320 5.064709 5.162853 4.019591
2 4.848653 4.905347 2.323099 6.453359 4.552458 3.930554 5.690574 3.547708
3 3.601917 5.726875 5.414916 5.644300 5.094185 4.134423 5.415775 4.923046
4 5.222174 4.696431 4.821157 6.212127 4.865990 3.373336 5.444049 4.595454
5 4.922701 3.824046 5.897676 6.529609 3.886122 7.274337 5.548672 2.549356
6 3.441980 5.025678 5.080832 5.788324 5.100784 6.761977 4.398632 4.000356
       V65      V66      V67      V68      V69      V70      V71      V72
1 4.923512 4.664266 5.878825 4.392025 7.013793 5.039744 6.347794 4.368044
2 5.331035 5.117373 6.233275 6.793568 5.372448 5.145313 7.825414 5.526237
3 5.941335 6.497956 4.425859 5.583389 4.909189 4.828701 5.618557 4.298644
4 4.698041 6.121371 4.542766 3.517158 5.357803 5.013158 4.986922 4.778756
5 4.707375 5.244703 4.310720 5.789077 6.506347 5.560654 5.717559 5.137161
6 3.394732 5.482108 3.883088 5.819925 4.799342 5.215981 4.520553 5.387043
       V73      V74      V75      V76      V77      V78      V79      V80
1 4.683822 6.139306 4.104005 5.707957 3.740388 5.814773 4.513040 4.183421
2 4.146756 5.307924 5.272356 4.848509 4.895523 4.771086 4.626003 4.563771
3 5.087736 5.314075 4.901544 2.297678 3.411354 4.975285 5.279001 4.003360
4 6.764912 5.133256 5.747375 5.062046 4.886778 3.404264 4.741954 3.499389
5 4.542173 3.273382 4.778795 5.926067 5.003578 4.058655 5.736812 2.711732
6 5.738357 4.134741 4.755400 4.513462 5.388790 5.864513 5.059827 6.890566
       V81      V82      V83      V84      V85      V86      V87      V88
1 4.526883 4.783565 4.786384 5.555627 6.337569 4.396340 3.585259 4.715301
2 4.784796 5.192889 5.589974 5.355136 5.163703 3.290003 4.529233 6.274656
3 5.301133 5.348789 3.370633 4.999918 4.997089 6.256013 5.775012 3.959496
4 4.802881 4.343019 5.508328 3.977580 6.031508 4.182687 5.777253 6.092924
5 5.538353 4.863254 5.742152 6.681762 7.059251 3.290441 5.597650 5.690200
6 3.587473 4.722587 4.515208 4.414284 4.703778 5.835368 3.560059 5.136654
       V89      V90      V91      V92      V93      V94      V95      V96
1 5.493883 5.873617 4.644489 6.494478 4.032862 4.936308 4.573684 4.456069
2 5.407064 6.067520 3.321282 5.610636 4.795262 3.988340 5.417969 5.147895
3 3.782407 4.704428 5.925810 4.893748 6.076285 5.311923 4.270228 4.949681
4 5.154203 4.390103 4.237873 5.812461 5.108407 3.775063 7.349431 4.240232
5 5.944214 3.499390 3.859823 3.968412 5.917704 4.635817 3.402167 5.649226
6 3.540830 4.545460 5.138707 4.548875 4.409225 7.133449 5.784451 3.371152
       V97      V98      V99     V100     V101     V102     V103     V104
1 3.711938 4.251340 3.624188 6.406170 3.973471 5.486686 4.105118 4.789627
2 7.000143 3.262676 4.668057 4.605993 5.950100 4.820340 6.160594 5.234965
3 5.855436 5.531985 5.319734 5.191446 4.819705 4.507898 4.086777 3.424632
4 5.656077 6.309154 3.880870 4.954607 5.784981 4.356224 6.637017 5.471912
5 3.998176 3.991265 3.039439 5.794197 3.888120 5.046702 5.936344 3.365480
6 4.161558 3.560711 5.743837 4.710821 4.484502 3.233531 7.280418 6.836993
      V105     V106     V107     V108     V109     V110     V111     V112
1 5.769201 5.565115 6.183324 5.520009 4.634011 3.613336 4.177035 4.881119
2 7.276156 2.853189 5.852580 5.860316 6.214701 4.966156 5.383706 5.487937
3 5.871512 5.255832 4.578514 3.959603 4.196363 5.474006 4.666215 5.840079
4 5.460103 5.173652 4.759960 6.171854 6.801173 4.210054 5.237202 3.861530
5 3.724461 5.747829 4.822595 6.012881 4.054244 4.637784 4.093999 4.020042
6 7.940341 5.811296 5.930907 7.809427 4.251641 4.748225 5.094627 5.110725
      V113     V114     V115     V116     V117     V118     V119     V120
1 4.987146 4.913228 5.110334 3.772343 2.682483 6.211539 4.850814 5.599100
2 4.834674 5.127217 4.204640 4.526421 3.508721 3.650036 4.052185 5.913104
3 5.452899 4.426930 5.518171 6.087777 5.081648 6.972123 5.959294 5.673686
4 2.263969 4.343241 3.805550 5.354741 4.833048 4.516973 5.173126 4.539688
5 8.170309 6.808176 4.694113 6.230110 5.743177 4.235224 7.307878 5.799364
6 5.307912 6.842500 3.400969 4.412704 4.695000 5.356514 5.315039 6.394183
      V121     V122     V123     V124     V125     V126     V127     V128
1 4.837811 3.176587 5.723621 3.626873 5.237398 5.773331 6.400131 4.512487
2 5.312730 4.065275 4.290000 5.282050 8.420880 4.887247 4.208798 5.097351
3 4.618823 4.886204 5.541206 6.129387 6.516689 6.141189 4.687675 4.781287
4 4.424384 4.795294 4.942370 6.508199 5.228837 2.976860 3.792249 4.344598
5 4.022017 5.215821 6.401940 6.784248 4.402175 6.516412 6.696392 4.684964
6 6.669169 4.234785 5.104439 4.630779 6.619447 4.847408 4.098612 5.297746
      V129     V130     V131     V132     V133     V134     V135     V136
1 5.045645 5.206508 4.228816 4.907293 5.258209 4.935308 4.051225 5.782975
2 5.148691 5.825192 3.267361 4.927677 4.960907 3.588340 5.472098 4.808207
3 3.682798 6.888697 5.381562 4.254884 5.287771 5.501009 5.311556 4.275023
4 4.487685 6.051952 4.814930 4.388216 5.079611 4.520684 5.600147 5.566095
5 5.379018 4.979327 4.131550 6.628577 5.321466 5.332824 5.100258 6.357093
6 3.784997 5.305350 7.530523 5.350628 4.712961 5.200575 5.512191 5.299063
      V137     V138     V139     V140     V141     V142     V143     V144
1 4.701849 5.610987 5.773380 6.720639 4.128870 3.612225 4.806581 7.412460
2 6.489296 6.339733 3.909340 4.896907 5.136859 7.324349 3.712625 6.670309
3 5.715601 5.766192 6.282850 3.243623 5.550133 2.712845 6.034370 5.169247
4 5.705741 4.401447 5.281007 4.446311 4.769076 4.707313 5.402644 3.405910
5 4.047475 6.231947 3.928555 4.965154 4.046271 4.083697 5.256179 5.304632
6 4.493477 5.888687 5.829484 6.638821 5.781970 4.530036 4.749808 4.408159
      V145     V146     V147     V148     V149     V150
1 4.943786 6.669690 4.866534 5.095807 5.539344 4.469927
2 6.434635 4.671978 3.729377 4.945724 4.993280 5.663794
3 5.744492 5.806716 6.405228 4.608565 4.713444 4.966509
4 5.758867 5.058666 5.059374 6.490075 5.734340 4.727975
5 4.083437 6.526214 5.972734 5.525372 5.591436 4.119335
6 5.309619 5.528707 5.202900 6.278569 3.592465 5.412743

Original code for centering columns of a dataframe:

library(profvis)
profvis({
  # Store in another variable for this run
  data1 <- data
  
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})

Profile apply vs colMeans vs lapply vs vapply:

profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})

We decide to use vapply:

profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))

  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})

Calculate mean and center in one pass:

profvis({
 data1 <- data
 
 # Given a column, normalize values and return them
 col_norm <- function(col) {
   return(col - mean(col))
 }
 
 # Apply the normalizer function over all columns except id
 data1[, names(data1) != "id"] <-
   lapply(data1[, names(data1) != "id"], col_norm)
})

5.2 Example: profiling memory

Original code for cumulative sums:

profvis({
  data <- data.frame(value = runif(1e5))

  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i-1] + data$value[i]
  }
})

Write a function to avoid expensive indexing by $:

profvis({
  csum <- function(x) {
    if (length(x) < 2) return(x)

    sum <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    return(sum)
  }
  data$sum <- csum(data$value)
})

Pre-allocate vector:

profvis({
  csum2 <- function(x) {
    if (length(x) < 2) return(x)

    sum <- numeric(length(x))  # Preallocate
    sum[1] <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum2(data$value)
})
Error in parse_rprof_lines(lines, expr_source): No parsing data available. Maybe your function was too fast?

5.3 Advice

Modularize big projects into small functions. Profile functions as early and as frequently as possible.

6 Debugging

Learning sources: