Select Topics in R Programming

Typical development cycle for computational statistics
Bytecode compilation
- Example: summing a vector
Rcpp
- Use cppFunction
- Use sourceCpp
Parallel computing
Package development

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.1.2  magrittr_2.0.2  fastmap_1.1.0   cli_3.2.0       tools_4.1.2     htmltools_0.5.2 rstudioapi_0.13
##  [8] yaml_2.2.2      jquerylib_0.1.4 stringi_1.7.6   rmarkdown_2.11  knitr_1.37      stringr_1.4.0   xfun_0.29      
## [15] digest_0.6.29   rlang_1.0.1     evaluate_0.14

Typical development cycle for computational statistics

Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?
Code planning: What does the code need to do? How will the code fit together? What functions will be used? What are their inputs/outputs etc.
Implementation:
1. Prototype functions, classes, etc., partial documentation
2. Write unit tests
3. Implement code, run unit tests, debug
4. Broader testing, more debugging
5. Profile code, identify bottlenecks
6. Optimize code
Conduct experiments.
Full documentation.

Bytecode compilation

After profiling, what to do to improve performance?
1. Ask are there obvious speedups? Are things being computed unnecessarily? Are you using a data.frame where you should be using a matrix etc.
2. Look up your problem on internet (e.g., search for "lapply slow" or "speeding up lapply" etc.)
3. Try the just-in-time (JIT) compiler.
4. Consider re-writing some or all of the code in a compiled language (e.g., C/C++).
5. Try parallelization.
R typical execution:

Since R 2.1.4, the compiler package by Luke Tierney is distributed with base R. compiler package compiles an R function into bytecode.

Example: summing a vector

Brute-force for loop for summing a vector:

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
sum_r

## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }

Run the code on 1e6 elements:

library(microbenchmark)
library(ggplot2)

x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))

## Unit: milliseconds
##      expr      min       lq     mean median       uq     max neval
##  sum_r(x) 23.87565 24.40799 25.03142 24.802 25.34322 29.4934   100

Let's compile the function into bytecode sum_rc and benchmark again:

library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc

## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fa06f563d38>

Benchmark again:

microbenchmark(sum_r(x), sum_rc(x))

## Unit: milliseconds
##       expr      min       lq     mean   median       uq      max neval
##   sum_r(x) 23.94092 24.38157 24.90104 24.73841 25.09523 30.86835   100
##  sum_rc(x) 23.78484 24.38262 25.01077 24.77257 25.24357 39.25395   100

Surprise! Surprise! Compiling into bytecode does not help at all. Following code shows that the function sum_r is already compiled into bytecode before execution.

sum_r

## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fa06facc470>

Let's turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:

enableJIT(0) # set JIT level to 0

## [1] 3

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
microbenchmark(sum_r(x))

## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  sum_r(x) 302.8338 309.2754 314.8936 312.2404 314.3593 367.2415   100

Now we witness the slowness of the un-compiled sum_r.

Documentation of enableJIT:

enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.

Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.

If you create a package, then you automatically compile the package on installation by adding

ByteCompile: true

to the DESCRIPTION file.

Note Matlab has employed JIT technology since 2002 and Julia is designed totally based on JIT. R finally is on the same boat.

Rcpp

Learning sources:

Advanced R: https://adv-r.hadley.nz/rcpp.html

compiler package compiles R code into bytecode, which is translated to machine code by interpreter during execution. A low-level language such as C, C++, and Fortran is compiled into machine code directly, yielding the maximum efficiency.

Use `cppFunction`

Rcpp package provides a convenient way to embed C++ code in R code.

library(Rcpp)

cppFunction('double sum_c(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
sum_c

## function (x) 
## .Call(<pointer: 0x10c9df3f0>, x)

Benchmark (1) compiled C++ function sum_c together with (2) R function sum_r, (3) compiled R function sum_rc, and (4) the sum function in base R:

mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm

## Unit: milliseconds
##       expr        min         lq       mean     median         uq        max neval
##   sum_r(x) 298.534493 306.931543 313.406882 309.593094 312.703165 357.442944   100
##  sum_rc(x)  23.651712  24.246180  24.674176  24.632272  25.026094  26.431014   100
##   sum_c(x)   1.207248   1.374153   1.433109   1.430508   1.465812   1.833153   100
##     sum(x)   2.366705   2.469039   2.558540   2.540915   2.612300   3.057396   100

autoplot(mbm)

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Remember we turned off JIT by enableGIT(0) earlier.

Use `sourceCpp`

In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example

cat sum.cpp

## #include <Rcpp.h>
## using namespace Rcpp;
## 
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp 
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## //   http://www.rcpp.org/
## //   http://adv-r.had.co.nz/Rcpp.html
## //   http://gallery.rcpp.org/
## //
## 
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
##   int n = x.size();
##   double total = 0;
##   for(int i = 0; i < n; ++i) {
##     total += x[i];
##   }
##   return total;
## }
## 
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically 
## // run after the compilation.
## //
## 
## /*** R
## sum_c(as.double(1:10))
## */

sourceCpp("sum.cpp")

## 
## > sum_c(as.double(1:10))
## [1] 55

sum_c

## function (x) 
## .Call(<pointer: 0x10ff863f0>, x)

Parallel computing

Fact: base R is single-threaded. Even you request a fancy instance with 96 vCPUs, running R code is just using 1/96th of its power.
Embarrassing Parallelism: seperate a problem into a small number of large parallel pieces, where different pieces of the problem never really have to communicate with each other.
To perform multi-core computation in R using parallel package.
- Authors: Brian Ripley, Luke Tieney, Simon Urbanek.
- Included in base R since 2.14.0 (2011).
- Based on the snow (Luke Tierney) and multicore (Simon Urbanek) packages.
- To find the number of cores:

library(parallel)
detectCores()

## [1] 8

Simulation example

We have a "new" method that estimates the population mean by averaging the observations indexed by prime numbers.

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function(x) {
  n <- length(x)
  ind <- sapply(1:n, isPrime)
  return (mean(x[ind]))
}

We want to compare our method to the traditional sample average estimator by simulation studies.

## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
  # set seed according to command argument `seed`
  set.seed(seed)
  
  # preallocate space to store estimators
  msePrimeAvg <- 0.0
  mseSamplAvg <- 0.0
  # loop over simulation replicates
  for (r in 1:reps) {
    # simulate data according to command arguments `n` and `distr`
    if (dist == "gaussian") {
      x = rnorm(n)
    } else if (dist == "t1") {
      x = rcauchy(n)
    } else if (dist == "t5") {
      x = rt(n, 5)
    } else {
      stop(paste("unrecognized dist: ", dist))
    }
    # prime indexed mean estimator and classical sample average estimator
    msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2 
    mseSamplAvg <- mseSamplAvg + mean(x)^2
  }
  mseSamplAvg <- mseSamplAvg / reps
  msePrimeAvg <- msePrimeAvg / reps
  return(c(mseSamplAvg, msePrimeAvg))
}

Serial code

We need to loop over 3 generative models (distTypes) and 20 samples sizes (nVals). That are 60 embarrassingly parallel tasks.

seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")

This is the serial code that double-loop over combinations of distTypes and nVals:

## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
  for (dist in distTypes) {
    for (n in nVals) {
      #print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
      simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
      i <- i + 2
    }
  }
)

##    user  system elapsed 
##  32.177   0.082  32.296

simres1

##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

Using `mcmapply`

Run the same task using mcmapply function (parallel analog of mapply) in the parallel package:

## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
  simres2 <- mcmapply(compare_methods, 
                      rep(distTypes, each = length(nVals), times = 1),
                      rep(nVals, each = 1, times = length(distTypes)),
                      reps, 
                      seed,
                      mc.cores = 4)
})

##    user  system elapsed 
##  26.294   0.493   9.599

simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2

##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

We see roughly 3x-4x speedup with mc.cores=4.
mcmapply, mclapply and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, MacOS) and is not available in Windows.
parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions create a cluster of workers based on either socket (default) or forking. Socket is available on all platforms: Linux, MacOS, and Windows.

Using `clusterMap`

The same simulation example using clusterMap function:

cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
  simres3 <- clusterMap(cl, compare_methods,
                        rep(distTypes, each = length(nVals), times = 1),
                        rep(nVals, each = 1, times = length(distTypes)),
                        reps,
                        seed,
                        .scheduling = "dynamic")
})

##    user  system elapsed 
##   0.021   0.006   7.068

simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3

##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

Again we see roughly 3x-4x speedup by using 4 cores.
clusterExport export the data and other information that the child process need from the parent process to the child prrocess.

Package development

An R package is the basic unit of reusable code.

Main components of an R package:

A DESCRIPTION file
An R directory with the code
A man directory with documentation (create this automatically by devtools::document() and roxygen2)
A NAMESPACE file listing user-level functions in the package (create this automatically by roxygen2)
Optional but highly recommended: a tests directory (create by usethis::use_testthat() and test by devtools::test())
Optional: a src directory with compiled code (C++)

Learning resources:

Book _R Packages_ by Hadley Wickham and Jennt Bryan
RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio
Writing a package that uses Rcpp: https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-package.pdf

Select Topics in R Programming - Part II

Biostat 203B

Dr. Hua Zhou @ UCLA

Mar 8, 2022

Typical development cycle for computational statistics

Bytecode compilation

Example: summing a vector

Rcpp

Use `cppFunction`

Use `sourceCpp`

Parallel computing

Simulation example

Serial code

Using `mcmapply`

Using `clusterMap`

Package development

Select Topics in R Programming - Part II

Biostat 203B

Dr. Hua Zhou @ UCLA

Mar 8, 2022

Typical development cycle for computational statistics

Bytecode compilation

Example: summing a vector

Rcpp

Use cppFunction

Use sourceCpp

Parallel computing

Simulation example

Serial code

Using mcmapply

Using clusterMap

Package development

Use `cppFunction`

Use `sourceCpp`

Using `mcmapply`

Using `clusterMap`