sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.1.2 magrittr_2.0.2 fastmap_1.1.0 cli_3.2.0 tools_4.1.2 htmltools_0.5.2 rstudioapi_0.13
## [8] yaml_2.2.2 jquerylib_0.1.4 stringi_1.7.6 rmarkdown_2.11 knitr_1.37 stringr_1.4.0 xfun_0.29
## [15] digest_0.6.29 rlang_1.0.1 evaluate_0.14
Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?
Code planning: What does the code need to do? How will the code fit together? What functions will be used? What are their inputs/outputs etc.
Implementation:
Prototype functions, classes, etc., partial documentation
Write unit tests
Implement code, run unit tests, debug
Broader testing, more debugging
Profile code, identify bottlenecks
Optimize code
Conduct experiments.
Full documentation.
After profiling, what to do to improve performance?
Ask are there obvious speedups? Are things being computed unnecessarily? Are you using a data.frame
where you should be using a matrix etc.
Look up your problem on internet (e.g., search for "lapply slow" or "speeding up lapply
" etc.)
Try the just-in-time (JIT) compiler.
Consider re-writing some or all of the code in a compiled language (e.g., C/C++).
Try parallelization.
R typical execution:
compiler
package by Luke Tierney is distributed with base R. compiler
package compiles an R function into bytecode.Brute-force for
loop for summing a vector:
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
Run the code on 1e6 elements:
library(microbenchmark)
library(ggplot2)
x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 23.87565 24.40799 25.03142 24.802 25.34322 29.4934 100
Let's compile the function into bytecode sum_rc
and benchmark again:
library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7fa06f563d38>
Benchmark again:
microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 23.94092 24.38157 24.90104 24.73841 25.09523 30.86835 100
## sum_rc(x) 23.78484 24.38262 25.01077 24.77257 25.24357 39.25395 100
Surprise! Surprise! Compiling into bytecode does not help at all. Following code shows that the function sum_r
is already compiled into bytecode before execution.
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7fa06facc470>
Let's turn off JIT (just-in-time compilation), re-define the (same) sum_r
function, and benchmark again:
enableJIT(0) # set JIT level to 0
## [1] 3
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 302.8338 309.2754 314.8936 312.2404 314.3593 367.2415 100
Now we witness the slowness of the un-compiled sum_r
.
Documentation of enableJIT
:
enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.
Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.
If you create a package, then you automatically compile the package on installation by adding
ByteCompile: true
to the DESCRIPTION
file.
Note Matlab has employed JIT technology since 2002 and Julia is designed totally based on JIT. R finally is on the same boat.
Learning sources:
compiler
package compiles R code into bytecode, which is translated to machine code by interpreter during execution. A low-level language such as C, C++, and Fortran is compiled into machine code directly, yielding the maximum efficiency.
cppFunction
Rcpp
package provides a convenient way to embed C++ code in R code.
library(Rcpp)
cppFunction('double sum_c(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
sum_c
## function (x)
## .Call(<pointer: 0x10c9df3f0>, x)
Benchmark (1) compiled C++ function sum_c
together with (2) R function sum_r
, (3) compiled R function sum_rc
, and (4) the sum
function in base R:
mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 298.534493 306.931543 313.406882 309.593094 312.703165 357.442944 100
## sum_rc(x) 23.651712 24.246180 24.674176 24.632272 25.026094 26.431014 100
## sum_c(x) 1.207248 1.374153 1.433109 1.430508 1.465812 1.833153 100
## sum(x) 2.366705 2.469039 2.558540 2.540915 2.612300 3.057396 100
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Remember we turned off JIT by enableGIT(0)
earlier.
sourceCpp
In realistic projects, we write standalone C++ files and then source them into R using sourceCpp()
. For example
cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
##
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## // http://www.rcpp.org/
## // http://adv-r.had.co.nz/Rcpp.html
## // http://gallery.rcpp.org/
## //
##
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
## int n = x.size();
## double total = 0;
## for(int i = 0; i < n; ++i) {
## total += x[i];
## }
## return total;
## }
##
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically
## // run after the compilation.
## //
##
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
##
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x)
## .Call(<pointer: 0x10ff863f0>, x)
Fact: base R is single-threaded. Even you request a fancy instance with 96 vCPUs, running R code is just using 1/96th of its power.
Embarrassing Parallelism: seperate a problem into a small number of large parallel pieces, where different pieces of the problem never really have to communicate with each other.
To perform multi-core computation in R using parallel
package.
Authors: Brian Ripley, Luke Tieney, Simon Urbanek.
Included in base R since 2.14.0 (2011).
Based on the snow
(Luke Tierney) and multicore
(Simon Urbanek) packages.
To find the number of cores:
library(parallel)
detectCores()
## [1] 8
We have a "new" method that estimates the population mean by averaging the observations indexed by prime numbers.
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function(x) {
n <- length(x)
ind <- sapply(1:n, isPrime)
return (mean(x[ind]))
}
We want to compare our method to the traditional sample average estimator by simulation studies.
## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
# set seed according to command argument `seed`
set.seed(seed)
# preallocate space to store estimators
msePrimeAvg <- 0.0
mseSamplAvg <- 0.0
# loop over simulation replicates
for (r in 1:reps) {
# simulate data according to command arguments `n` and `distr`
if (dist == "gaussian") {
x = rnorm(n)
} else if (dist == "t1") {
x = rcauchy(n)
} else if (dist == "t5") {
x = rt(n, 5)
} else {
stop(paste("unrecognized dist: ", dist))
}
# prime indexed mean estimator and classical sample average estimator
msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2
mseSamplAvg <- mseSamplAvg + mean(x)^2
}
mseSamplAvg <- mseSamplAvg / reps
msePrimeAvg <- msePrimeAvg / reps
return(c(mseSamplAvg, msePrimeAvg))
}
We need to loop over 3 generative models (distTypes
) and 20 samples sizes (nVals
). That are 60 embarrassingly parallel tasks.
seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")
This is the serial code that double-loop over combinations of distTypes
and nVals
:
## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
for (dist in distTypes) {
for (n in nVals) {
#print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
i <- i + 2
}
}
)
## user system elapsed
## 32.177 0.082 32.296
simres1
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
mcmapply
Run the same task using mcmapply
function (parallel analog of mapply
) in the parallel
package:
## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
simres2 <- mcmapply(compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
mc.cores = 4)
})
## user system elapsed
## 26.294 0.493 9.599
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
We see roughly 3x-4x speedup with mc.cores=4
.
mcmapply
, mclapply
and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, MacOS) and is not available in Windows.
parLapply
, parApply
, parCapply
, parRapply
, clusterApply
, clusterMap
, and related functions create a cluster of workers based on either socket (default) or forking. Socket is available on all platforms: Linux, MacOS, and Windows.
clusterMap
The same simulation example using clusterMap
function:
cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
simres3 <- clusterMap(cl, compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
.scheduling = "dynamic")
})
## user system elapsed
## 0.021 0.006 7.068
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
Again we see roughly 3x-4x speedup by using 4 cores.
clusterExport
export the data and other information that the child process need from the parent process to the child prrocess.
An R package is the basic unit of reusable code.
Main components of an R package:
A DESCRIPTION
file
An R
directory with the code
A man
directory with documentation (create this automatically by devtools::document()
and roxygen2
)
A NAMESPACE
file listing user-level functions in the package (create this automatically by roxygen2
)
Optional but highly recommended: a tests
directory (create by usethis::use_testthat()
and test by devtools::test()
)
Optional: a src
directory with compiled code (C++)
Learning resources:
Book _R Packages_ by Hadley Wickham and Jennt Bryan
RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio
Writing a package that uses Rcpp: https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-package.pdf