Session informnation for reproducibility:

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.1.2  magrittr_2.0.2  fastmap_1.1.0   cli_3.2.0       tools_4.1.2     htmltools_0.5.2 rstudioapi_0.13
##  [8] yaml_2.2.2      jquerylib_0.1.4 stringi_1.7.6   rmarkdown_2.11  knitr_1.37      stringr_1.4.0   xfun_0.29      
## [15] digest_0.6.29   rlang_1.0.1     evaluate_0.14

Advanced R

To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must read. Read now to save numerous hours you might waste in future.

We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.

Style

Benchmark

Sources:

In order to identify performance issue, we need to measure runtime accurately.

system.time

set.seed(280)
x <- runif(1e6)

system.time({sqrt(x)})
##    user  system elapsed 
##   0.008   0.002   0.011
system.time({x ^ 0.5})
##    user  system elapsed 
##   0.026   0.000   0.025
system.time({exp(log(x) / 2)})
##    user  system elapsed 
##   0.018   0.000   0.018

Check ?proc.time for the explanation:

The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.

From William Dunlap:

"User CPU time" gives the CPU time spent by the current process (i.e., the current R session) and "system CPU time" gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

start <- Sys.time()
y <- exp(log(x) / 2)
sum_y <- sum(y)
end <- Sys.time()

end - start
## Time difference of 0.02336001 secs

microbenchmark

library("microbenchmark")
library("ggplot2")

mbm <- microbenchmark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2),
  times = 100
)
mbm
## Unit: milliseconds
##           expr       min        lq      mean    median        uq       max neval
##        sqrt(x)  2.257792  2.379045  3.371148  2.476715  2.829994  7.379589   100
##          x^0.5 21.359337 21.686701 23.376252 22.256155 25.097638 31.661316   100
##  exp(log(x)/2) 14.973111 15.312506 16.495652 15.530736 17.835068 21.333190   100

Results from microbenchmark can be nicely plotted in base R or ggplot2.

boxplot(mbm)

autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

bench

The bench package is another tool for microbenchmarking. The output is a tibble:

library(bench)

(lb <- bench::mark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2)
))
## # A tibble: 3 × 6
##   expression         min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 sqrt(x)         2.22ms   2.42ms     389.     7.63MB     98.2
## 2 x^0.5           21.9ms  22.87ms      43.1    7.63MB     10.2
## 3 exp(log(x)/2)  15.45ms  15.97ms      62.6    7.63MB     15.6

To visualize the result

plot(lb)
## Loading required namespace: tidyr

It is colored according to GC (garbage collection) levels.

Profiling

Premature optimization is the root of all evil (or at least most of it) in programming.
-Don Knuth

Sources:

Example: profiling time

First generate test data:

times <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
head(data)
##   id       V1       V2       V3       V4       V5       V6       V7       V8       V9      V10      V11      V12
## 1 g1 5.969774 3.386956 4.358790 6.237727 5.771275 5.174312 5.037446 4.430052 3.945362 6.991074 6.534173 6.317850
## 2 g2 4.921927 5.633818 5.694171 4.865483 4.153967 5.061355 4.371293 5.463183 5.313754 5.346877 5.352962 4.297411
## 3 g3 5.800428 5.549560 3.976693 4.171676 4.666550 7.123001 5.098244 5.293764 4.865347 4.281242 4.907693 4.644739
## 4 g4 7.898909 4.615669 4.377937 4.701138 5.101872 5.295811 4.312929 4.085740 3.162529 3.530056 3.844779 5.075276
## 5 g5 6.452353 5.880938 5.926650 4.594801 4.482622 5.954435 4.586131 3.444331 5.776090 3.128314 5.455446 5.906633
## 6 g6 4.469903 6.975048 4.217334 6.039186 3.682259 4.181690 5.154283 5.046682 6.352984 6.289363 5.674194 3.793744
##        V13      V14      V15      V16      V17      V18      V19      V20      V21      V22      V23      V24      V25
## 1 5.068947 5.795827 5.498428 4.983893 4.021139 5.065696 6.310364 5.490599 5.055548 4.238561 5.797920 4.650697 6.232738
## 2 3.143680 5.640137 5.002181 2.896327 4.800560 4.762682 3.607615 4.842496 6.059028 6.391931 7.143750 4.092829 5.430874
## 3 4.526487 5.236702 2.669363 4.749898 5.418325 4.843147 4.703208 5.158282 4.913269 5.997505 6.404710 5.901887 5.027204
## 4 4.674053 4.944213 5.056897 6.646479 4.596020 7.193654 5.363815 5.828282 4.825783 4.223200 4.941217 5.421718 6.068461
## 5 6.148552 4.399970 4.660836 3.235290 5.168906 3.363993 3.267161 6.299007 3.990615 4.938966 6.021251 4.856988 5.456461
## 6 3.325485 5.174891 4.695926 4.282914 3.766488 5.115333 3.881455 5.711343 5.260267 6.811890 6.109790 5.321400 6.915472
##        V26      V27      V28      V29      V30      V31      V32      V33      V34      V35      V36      V37      V38
## 1 5.156774 3.934398 3.779090 4.736236 3.174204 6.293351 5.757692 4.955279 6.137107 3.573192 4.722608 6.093101 5.095926
## 2 4.537045 6.011686 4.729784 4.512812 5.574587 5.457530 4.388429 4.955159 4.581849 5.322105 5.565669 5.911121 5.071331
## 3 4.896893 4.960644 4.795403 4.603998 4.743484 4.563350 3.839632 4.702261 5.063408 5.118905 2.957961 4.564707 2.932893
## 4 4.881969 5.554265 4.336586 6.011450 5.118163 4.219831 5.042094 4.108503 3.184510 4.287867 5.577840 5.290784 6.163332
## 5 6.307981 4.052145 5.367455 5.202699 4.090536 4.852753 3.555741 6.238919 4.414152 4.087591 4.255676 4.117047 5.126822
## 6 5.672714 3.100835 7.315498 5.420842 5.900181 4.055118 4.149567 5.278965 4.367095 2.899642 4.237844 6.699517 5.530437
##        V39      V40      V41      V42      V43      V44      V45      V46      V47      V48      V49      V50      V51
## 1 4.211594 5.593647 4.753439 5.169350 4.883482 3.657573 5.669892 4.241260 4.851962 5.456667 3.641526 5.823936 6.202179
## 2 3.647852 6.422367 6.187101 4.716270 5.597624 4.413653 4.230248 4.871157 5.293433 6.435590 5.709755 5.460538 4.667820
## 3 4.673558 4.913145 5.438515 4.850280 4.626902 4.459188 5.827907 5.663024 3.457956 5.721593 4.349273 5.278583 3.925600
## 4 5.269505 3.863132 5.892379 5.762037 5.179667 3.942267 5.778172 5.932535 5.524665 4.771691 6.172233 6.314465 6.106169
## 5 5.187829 5.426037 4.901255 3.280638 5.652811 4.206320 5.858897 6.658522 5.633509 4.963152 4.258391 6.554396 5.381009
## 6 5.831208 3.738705 3.961880 4.871338 5.252205 4.892354 5.132825 5.681368 7.427848 3.656597 5.196717 5.721946 6.507628
##        V52      V53      V54      V55      V56      V57      V58      V59      V60      V61      V62      V63      V64
## 1 4.682491 6.020682 2.888328 4.884376 5.080382 5.731241 3.716672 3.728804 4.225844 5.893676 5.236128 3.998496 5.052559
## 2 4.270379 4.517615 5.204095 4.635607 5.231288 3.800630 5.785667 3.602729 5.613489 4.841687 4.655119 5.548548 3.668189
## 3 5.579634 4.434822 5.277278 5.749743 5.783613 4.409653 4.344321 4.891546 4.541498 5.192576 5.162769 5.449993 4.251714
## 4 5.754318 5.706827 6.369271 5.991514 3.357682 5.511138 4.634418 4.294338 5.769028 5.738561 4.282375 5.683305 3.788938
## 5 4.826608 5.104414 5.925495 6.495626 3.528083 5.250808 6.254993 7.384261 5.711009 4.899243 4.683286 5.427239 4.776874
## 6 4.720609 4.800197 4.981737 4.804950 4.029324 4.003427 4.317158 5.344575 7.102476 3.520562 3.071539 5.494856 6.203147
##        V65      V66      V67      V68      V69      V70      V71      V72      V73      V74      V75      V76      V77
## 1 2.297626 5.044528 4.422164 4.317439 5.039205 5.306640 4.291307 5.752981 4.452937 4.556165 4.878821 5.494998 4.525805
## 2 5.042682 5.209917 6.233088 3.673400 5.550484 4.341351 5.621920 5.107727 5.582111 5.070413 3.553271 3.302500 4.883199
## 3 4.954122 3.964389 3.632149 5.902254 5.180800 5.905438 6.001834 4.165093 7.069034 5.077702 4.823577 3.950463 2.968174
## 4 4.699785 5.317041 4.387484 2.815468 5.945561 4.348289 4.949502 3.373685 3.325205 5.391997 5.576931 4.869578 5.218397
## 5 5.688282 4.922704 5.282254 5.433178 7.002321 5.236936 6.255325 3.910518 4.194588 5.936640 5.947325 6.494725 6.002182
## 6 6.325418 4.847684 7.013564 5.339205 3.909243 5.400927 5.143125 3.502081 3.909446 4.450924 4.379779 4.845526 5.499161
##        V78      V79      V80      V81      V82      V83      V84      V85      V86      V87      V88      V89      V90
## 1 5.598838 5.510052 4.335159 3.841143 4.854982 5.862762 6.055590 5.570398 5.605576 5.207664 4.721803 4.638047 4.648935
## 2 5.669165 6.934187 5.756974 6.290887 3.717678 5.352763 4.009332 4.243106 6.890147 6.300223 6.049974 5.293047 4.015187
## 3 2.981413 5.337635 5.745397 4.865970 3.963592 4.502974 4.874401 4.185517 4.027433 4.543304 4.727077 4.302911 4.715384
## 4 3.661650 4.783694 7.297424 5.910201 6.051057 3.789773 4.745911 4.428981 3.917806 4.718456 4.085395 6.146197 5.816736
## 5 4.566777 3.771062 5.382190 4.958616 2.225116 5.862221 4.738276 5.119915 3.244853 5.478629 5.415818 7.350988 5.664298
## 6 2.473238 4.043085 3.965896 3.419392 5.939352 4.620255 3.502446 7.679182 7.079275 4.400633 5.172185 5.316849 5.509925
##        V91      V92      V93      V94      V95      V96      V97      V98      V99     V100     V101     V102     V103
## 1 5.864618 4.781209 4.581879 4.869569 4.365018 3.933875 4.926531 5.451521 5.017248 5.636656 4.776153 6.359240 5.665790
## 2 6.509665 4.338551 4.430226 4.618964 4.688742 5.389455 4.288900 5.201093 5.817945 5.068234 5.083228 6.793019 5.144909
## 3 7.099666 4.696381 4.224483 3.484758 4.979978 5.162419 4.980340 3.123095 4.738060 4.542765 4.638650 5.814557 7.978221
## 4 2.364329 5.978111 3.916872 5.119176 6.135774 6.309547 4.931209 4.736936 3.554819 3.171603 4.795017 5.165080 4.827155
## 5 5.541353 6.000726 5.227135 5.361953 3.361382 4.935839 5.301755 4.509777 5.518396 4.667021 5.465201 5.009373 5.611512
## 6 4.510909 3.316033 5.467789 5.780469 4.307894 5.820533 3.036929 5.952751 3.954832 5.889513 5.194754 4.391003 5.226084
##       V104     V105     V106     V107     V108     V109     V110     V111     V112     V113     V114     V115     V116
## 1 6.934336 4.916882 3.682590 3.553191 6.178398 6.588264 3.722993 4.291036 6.209619 3.578261 6.869507 2.238355 4.383842
## 2 3.751889 5.029695 4.816148 3.061924 4.118335 5.784075 5.031129 4.164354 6.112392 5.941705 4.701242 3.920104 7.204458
## 3 5.774194 4.389153 4.837424 4.265490 5.130126 5.268602 5.803634 5.390342 6.347929 7.672582 3.462561 5.254552 4.463835
## 4 4.330540 4.128308 5.855140 4.941786 4.834643 4.263134 3.589032 5.405492 4.955846 4.916321 4.609474 5.251307 4.959063
## 5 5.683406 6.508588 4.252722 4.999025 4.962796 5.694972 4.879281 4.167013 4.673221 4.497916 6.566364 3.930197 4.736816
## 6 4.291935 5.436822 4.295753 3.682477 4.373195 5.968806 4.794256 4.186473 4.809896 4.757196 5.410635 5.014765 5.003775
##       V117     V118     V119     V120     V121     V122     V123     V124     V125     V126     V127     V128     V129
## 1 4.417523 6.048996 4.675138 5.842443 5.167029 5.750582 4.907758 5.865223 5.261798 4.956968 5.643643 5.812596 5.078079
## 2 4.485306 7.334596 5.865684 4.203705 3.606904 5.627711 5.587762 5.654150 5.269067 4.922916 4.823450 5.121979 5.395293
## 3 4.016373 6.311201 4.517438 4.700522 4.729896 4.529697 5.358662 5.112413 4.884113 6.012762 6.285833 4.477839 3.045543
## 4 3.014013 6.210320 4.482665 5.731177 3.996006 6.642559 4.243635 6.820607 6.025009 4.956143 5.423347 5.900742 6.472146
## 5 5.845934 6.098909 4.571074 6.489460 4.710165 5.496734 4.419287 4.525932 4.282508 2.885029 5.337152 4.868927 5.371391
## 6 4.754222 4.567107 3.474825 6.064451 5.758978 3.747149 4.523980 5.090233 4.284195 5.217521 5.638891 3.792649 7.184338
##       V130     V131     V132     V133     V134     V135     V136     V137     V138     V139     V140     V141     V142
## 1 6.353728 4.885178 5.647268 5.396727 4.887594 3.792632 5.802780 3.985640 5.488433 5.278389 6.098580 4.490176 4.915071
## 2 3.354223 5.184184 3.709929 4.930617 4.181895 5.545752 6.747582 5.242106 6.356912 4.750170 5.434490 5.296510 5.959623
## 3 4.332591 3.238270 5.711071 5.764195 4.717790 4.795234 5.768002 5.869586 5.333952 4.018724 7.537137 2.985009 6.095209
## 4 4.903896 3.245853 5.172849 5.101201 4.803509 6.542996 5.211635 6.129498 4.581313 5.013140 4.977602 5.929854 5.276177
## 5 4.888848 4.789275 4.892961 7.136214 6.075162 5.259250 4.575743 4.704665 4.872247 5.141363 6.538962 3.871012 5.762745
## 6 5.835156 4.187135 5.513385 2.575642 5.078832 4.306050 5.178530 4.919545 6.425867 3.329013 4.154823 3.671133 6.976441
##       V143     V144     V145     V146     V147     V148     V149     V150
## 1 4.637860 6.020174 3.860709 4.360477 4.848465 5.005363 6.678806 3.734484
## 2 5.673556 2.154634 6.513392 3.879315 4.890934 5.383120 5.062598 6.181172
## 3 4.210550 4.409794 5.687756 7.131847 6.448625 3.967368 5.851488 5.865080
## 4 4.564731 4.240755 3.590409 5.770194 5.654670 6.992913 6.333515 5.449932
## 5 5.553001 5.418704 4.207660 5.257442 4.447403 3.551037 4.092570 5.093482
## 6 5.135457 4.695289 5.825259 4.921798 3.497042 3.861398 4.491844 5.183752

Original code for centering columns of a dataframe:

library(profvis)
profvis({
  # Store in another variable for this run
  data1 <- data
  
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})

Profile apply vs colMeans vs lapply vs vapply:

profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})

We decide to use vapply:

profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))

  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})

Calculate mean and center in one pass:

profvis({
 data1 <- data
 
 # Given a column, normalize values and return them
 col_norm <- function(col) {
   return(col - mean(col))
 }
 
 # Apply the normalizer function over all columns except id
 data1[, names(data1) != "id"] <-
   lapply(data1[, names(data1) != "id"], col_norm)
})

Example: profiling memory

Original code for cumulative sums:

profvis({
  data <- data.frame(value = runif(1e5))

  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i-1] + data$value[i]
  }
})

Write a function to avoid expensive indexing by $:

profvis({
  csum <- function(x) {
    if (length(x) < 2) return(x)

    sum <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    return(sum)
  }
  data$sum <- csum(data$value)
})

Pre-allocate vector:

profvis({
  csum2 <- function(x) {
    if (length(x) < 2) return(x)

    sum <- numeric(length(x))  # Preallocate
    sum[1] <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum2(data$value)
})

Advice

Modularize big projects into small functions. Profile functions as early and as frequently as possible.

Debugging

Learning sources: