20  Writing Efficient Code

When we talk about efficient code, we mean computational efficiency, or writing code that is fast to execute. For this topic, we need to peek under the hood at how R implements our code in order to understand what can slow it down. We use the microbenchmark package (Mersmann 2023) to time our code execution. For demonstration, we use a subset of the pain dataset from the HDSinRdata package.

library(microbenchmark)
library(HDSinRdata)
library(tidyverse)

data(pain)
pain <- pain %>% 
  select(-c(BMI, GH_MENTAL_SCORE, GH_PHYSICAL_SCORE, 
                  PROMIS_PAIN_BEHAVIOR)) %>%
  na.omit()

The function Sys.time() returns the current system time. By recording the time before and after a chunk of code runs, we can find the net time required to execute it. For example, the following code finds how long it takes to loop through the data and count the number of people whose average pain intensity at follow-up was at least as high as at baseline.

start_time <- Sys.time()
num <- 0
for (i in 1:nrow(pain)){
  if (pain$PAIN_INTENSITY_AVERAGE[i] <= 
      pain$PAIN_INTENSITY_AVERAGE.FOLLOW_UP[i]){
    num <- num + 1
  }
}
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.0197 secs

Using Sys.time() is a simple approach to time our code. The function microbenchmark() from the microbenchmark package allows us to replicate an expression multiple times to get an average execution time. We can pass multiple expressions to this function and specify a unit of measurement and number of times to replicate each expression.

microbenchmark(expr1 = first_expression, expr2 = second_expression, 
  unit = "ms", times = 10)

For example, we time the same expression as before using this package. This time, we get summary statistics in milliseconds, and we use the default of 100 replications.

microbenchmark( pain_inc = {
  num <- 0
  for (i in 1:nrow(pain)){
    if (pain$PAIN_INTENSITY_AVERAGE[i] <= 
        pain$PAIN_INTENSITY_AVERAGE.FOLLOW_UP[i]){
      num <- num + 1
    }
  }
}, unit = "ms")
#> Warning in microbenchmark(pain_inc = {: less accurate nanosecond times
#> to avoid potential integer overflows
#> Unit: milliseconds
#>      expr  min   lq mean median   uq  max neval
#>  pain_inc 15.9 16.3 17.1   16.7 17.9 20.4   100

20.1 Use Fast and Vectorized Functions

R is known as a slower programming language. In part, R sacrifices computation time to make it more welcoming to new programmers. However, not all of R is actually written in R: many base R functions are written in C or Fortran, and these are significantly faster than versions we would write ourselves. When possible, we should use base R functions rather than writing our own. Let’s take a look at the difference in time between writing our own summation function and using the sum() function.

my_sum <- function(x){
  out <- 0
  for (i in 1:length(x)){
    out <- out + x[i]
  }
  return(out)
}

x <- 1:100000
microbenchmark(sum_function = my_sum(x),
               builtin_sum = sum(x),
               unit = "ms")
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq     max
#>  sum_function 2.223758 2.233168 2.283640 2.248358 2.290691 3.61124
#>   builtin_sum 0.000041 0.000041 0.000284 0.000082 0.000144 0.00328
#>  neval
#>    100
#>    100

Using sum() also reduces our code to a single line. Looking back at our first example, we can simplify it in the same way: we did not need to loop through the data. Instead, we can use a vectorized approach that performs the comparison across all observations at once and then sums the resulting TRUE/FALSE values.

microbenchmark({
  vector_pain = sum(pain$PAIN_INTENSITY_AVERAGE <= 
                      pain$PAIN_INTENSITY_AVERAGE.FOLLOW_UP)
}, unit = "ms")
#> Unit: milliseconds
#>                                                                                             expr
#>  {     vector_pain = sum(pain$PAIN_INTENSITY_AVERAGE <= pain$PAIN_INTENSITY_AVERAGE.FOLLOW_UP) }
#>     min     lq   mean median     uq    max neval
#>  0.0144 0.0155 0.0156 0.0156 0.0158 0.0214   100

This is much faster. Loops are notoriously slow in R, so you often hear the advice to avoid them. Apply functions such as apply() and sapply() are loop-hiding functions: using them instead of an explicit loop does not by itself improve the efficiency of our code, although it often simplifies the code as a byproduct of rewriting it.
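
To illustrate this, the following minimal sketch (the helper functions loop_check() and sapply_check() are hypothetical names, not part of the earlier example) counts the same comparison with an explicit loop and with sapply(). We would expect the two timings to be similar, since sapply() still evaluates the function once per row in R.

loop_check <- function(df){
  out <- logical(nrow(df))
  for (i in 1:nrow(df)){
    # record the comparison for each row
    out[i] <- df$PAIN_INTENSITY_AVERAGE[i] <= 
      df$PAIN_INTENSITY_AVERAGE.FOLLOW_UP[i]
  }
  return(sum(out))
}

sapply_check <- function(df){
  # sapply() hides the loop but still calls the function row by row
  res <- sapply(1:nrow(df), function(i){
    df$PAIN_INTENSITY_AVERAGE[i] <= 
      df$PAIN_INTENSITY_AVERAGE.FOLLOW_UP[i]
  })
  return(sum(res))
}

microbenchmark(explicit_loop = loop_check(pain),
               sapply_loop = sapply_check(pain),
               unit = "ms")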

The functions that can truly improve efficiency over loops are vectorized functions, which are evaluated on a whole vector of values. Vectorizing our code means thinking in terms of operations on entire vectors rather than computing something for each element and looping through those values. A vectorized function returns output with the same dimensions as its input and operates on the elements in an efficient manner.

The following code chunk shows a simple example comparing taking the square root of each element in a loop with calling the vectorized sqrt() function. Vectorized functions still operate on each element, but that inner loop is written in C or Fortran rather than R, making it significantly faster. This is another case where we can take advantage of the speed of compiled base R functions.

microbenchmark(loop_sqrt = {
  y <- x
  for (i in 1:length(y)){
    y[i] <- sqrt(y[i])
  }
  }, sqrt = sqrt(x), unit = "ms")
#> Unit: milliseconds
#>       expr   min    lq  mean median    uq   max neval
#>  loop_sqrt 4.421 4.463 4.617  4.507 4.647 7.218   100
#>       sqrt 0.101 0.141 0.158  0.151 0.169 0.232   100

20.1.1 Practice Question

First, read the documentation for the function tapply(). This is another function in the apply family that was not covered in Chapter 17. Rewrite the following code to use the tapply() function instead of a loop. Would you expect this approach to be faster? Why or why not?

mean_pain <- c()
pat_races <- unique(pain$PAT_RACE)

for (r in pat_races){
  mean_pain[r] <- mean(pain$PAIN_INTENSITY_AVERAGE[pain$PAT_RACE == r])
}

20.2 Avoid Copies and Duplicates

Another aspect of our programs that can slow down our operations is any time we need to create a large object. Look at the following code. First, we create a matrix m. We then create a matrix n that is equal to m. Last, we update n by taking the logarithm of all elements plus one, differentiating it from m. R creates copies on modification. That means that when we initialize n, we have not actually created a second matrix in memory; instead, we have two names for the same matrix. On the third line, we modify n, so R must then create a second matrix that is distinct from m.

m <- matrix(rpois(100000, 6), ncol=1000)
n <- m
n <- log(n + 1)

This is different from the subsequent code, which updates m itself. In this case, R can modify the matrix in place, going through each element and updating its value rather than creating a new matrix. As the size of our data grows, creating copies becomes expensive. Imagine m being a large genetic dataset with RNA sequencing data. First, this object may take up a lot of memory, so creating a copy could mean running out of memory. Second, copying over all of this information takes time.

m <- log(m + 1)
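
One way to observe copy-on-modify directly is the base R function tracemem(), which prints a message whenever the object it is tracking gets duplicated. The following is a minimal sketch; the names m2 and n2 are used here only so that we do not interfere with m above.

m2 <- matrix(rpois(100000, 6), ncol = 1000)
n2 <- m2           # no copy yet: n2 and m2 are two names for one matrix
tracemem(m2)       # report whenever this matrix is duplicated
n2[1, 1] <- 0      # modifying n2 forces a duplicate, which tracemem reports
untracemem(m2)     # stop tracking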

Let’s consider another case. In the following code, we have two functions that both find the proportion of people within 1, 2, and more than 2 standard deviations from the mean for one of the PROMIS instrument variables. When we pass pain as an argument to the first function, we do not create a copy of it, since we have not yet modified the data frame. However, once we create a new column, we have to copy the full data frame. The second function instead takes in a single column, so only that information needs to be copied. The difference in execution time shows an edge for the second method, but the difference is small. This indicates that computing the new categories and finding the proportions takes more time than the duplication itself.

code_promis1 <- function(df){
  # create new column with categories
  df$PAIN_PHYSICAL_FUNCTION_CUT <- case_when(
    abs(df$PROMIS_PHYSICAL_FUNCTION-50) <= 10 ~ "<= 1SD",
    abs(df$PROMIS_PHYSICAL_FUNCTION-50) <= 20 ~ "<= 2 SD",
    TRUE ~ "> 2SD")
  
  # get proportions
  res <- prop.table(table(df$PAIN_PHYSICAL_FUNCTION_CUT))
  return(res)
}

code_promis2 <- function(v){
  # convert the values to categories
  v <- case_when(
    abs(v-50) <= 10 ~ "<= 1SD",
    abs(v-50) <= 20 ~ "<= 2 SD",
    TRUE ~ "> 2SD")
  
  # get proportions
  res <- prop.table(table(v))
  return(res)
}

microbenchmark(code_promis1(pain), 
               code_promis2(pain$PROMIS_PHYSICAL_FUNCTION), 
               unit = "ms")
#> Unit: milliseconds
#>                                         expr   min    lq  mean median
#>                           code_promis1(pain) 0.645 0.696 0.847  0.730
#>  code_promis2(pain$PROMIS_PHYSICAL_FUNCTION) 0.622 0.674 0.770  0.703
#>     uq  max neval
#>  0.807 3.54   100
#>  0.741 3.76   100

Another time when we create copies of objects is when we modify their size. Functions like cbind(), rbind(), and c() create a new object and copy over the existing information to build the combined vector, matrix, or data frame. If we know the size of the final vector, matrix, or data frame, we can pre-allocate that space and fill in the values, which means the computer won’t have to repeatedly find more space. For example, take a look at the two ways to simulate a random walk in one dimension in the following code chunk. In the first method, the length of the vector v changes on each iteration of the loop, whereas in the second, v always has length n.

rw1 <- function(n){
  v <- c(0)
  for (i in 2:n){
    v[i] <- v[i-1] + rbinom(1, 1, 0.5)
  }
  return(v)
}

rw2 <- function(n){
  v <- rep(0, n)
  for (i in 2:n){
    v[i] <- v[i-1] + rbinom(1, 1, 0.5)
  }
  return(v)
}

microbenchmark(rw1(10000), rw2(10000), unit="ms")
#> Unit: milliseconds
#>        expr  min   lq mean median   uq   max neval
#>  rw1(10000) 5.16 5.29 5.69   5.37 5.53 11.20   100
#>  rw2(10000) 4.29 4.36 4.58   4.40 4.56  6.57   100

This also works for data frames or matrices. In the following code chunk, we generate a random matrix in three ways. The first creates the matrix with a single line, the second initializes an empty matrix and then fills in each row, and the last dynamically updates the size of the matrix on each iteration.

random_mat1 <- function(n){
  m <- matrix(sample(1:3, n^2, replace = TRUE), nrow = n)
  return(m)
}

random_mat2 <- function(n){
  m <- matrix(nrow = n, ncol = n)
  for (i in 1:n){
    m[i,] <- sample(1:3, n, replace = TRUE)
  }
  return(m)
}

random_mat3 <- function(n){
  m <- NULL
  for (i in 1:n){
    m <- rbind(m, sample(1:3, n, replace = TRUE))
  }
  return(m)
}

microbenchmark(random_mat1(100),
               random_mat2(100),
               random_mat3(100),
               unit = "ms")
#> Unit: milliseconds
#>              expr   min    lq  mean median    uq  max neval
#>  random_mat1(100) 0.233 0.243 0.260  0.247 0.255 1.32   100
#>  random_mat2(100) 0.496 0.512 0.578  0.520 0.534 4.16   100
#>  random_mat3(100) 1.076 1.156 1.351  1.188 1.230 5.15   100

This demonstrates that if you need to update the values of a vector, matrix, or data frame, you should try to do as much of the reassignment at once as possible. For example, changing a whole column at a time is better than changing its individual values one by one, since this avoids additional copies.
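
As a rough sketch of this advice, the following comparison (using the hypothetical functions update_by_element() and update_at_once()) rescales the same column either one element at a time or with a single assignment. The element-by-element version triggers many more internal copies of the data frame, so we would expect it to be considerably slower.

update_by_element <- function(df){
  for (i in 1:nrow(df)){
    # each assignment modifies the data frame and can trigger a copy
    df$PAIN_INTENSITY_AVERAGE[i] <- df$PAIN_INTENSITY_AVERAGE[i] / 10
  }
  return(df)
}

update_at_once <- function(df){
  # a single vectorized assignment replaces the whole column
  df$PAIN_INTENSITY_AVERAGE <- df$PAIN_INTENSITY_AVERAGE / 10
  return(df)
}

microbenchmark(by_element = update_by_element(pain),
               whole_column = update_at_once(pain),
               unit = "ms", times = 10)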

20.2.1 Practice Question

The following code fits a linear model for each racial group and records the coefficient. Rewrite this code so that we pre-allocate the results vector, and use the microbenchmark package to compare the efficiency of the two approaches.

coefs <- c()
pat_races <- unique(pain$PAT_RACE)
for (r in pat_races){
  df <- pain[pain$PAT_RACE == r, ]
  if (nrow(df) > 3){
    new_coef <- lm(PROMIS_DEPRESSION ~ PROMIS_ANXIETY,
                 data = df)$coefficients[2]
    coefs <- c(coefs, new_coef)
  }
}

20.3 Parallel Programming

Another approach to making our code more efficient is parallel processing. When we run loops in R, only one iteration runs at a time. For example, the following code runs a random walk 100 times in serial. Parallel processing allows us to execute multiple calls to this function at the same time by running them on separate cores, or processors, of your machine. For example, if we had six cores available, we could run 1/6 of the replications on each processor and reduce our overall computation time. The parallel package (R Core Team 2024) contains functions to implement parallel processing on different operating systems. Unfortunately, this functionality is often not supported within RStudio and is not covered in detail in this book. We recommend the mclapply() function from the parallel package, which implements parallel processing using forking; this approach does not work on Windows but is much simpler to use. For parallel processing on Windows, we recommend looking into the socket approach using the parLapply() function in the parallel package.

replicate(100, rw1(1000))
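
As a minimal sketch of what forking could look like on a Mac or Linux machine, mclapply() can replace the serial call and split the replications across the number of processes given by mc.cores; the choice of two cores here is only an illustration.

library(parallel)

# serial version: one random walk after another
serial_walks <- lapply(1:100, function(i) rw1(1000))

# forked version: iterations are split across two processes
# (on Windows, mc.cores must be left at 1, which runs serially)
parallel_walks <- mclapply(1:100, function(i) rw1(1000), mc.cores = 2)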

20.4 Exercises

  1. The following code chunk includes four attempts to add a new column LOWER_BACK to the pain data. Note that the second and third attempts are vectorized, whereas the first and fourth are not. Time each approach and order them from fastest to slowest.

    # Attempt 1: loop
    pain$LOWER_BACK <- vector(mode="logical", length=nrow(pain))
    for (i in 1:nrow(pain)) { # for every row
      if ((pain$X218[i] == 1) | (pain$X219[i] == 1)) { 
        pain$LOWER_BACK[i] <- TRUE 
      } else {
        pain$LOWER_BACK[i] <- FALSE
      }
    }
    
    # Attempt 2: logic
    pain$LOWER_BACK <- ((pain$X218 == 1) | 
                          (pain$X219 == 1))
    
    # Attempt 3: which
    pain$LOWER_BACK <- FALSE
    true_ind <- which((pain$X218 == 1) | (pain$X219 == 1))
    pain$LOWER_BACK[true_ind] <- TRUE
    
    # Attempt 4: apply
    back_pain <- function(x){
      if ((x['X218'] == 1) | (x['X219'] == 1)){
        return(TRUE)
      } 
      return(FALSE)
    }
    pain$LOWER_BACK <- apply(pain, 1, back_pain)
  2. Examine the following code and determine what is being computed. Then, rewrite the code to make it more efficient. Use the microbenchmark package to compare the execution time, and explain why your approach is more efficient.

    n <- 100000
    x1 <- rnorm(n, 10, 1)
    x2 <- rbinom(n, 1, 0.2)
    y <- numeric(0)
    
    for (i in 1:n){
      if (x2[i] == 1){
        y[i] <- rnorm(1, 2 * x1[i], 0.7)
      } else {
        y[i] <- rnorm(1, 1 + 3 * x1[i], 0.2)
      }
    }
    df <- data.frame(x1 = x1, x2 = x2, y = y)
  3. Suppose we want to find the five most frequently reported pain regions by racial group. Code your solution (a) using at least one loop and (b) pivoting the data on the body region columns and then using dplyr functions to summarize. Compare the efficiency of both approaches.