To loop or not to loop…?

In R you have multiple options when repeating calculations: vectorized operations, for loops, and apply functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, analyze, over multiple data files:

analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
filenames <- list.files(pattern = "csv")
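
In that lesson we then looped over the file names, calling analyze on each one in turn, roughly like this:

for (f in filenames) {
  analyze(f)
}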

Vectorized operations

A key difference between R and many other languages is a topic known as vectorization. When you wrote the total function, we mentioned that R already has sum to do this; sum is much faster than the interpreted for loop because sum is coded in C to work on a whole vector of numbers at once. Many of R's functions work this way: the loop still happens, but it is hidden from you inside compiled C code. Learning to use vectorized operations is a key skill in R.
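
You can see the difference for yourself with a small timing sketch (loop_sum is just a throwaway helper, and the exact timings will vary by machine, but sum should win by a wide margin):

loop_sum <- function(v) {
  # add the elements one at a time in interpreted R code
  s <- 0
  for (x in v) s <- s + x
  s
}
v <- runif(1e7)
system.time(loop_sum(v))   # the interpreted loop
system.time(sum(v))        # the vectorized, C-level sum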

To see vectorization in action, suppose we want to add pairs of numbers contained in two vectors

a <- 1:10
b <- 1:10

you could loop over the pairs, adding each in turn, but that would be very inefficient in R.

res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
 [1]  2  4  6  8 10 12 14 16 18 20

Instead, + is a vectorized function which can operate on entire vectors at once

res2 <- a + b
all.equal(res, res2)
[1] TRUE
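
The same is true of the other arithmetic operators, comparisons, and many built-in functions:

a * b      # element-wise multiplication
a - b      # element-wise subtraction
sqrt(a)    # many built-in functions are vectorized too
a > 5      # comparisons return a whole logical vector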

for or apply?

A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You've already used one member of the family, apply, in the first lesson. The apply family members include:

  • apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
  • lapply - apply over an object and return a list
  • sapply - apply over an object and return a simplified object (an array) if possible
  • vapply - similar to sapply, but you specify the type of object returned by each iteration (the short sketch below compares the return values of lapply, sapply, and vapply)
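
To see how the return values differ, here is a small sketch with a toy list (the list x and the use of mean are just for illustration):

x <- list(a = 1:3, b = 4:6)
lapply(x, mean)                          # a list of means
sapply(x, mean)                          # simplified to a named numeric vector
vapply(x, mean, FUN.VALUE = numeric(1))  # like sapply, but the return type is checked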

Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:

sapply(filenames, FUN = analyze)

Deciding whether to use for or one of the apply family is really a matter of personal preference. Using an apply family function forces you to encapsulate your operations as a function, rather than as separate calls inside a for loop. for loops are often more natural in some circumstances: when several related operations rely on objects that already exist in your workspace, a for loop saves you from having to pass a lot of extra arguments into your function.
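
For instance, suppose each file's daily means should be compared against a baseline vector that already exists in your workspace (baseline and diff_from_baseline below are made-up names for illustration). In a for loop the baseline is simply visible; with sapply it has to be passed in as an extra argument, which sapply forwards to the function:

# In a for loop, objects in the surrounding workspace are simply visible:
for (f in filenames) {
  dat <- read.csv(f, header = FALSE)
  print(mean(apply(dat, 2, mean) - baseline))
}

# With sapply, the helper function needs everything passed in explicitly
# (extra arguments after FUN are forwarded to it):
diff_from_baseline <- function(filename, baseline) {
  dat <- read.csv(filename, header = FALSE)
  mean(apply(dat, 2, mean) - baseline)
}
sapply(filenames, diff_from_baseline, baseline = baseline)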

Loops in R are slow

No, they are not, as long as you follow some golden rules:

  1. Don't use a loop when a vectorized alternative exists
  2. Don't grow objects (via c, cbind, etc.) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
  3. Allocate an object to hold the results and fill it in during the loop (see the sketch after this list)
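
As a minimal sketch of rules 2 and 3, compare growing a vector of squares with c against filling a pre-allocated one (grow and fill are throwaway names, and timings will vary by machine):

grow <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i^2)   # rule 2 broken: out is copied every iteration
  out
}

fill <- function(n) {
  out <- numeric(n)                   # rule 3: allocate once...
  for (i in 1:n) out[i] <- i^2        # ...and fill in place
  out
}

system.time(grow(50000))
system.time(fill(50000))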

As a more realistic example, we'll create a new version of analyze that returns the mean inflammation per day (column) of each file.

analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))
   user  system elapsed 
  0.044   0.000   0.045 

Notice how we add a new column to out at each iteration? This is a cardinal sin when writing a for loop in R.

Instead, we can create an empty matrix with the right dimensions (rows and columns) to hold the results. Then we loop over the files, but this time we fill in the f-th column of our results matrix out. This time there is no copying or growing for R to deal with.

analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## nrow = 40 because each file has 40 columns (one mean per day)
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))
   user  system elapsed 
  0.056   0.004   0.057 

In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files, so out is copied and grown only a handful of times. If we were processing many more files, or the objects we were growing were larger, the penalty for copying and growing would be far more noticeable.

Note that the apply family handles these memory allocation issues for you, but then you have to write the loop body as a function to pass in. At its heart, apply is just a for loop with some extra convenience.
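
For instance, here is a sketch of the analyze3 calculation written with sapply (analyze4 is our own name for it): each call returns the vector of daily means, and sapply assembles those vectors into a matrix for us, so there is nothing to pre-allocate and no hard-coded number of days.

analyze4 <- function(filenames) {
  # each call returns one file's daily means; sapply binds them into a matrix
  sapply(filenames, function(f) apply(read.csv(f, header = FALSE), 2, mean))
}
avg4 <- analyze4(filenames)

The columns of the result are named after the files, which can be handy for keeping track of which column came from where.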