9.3 Parallel Processing

Most modern computers have multiple processing units, or CPU cores, e.g. a 4-core or 8-core processor. In principle, this means these computers can execute multiple processes simultaneously, dividing the wall clock time required on a single-core machine by a factor equal to the number of cores the machine has. This can be a very meaningful performance improvement, e.g. an 8-core machine could reduce the running time of a computation from a week to less than a day. Some data centers possess compute nodes with 28 or even 64 cores, meaning a computation that would normally run for two months could complete in a single day if all cores are used! Processes that run at the same time on different cores are said to run in parallel, while a computation that is not parallelized runs serially.

9.3.1 Brief Introduction to Parallelization

Not all computations can be parallelized, and some can only be made parallel using sophisticated mathematical models and algorithms. However, certain classes of problems are pleasingly parallel, meaning their structure makes them easy, if not trivial, to divide into parts that can run in parallel. In general, how “parallelizable” a computation is depends on how the subparts of the computation relate to one another.

Consider the problem of mapping high throughput sequencing reads against a reference genome. If the task is to identify the location(s) in the genome where each read matches, the locations of one read do not influence the locations of another read; they are independent of each other. Since they are independent, in principle they can be processed in parallel, meaning that every one of the millions of reads could, again in principle, be aligned against the genome at the same time. The wall clock time required to align all the reads in a multi-million read dataset would then be only as long as it takes to align a single read. This makes short read alignment a pleasingly parallel problem. Unfortunately, in practice there are technical constraints preventing this idealized speedup, but most modern alignment programs like bwa and STAR exploit this inherently parallel structure to improve performance by using multiple cores at the same time.

Splitting up a computation into pieces that can run in parallel does not always lead to performance improvements. How much a computation benefits depends on which aspects of the algorithm or program require the most time to run. Briefly, an algorithm can be:

  • compute-bound - the computations take the largest amount of time
  • memory-bound - the amount of main memory (i.e. RAM) on the machine determines how quickly the computation can complete
  • input/output (IO)-bound - reading from and writing to disk takes the most time
  • network-bound - transferring data over a network, usually between processes running in parallel, takes the most time

These concepts are covered in the field of high-performance computing and are beyond the scope of this class. For the purposes of this section, the computations we are concerned with parallelizing are compute-bound.

Parallelism can be attained at both the multiprocessor and the distributed computing level. In distributed computing, different subtasks of a parallelizable computation are executed on different physical computers, and each of these subtasks may be further parallelized using multiple cores on its own machine. In cluster computing or cloud environments, enormous performance improvements can be attained by utilizing both levels of parallelization. For example, if you submit 100 jobs via qsub to a compute cluster and they all run at the same time, and each job uses 16 cores on its local machine, that amounts to a 1,600x speedup. This parallelism could reduce a computation that would take over 4 years on a single core to a single day!

9.3.2 apply and Friends Are Pleasingly Parallel

In R, loops are most efficiently expressed using the apply function and its relatives lapply and vapply. Recall that these functions accept a function and a collection of inputs, usually a list. The apply function iterates over the input collection and executes the function on each element. In general, the inputs are assumed to be independent of each other, and the function does not have any outside dependencies. This is precisely the structure of a pleasingly parallel task, as sketched below. In general, any iteration using an apply function can be parallelized in R.
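As a concrete sketch (the per-sample counts here are made-up data for illustration), the following lapply call computes a total for each sample in a list; because each iteration depends only on its own element, nothing about the loop would need to change for it to run in parallel.

# hypothetical per-sample read counts; each element is processed independently
gene_counts <- list(
    sample1 = c(12, 45, 3, 99),
    sample2 = c(5, 88, 41, 7),
    sample3 = c(63, 2, 19, 54)
)

# lapply runs the function on each element and returns a list of results;
# no iteration depends on the result of any other iteration
total_counts <- lapply(gene_counts, function(counts) sum(counts))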

9.3.3 The parallel package

R can leverage multiple core architectures to execute processes in parallel with the parallel package, which is installed as a base R package. The main function in this package that enables parallelism is mclapply(), or multicore apply. mclapply accepts all of the same arguments as lapply plus a few additional arguments whose names start with mc.. The argument to mclapply that enables parallelism is mc.cores, which specifies the number of cores to use when executing tasks in parallel.

mclapply uses process forking under the hood, so parallel execution is only available on Mac and Linux; on Windows it can only run with mc.cores = 1, i.e. serially.
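How many cores to request depends on the machine the script runs on. The parallel package also provides detectCores(), which reports how many cores are available. Below is a minimal sketch of one common pattern, leaving a core free for the rest of the system (how many cores to reserve is a judgment call, not a rule):

library(parallel)

# ask R how many cores this machine has
n_cores <- detectCores()

# use all but one of the available cores, but never fewer than one
ret <- mclapply(1:10, function(x) x^2, mc.cores=max(1, n_cores - 1))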

Consider the following script mclapply.R which runs a function that prints out its input with the current time and then waits (i.e. sleeps) for three seconds:

library(parallel)

# the number of cores to use is read from the command line
args <- commandArgs(trailingOnly=TRUE)
cores <- as.integer(args[1])

# print the current time and the input, then sleep for three seconds;
# mc.cores controls how many of these calls may run at the same time
ret <- mclapply(
    1:10,
    function(x) {
        print(paste(Sys.time(),'process',x))
        Sys.sleep(3)
    },
    mc.cores=cores
)

Note that the number of cores to use is passed in as a command line argument. If we run the script with one core, we can see the function calls run serially:

$ Rscript mclapply.R 1
[1] "2022-03-23 20:33:09 process 1"
[1] "2022-03-23 20:33:12 process 2"
[1] "2022-03-23 20:33:15 process 3"
[1] "2022-03-23 20:33:18 process 4"
[1] "2022-03-23 20:33:21 process 5"
[1] "2022-03-23 20:33:24 process 6"
[1] "2022-03-23 20:33:27 process 7"
[1] "2022-03-23 20:33:30 process 8"
[1] "2022-03-23 20:33:33 process 9"
[1] "2022-03-23 20:33:36 process 10"

If we instead run the script with three cores, we can see from the printed times that the functions are executed in groups of 3:

$ Rscript mclapply.R 3
[1] "2022-03-23 20:29:56 process 1"
[1] "2022-03-23 20:29:56 process 2"
[1] "2022-03-23 20:29:56 process 3"
[1] "2022-03-23 20:29:59 process 4"
[1] "2022-03-23 20:29:59 process 5"
[1] "2022-03-23 20:29:59 process 6"
[1] "2022-03-23 20:30:02 process 7"
[1] "2022-03-23 20:30:02 process 8"
[1] "2022-03-23 20:30:02 process 9"
[1] "2022-03-23 20:30:05 process 10"

Finally, if we use 10 cores (and the machine we are using has at least 10 cores), all the functions execute simultaneously:

$ Rscript mclapply.R 10
[1] "2022-03-23 20:34:59 process 1"
[1] "2022-03-23 20:34:59 process 2"
[1] "2022-03-23 20:34:59 process 3"
[1] "2022-03-23 20:34:59 process 4"
[1] "2022-03-23 20:34:59 process 5"
[1] "2022-03-23 20:34:59 process 6"
[1] "2022-03-23 20:34:59 process 7"
[1] "2022-03-23 20:34:59 process 8"
[1] "2022-03-23 20:34:59 process 9"
[1] "2022-03-23 20:34:59 process 10"

This introduction to parallelization in R will likely be sufficient for most simple parallel tasks, but there are many details that we won’t cover here. For a more in-depth explanation of how parallelism in R works, refer to the links in the read more box below.