Most modern computers have multiple processing units, or CPU cores, e.g. a 4-core or 8-core processor. In principle, this means these computers can execute multiple processes simultaneously, dividing the wall clock time required to run a computation on a single-core machine by a factor equal to the number of cores the machine has. This can be a very meaningful performance improvement: an 8-core machine could reduce the running time of a computation from a week to less than a day. Some data centers possess compute nodes with 28 or even 64 cores, meaning a computation that would normally run for two months could complete in a single day if all cores are used! Processes that run at the same time on different cores are said to run in parallel, while a computation that is not parallelized runs serially.
Not all computations can be parallelized, and some can only be made parallel using sophisticated mathematical models and algorithms. However, certain classes of problems are pleasingly parallel, meaning their structure makes them easy, if not trivial, to divide into parts that can be run in parallel. In general, how "parallelizable" a computation is depends on how the subparts of the computation relate to one another.
Consider the problem of mapping high throughput sequencing reads against a reference genome. If the task is to identify the location(s) in the genome where each read matches, the locations of one read do not influence the locations of another read; they are independent of each other. Since they are independent, in principle they can be aligned in parallel, meaning that, again in principle, every one of the millions of reads could be aligned against the genome at the same time. The wall clock time required to align all the reads in a multi-million read dataset would then be only as long as it takes to align a single read. This makes short read alignment a pleasingly parallel problem. In practice, technical constraints prevent this full increase in performance, but most modern alignment programs like bwa and STAR exploit this inherently parallel structure to improve performance by using multiple cores at the same time.
In general, splitting up a computation into pieces that can run in parallel does not always lead to performance improvements. Computations are constrained in several ways, depending on which aspects of the algorithm or program require the most time to run. Briefly, an algorithm can be:

- compute (CPU) bound - the time to complete is determined primarily by the speed of the processor
- memory bound - the time to complete is determined primarily by how much memory is available and how quickly it can be accessed
- I/O bound - the time to complete is determined primarily by how quickly data can be read from or written to disk or the network

These concepts are covered in the field of high-performance computing and are beyond the scope of this class. For the purposes of this section, however, the problems we are concerned with parallelizing are compute bound.
Parallelism can be attained on both the multiprocessor and the distributed
computing level. In distributed computing, different subtasks of a
parallelizable computation are executed on different physical computers. Each
of these subtasks could possibly be further parallelized using multiple cores on
each system. In cluster computing or cloud environments, enormous performance
improvements can be attained by utilizing both levels of parallelization. For
example, if you submit 100 jobs via qsub
to a compute cluster and they all run
at the same time, and each job uses 16 cores on its local machine, that amounts
to a 1,600x speedup. This parallelism could reduce a computation that would
otherwise take over 4 years to a single day!
apply and Friends Are Pleasingly Parallel

In R, loops are most efficiently expressed using the apply function and its variants lapply and vapply. Recall that these functions accept a function and a collection of inputs, usually a list. The apply function iterates over the input collection and executes the function on each element. In general, the inputs are assumed to be independent of one another, and the function is assumed to have no outside dependencies. This is precisely the pattern needed for pleasingly parallel tasks. In general, any iteration using an apply function can be parallelized in R.
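To make this concrete, here is a minimal sketch (not from the text above) of the pattern: each element is processed independently by the same function, so the iterations could in principle run in any order, or all at once:

# each input is independent and the function has no outside dependencies,
# so every iteration could in principle execute at the same time
inputs <- 1:10
squares <- lapply(inputs, function(x) x^2)  # a list: 1, 4, 9, ..., 100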
parallel package

R can leverage multiple core architectures to execute processes in parallel with the parallel package, which is installed as a base R package. The main function in this package that enables parallelism is mclapply(), or multicore apply. mclapply accepts all of the same arguments as lapply, plus a few additional arguments prefixed with mc.. The argument to mclapply that enables parallelism is mc.cores, which specifies the number of cores to use when executing tasks in parallel.
Note that mclapply relies on process forking and therefore cannot run in parallel on Windows; it is fully supported only on Mac and Linux.
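Since the number of usable cores varies from machine to machine, a common pattern is to ask R how many cores are available rather than hard-coding a value. The parallel package provides detectCores() for this; the choice below to leave one core free is our convention for this sketch, not a rule from the package:

library(parallel)

# detectCores() reports how many cores this machine has; leaving one
# free for the operating system is a common conservative choice
n_cores <- max(1, detectCores() - 1)
results <- mclapply(1:8, function(x) x^2, mc.cores=n_cores)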
Consider the following script mclapply.R
which runs a function that prints out
its input with the current time and then waits (i.e. sleeps) for three seconds:
library(parallel)

# the number of cores to use is read from the command line
args <- commandArgs(trailingOnly=TRUE)
cores <- as.integer(args[1])

# run 10 tasks, each printing a timestamp and then sleeping for 3 seconds
ret <- mclapply(
  1:10,
  function(x) {
    print(paste(Sys.time(),'process',x))
    Sys.sleep(3)
  },
  mc.cores=cores
)
Note that the number of cores to use is passed in as a command line argument. If we run the script with one core, we can see the functions run serially:
$ Rscript mclapply.R 1
[1] "2022-03-23 20:33:09 process 1"
[1] "2022-03-23 20:33:12 process 2"
[1] "2022-03-23 20:33:15 process 3"
[1] "2022-03-23 20:33:18 process 4"
[1] "2022-03-23 20:33:21 process 5"
[1] "2022-03-23 20:33:24 process 6"
[1] "2022-03-23 20:33:27 process 7"
[1] "2022-03-23 20:33:30 process 8"
[1] "2022-03-23 20:33:33 process 9"
[1] "2022-03-23 20:33:36 process 10"
If we instead run the script with three cores, we can see from the printed times that the functions are executed in groups of 3:
$ Rscript mclapply.R 3
[1] "2022-03-23 20:29:56 process 1"
[1] "2022-03-23 20:29:56 process 2"
[1] "2022-03-23 20:29:56 process 3"
[1] "2022-03-23 20:29:59 process 4"
[1] "2022-03-23 20:29:59 process 5"
[1] "2022-03-23 20:29:59 process 6"
[1] "2022-03-23 20:30:02 process 7"
[1] "2022-03-23 20:30:02 process 8"
[1] "2022-03-23 20:30:02 process 9"
[1] "2022-03-23 20:30:05 process 10"
Finally, if we use 10 cores (and the machine we are using has at least 10 cores), all the functions execute simultaneously:
$ Rscript mclapply.R 10
[1] "2022-03-23 20:34:59 process 1"
[1] "2022-03-23 20:34:59 process 2"
[1] "2022-03-23 20:34:59 process 3"
[1] "2022-03-23 20:34:59 process 4"
[1] "2022-03-23 20:34:59 process 5"
[1] "2022-03-23 20:34:59 process 6"
[1] "2022-03-23 20:34:59 process 7"
[1] "2022-03-23 20:34:59 process 8"
[1] "2022-03-23 20:34:59 process 9"
[1] "2022-03-23 20:34:59 process 10"
This introduction to parallelization in R should be sufficient for most simple parallel tasks, but there are many details we won't cover here. For a more in-depth explanation of how parallelism in R works, refer to the links in the read more box below.