Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread
“There are only two hard things in Computer Science: cache invalidation and naming things.”
— Phil Karlton
Variable and function names should use only lowercase letters, numbers, and _
. Use underscores (_
) (so called snake case) to separate words within a name.
# Good day_one day_1 # Bad DayOne dayone
Always put a space after a comma, never before, just like in regular English.
# Good x[, 1] # Bad x[,1] x[ ,1] x[ , 1]
Do not put spaces inside or outside parentheses for regular function calls.
# Good mean(x, na.rm = TRUE) # Bad mean (x, na.rm = TRUE) mean( x, na.rm = TRUE )
Most infix operators (==
, +
, -
, <-
, etc.) should always be surrounded by spaces:
# Good height <- (feet * 12) + inches mean(x, na.rm = TRUE) # Bad height<-feet*12+inches mean(x, na.rm=TRUE)
# Good mean(1:10, na.rm = TRUE) # Bad mean(x = 1:10, , FALSE) mean(, TRUE, x = c(1:10, NA))
Curly braces, {}
, define the most important hierarchy of R code
{
should be the last character on the line}
should be the first character on the line.# Good if (y < 0 && debug) { message("y is negative") } # Bad if (y < 0 && debug) { message("Y is negative") }
#Good if (y == 0) { if (x > 0) { log(x) } else { message("x is negative or zero") } } else { y^x } # Bad if (y == 0) { if (x > 0) { log(x) } else { message("x is negative or zero") } } else { y ^ x }
Strive to limit your code to 80 characters per line
# Good do_something_very_complicated( something = "that", requires = many, arguments = "some of which may be long" ) # Bad do_something_very_complicated("that", requires, many, arguments, "some of which may be long" )
Use <-
, not =
, for assignment.
# Good x <- 5 # Bad x = 5
styler
packagestyler
package automatically formats code in specific waysstyler
packageinstall.packages("styler")
in RStudio, a new entry is available in the Addins menu:Some use cases make running scripts in RStudio ineffective
Must turn an R script into a tool
source()
function, to execute all code in a file in an R interpreter$ cat simple_script.R print('hello moon') print(1+2) print(str(c(1,2,3,4))) $ R > source("simple_script.R") [1] "hello moon" [1] 3 num [1:4] 1 2 3 4
Rscript
Rscript
command that can be run on the command line$ Rscript Usage: /path/to/Rscript [--options] [-e expr [-e expr2 ...] | file] [args] --options accepted are --help Print usage and exit --version Print version and exit --verbose Print information on progress --default-packages=list Where 'list' is a comma-separated set of package names, or 'NULL' or options to R, in addition to --no-echo --no-restore, such as --save Do save workspace at the end of the session --no-environ Don't read the site and user environment files --no-site-file Don't read the site-wide Rprofile --no-init-file Don't read the user R profile --restore Do restore previously saved objects at startup --vanilla Combine --no-save, --no-restore, --no-site-file --no-init-file and --no-environ
Rscript
Example$ cat simple_script.R print('hello moon') print(1+2) print(str(c(1,2,3,4))) $ Rscript simple_script.R [1] "hello moon" [1] 3 num [1:4] 1 2 3 4 $
Rscript
SpecificsRscript
is a convenience function that runs the following R
command:$ Rscript simple_script.R # is equivalent to: $ R --no-echo --no-restore --file=simple_script.R
commandArgs()
Rscript
command$ Rscript simple_script.R abc [1] "hello moon" [1] 3 num [1:4] 1 2 3 4
commandArgs()
commandArgs()
function to access command line arguments in the shellargs <- commandArgs(trailingOnly=TRUE)
commandArgs()
args
variable:$ cat echo_args.R print(commandArgs(trailingOnly=TRUE)) $ Rscript echo_args.R abc [1] "abc" $ Rscript echo_args.R abc 123 [1] "abc" "123" $ Rscript echo_args.R # no args character(0)
trailingOnly=FALSE
commandArgs()
includes the Rscript
command itself and any additional arguments:$ Rscript -e "commandArgs()" abc 123 [1] "/usr/bin/R" [2] "--no-echo" [3] "--no-restore" [4] "-e" [5] "commandArgs()" [6] "--args" [7] "abc" [8] "123"
trailingOnly=TRUE
The trailingOnly=TRUE
argument returns only the arguments provided at the end of the command, after the Rscript
portion:
$ Rscript -e "commandArgs(trailingOnly=TRUE)" abc 123 [1] "abc" "123"
args <- commandArgs(trailingOnly=TRUE) if(length(args) != 1) { # cat() writes characters to the screen cat("Usage: simple_script.R <csv file>\n") cat("Provide exactly one argument that is a CSV filename\n") quit(save="no", status=1) } fn <- args[1] library(tidyverse) read_csv(fn)
$ cat data.csv gene,sampleA,sampleB,sampleC g1,1,35,20 g2,32,56,99 g3,392,583,444 g4,39853,16288,66928
$ Rscript inspect_csv.R data.csv # A tibble: 4 Ă— 4 gene sampleA sampleB sampleC <chr> <dbl> <dbl> <dbl> 1 g1 1 35 20 2 g2 32 56 99 3 g3 392 583 444 4 g4 39853 16288 66928
commandArgs()
always produces a vector of stringshead.R
args <- commandArgs(trailingOnly=TRUE) if(length(args) != 2) { cat("Usage: head.R <filename> <N>\n") cat("Provide exactly two arguments: a CSV filename and an integer N\n") quit(save="no", status=1) } # read in arguments fn <- args[1] n <- as.integer(args[2])
head.R
# the csv output will include a header row, so reduce n by 1 n <- n-1 # suppressMessages() prevents messages like library loading text from being printed to the screen suppressMessages( { library(tidyverse, quietly=TRUE) read_csv(fn) %>% slice_head(n=n) %>% write_csv(stdout()) } )
head.R
$ Rscript head.R data.csv 3 gene,sampleA,sampleB,sampleC g1,1,35,20 g2,32,56,99
argparser
argparser
package helps build CLI argument interfaceslibrary(argparser, quietly=TRUE) parser <- arg_parser("R implementation of GNU coreutils head command") parser <- add_argument(parser, "filename", help="file to print lines from") parser <- add_argument(parser, "-n", help="number of lines to print", type="numeric", default=10) parser <- parse_args(parser, c("-n", 3, "data.csv")) print(paste("printing from file:", parser$filename)) print(paste("printing top n:", parser$n))
head.R + argparse
library(argparser, quietly=TRUE) parser <- arg_parser("R implementation of GNU coreutils head command") parser <- add_argument( parser, arg="filename", help="file to print lines from" ) parser <- add_argument( parser, arg="--top", help="number of lines to print", type="numeric", default=10, short='-n' )
head.R + argparse
args <- parse_args(parser) fn <- args$filename # the csv output will include a header row, so reduce n by 1 n <- args$top-1 suppressMessages( { library(tidyverse, quietly=TRUE) read_csv(fn) %>% slice_head(n=n) %>% write_csv(stdout()) } )
head.R + argparse
$ Rscript head.R -n 3 data.csv gene,sampleA,sampleB,sampleC g1,1,35,20 g2,32,56,99
head.R + argparse
$ Rscript head.R -h usage: head.R [--] [--help] [--opts OPTS] [--top TOP] filename R implementation of GNU coreutils head command positional arguments: filename file to print lines from flags: -h, --help show this help message and exit optional arguments: -x, --opts RDS file containing argument values -n, --top number of lines to print [default: 10]
bwa
and STAR
exploit this inherently parallel structure of read alignment problemqsub
and each job uses 16 cores = \(100 * 16 =\) 1600x speedup!apply
Is Pleasingly Parallelapply
, lapply
, vapply
etc computations are pleasantly parallelapply
iterates over the input collection and executes the function on each of themapply
function can be parallelized in Rparallel
packageparallel
package]mclapply()
, or m
ultic
ore apply
is multicore implementation of apply
mclapply
accepts all of the same arguments as lapply
and a few additional starting with mc.
mc.cores
, which specifies the number of cores to use when executing tasks in parallelmclapply
is not available on Windows, only Mac and linuxmclapply
Examplelibrary(parallel) args <- commandArgs(trailingOnly=TRUE) cores <- as.integer(args[1]) ret <- mclapply( 1:10, function(x) { print(paste(Sys.time(),'process',x)) Sys.sleep(3) }, mc.cores=cores )
mclapply
Example - 1 core$ Rscript mclapply.R 1 [1] "2022-03-23 20:33:09 process 1" [1] "2022-03-23 20:33:12 process 2" [1] "2022-03-23 20:33:15 process 3" [1] "2022-03-23 20:33:18 process 4" [1] "2022-03-23 20:33:21 process 5" [1] "2022-03-23 20:33:24 process 6" [1] "2022-03-23 20:33:27 process 7" [1] "2022-03-23 20:33:30 process 8" [1] "2022-03-23 20:33:33 process 9" [1] "2022-03-23 20:33:36 process 10"
mclapply
Example - 3 cores$ Rscript mclapply.R 3 [1] "2022-03-23 20:29:56 process 1" [1] "2022-03-23 20:29:56 process 2" [1] "2022-03-23 20:29:56 process 3" [1] "2022-03-23 20:29:59 process 4" [1] "2022-03-23 20:29:59 process 5" [1] "2022-03-23 20:29:59 process 6" [1] "2022-03-23 20:30:02 process 7" [1] "2022-03-23 20:30:02 process 8" [1] "2022-03-23 20:30:02 process 9" [1] "2022-03-23 20:30:05 process 10"
mclapply
Example - 10 cores- Rscript mclapply.R 10 [1] "2022-03-23 20:34:59 process 1" [1] "2022-03-23 20:34:59 process 2" [1] "2022-03-23 20:34:59 process 3" [1] "2022-03-23 20:34:59 process 4" [1] "2022-03-23 20:34:59 process 5" [1] "2022-03-23 20:34:59 process 6" [1] "2022-03-23 20:34:59 process 7" [1] "2022-03-23 20:34:59 process 8" [1] "2022-03-23 20:34:59 process 9" [1] "2022-03-23 20:34:59 process 10"
library(ggplot2) ggplot2::diamonds
## # A tibble: 53,940 Ă— 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## # â„ą 53,930 more rows
summary()
function:diamonds <- ggplot2::diamonds summary(diamonds$carat) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 0.20 0.40 0.70 0.80 1.04 5.01 summary(diamonds$cut) #> Fair Good Very Good Premium Ideal #> 1610 4906 12082 13791 21551
class Read(object): def __init__(self, header, seq, qual): self.header = header self.seq = seq self.qual = qual def to_fastq(self): return '@{}\n{}\n+\n{}'.format( self.header, self.seq, self.qual ) read1 = Read("read1", "ACTG", "FFFF") read1.seq # access sequence read1.fastq() # output fastq format
new_Read <- function(header, seq, qual=NULL) { structure( list( header=header, seq=seq, qual=qual ), class="Read" ) } read1 <- new_Read("read1", "ACGT", "FFFF") fastq.Read <- function(x) { UseMethod("fastq") paste0("@",self$header,"\n",self$seq,"+\n",self$qual,"\n") } read1$seq fastq(read1)
Read <- R6Class("Read", list( header = NULL, seq = NULL, qual = NULL, initialize = function(header, seq, qual=NULL) { self$header <- header self$seq <- seq self$qual <- qual }, fastq = function() { invisible(paste0("@",self$header,"\n",self$seq,"+\n",self$qual,"\n")) } } } )) read1 <- Read$new("read1","ACTG","FFFF") read1$seq read1$fastq()
Packages are the fundamental unit of reproducible code in R. Most of the time, we as practitioners simply use packages that have been provided by others. However, sometimes, when we have developed a methodology that others might find useful, it may be advantageous to write our own R packages and share them with the community. Writing an R package involves creating a directory with a specific structure and adding code, documentation, and package metadata in a way that allows it to be distributed.
Hadley Wickham and Jenny Bryan have written an excellent book on how to write R packages. If ever you need to write your own R package, this should be the only place you need to go.
Creates reproducible environments for R projects, similar to conda and git
Attempts to create isolated, portable and reproducible environments in R
renv::install() can install from CRAN, github, bioconductor, etc.
You can specify exact versions of the package you want installed
renv::snapshot() will record any newly installed packages in your environment
Updates the lockfile with the current state of dependencies
The lockfile can be used to restore an environment if needed
renv.lock, .Rprofile, renv/settings.json and renv/activate.R all describe your environment
Check files into a git repository to share with yourself or a collaborator
Renv will be automatically downloaded when the project is opened and it will ask you to restore the environment if it doesn’t already exist