RStudio is a highly effective tool for developing R code, and most analyses we conduct are suitable for running in its interactive environment. However, there are certain circumstances where running R scripts in this interactive manner is not optimal or even suitable.
In these cases, it is necessary to run R scripts outside of RStudio, often on a command line interface (CLI) like that found on Linux clusters and cloud-based virtual machines. While most scripts can be run in an R interpreter on the CLI and produce the same behavior, a few additional steps help convert an R script into a tool that can be run in this way. This process of transforming an R script into a more generally usable tool that is run on the CLI is termed toolification in this text. The following sections describe some strategies for toolifying and running R scripts developed in RStudio on the CLI.
It bears mentioning that RStudio and R are related but independent programs. Specifically, RStudio runs the R program behind its interface using what is called the R interpreter. When you run R on a CLI, you are given an interactive interface where commands written in the R language can be evaluated. In RStudio, the interpreter that is running behind the scenes can be accessed in the Console tab at the bottom right:
The interpreter by itself is of limited use for complete analyses, since most meaningful analyses require many lines of code to be run sequentially as a unit. It is still helpful for testing individual lines of code and examining help documentation for R functions.
The simplest way to run an R script from the command line is to first run an R interpreter and then use the source() function, which accepts a filename for an R script as its first argument and executes all lines of code in the script.
$ cat simple_script.R
print('hello moon')
print(1+2)
print(str(c(1,2,3,4)))
$ R
> source("simple_script.R")
[1] "hello moon"
[1] 3
num [1:4] 1 2 3 4
NULL
However, this method requires some interactivity, namely by running an R interpreter first, so it is not sufficient to run in a non-interactive fashion. We may need to run our script without any interactivity, for example when it is a part of a computational pipeline.
Rscript
The R program also includes the Rscript command that can be run on the command line:
$ Rscript
Usage: /path/to/Rscript [--options] [-e expr [-e expr2 ...] | file] [args]
--options accepted are
--help Print usage and exit
--version Print version and exit
--verbose Print information on progress
--default-packages=list
Where 'list' is a comma-separated set
of package names, or 'NULL'
or options to R, in addition to --no-echo --no-restore, such as
--save Do save workspace at the end of the session
--no-environ Don't read the site and user environment files
--no-site-file Don't read the site-wide Rprofile
--no-init-file Don't read the user R profile
--restore Do restore previously saved objects at startup
--vanilla Combine --no-save, --no-restore, --no-site-file
--no-init-file and --no-environ
'file' may contain spaces but not shell metacharacters
Expressions (one or more '-e <expr>') may be used *instead* of 'file'
See also ?Rscript from within R
This CLI command accepts an R script as an input and executes the commands in the file as if they had been passed to source(), for example:
$ cat simple_script.R
print('hello moon')
print(1+2)
print(str(c(1,2,3,4)))
$ Rscript simple_script.R
[1] "hello moon"
[1] 3
num [1:4] 1 2 3 4
NULL
$
This is the simplest way to toolify an R script: simply run it on the command line with Rscript. Toolifying simple R scripts that do not need to accept inputs to execute on different data generally requires no changes.
Rscript is a convenience front end to R that runs the following R command:
$ Rscript simple_script.R # is equivalent to:
$ R --no-echo --no-restore --file=simple_script.R
If for some reason you cannot run Rscript directly, you can use these arguments with the R command to attain the same result.
commandArgs()
However, sometimes we may wish to control the behavior of a script from the command line, rather than editing the script for every different execution. To pass information into the script when it is run, we can pass arguments with the Rscript command:
$ Rscript simple_script.R abc
[1] "hello moon"
[1] 3
num [1:4] 1 2 3 4
NULL
Although we passed in the argument abc, the output of the script didn’t change because the script wasn’t written to receive it. In order for a script to gain access to the command line arguments, we must call the commandArgs() function:
args <- commandArgs(trailingOnly=TRUE)
Now when we execute the script, the arguments passed in are available in the args variable:
$ cat echo_args.R
print(commandArgs(trailingOnly=TRUE))
$ Rscript echo_args.R abc
[1] "abc"
$ Rscript echo_args.R abc 123
[1] "abc" "123"
$ Rscript echo_args.R # no args
character(0)
In the last case, when no arguments were passed, R is telling us the args variable is a character vector of length zero.
By default, the commandArgs() function will return the full command that was run, including the R executable invoked by Rscript, the options Rscript adds, and any additional arguments:
$ Rscript -e "commandArgs()" abc 123
[1] "/usr/bin/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "-e"
[5] "commandArgs()"
[6] "--args"
[7] "abc"
[8] "123"
The trailingOnly=TRUE argument returns only the arguments provided at the end of the command, after the Rscript portion:
$ Rscript -e "commandArgs(trailingOnly=TRUE)" abc 123
[1] "abc" "123"
Note you can provide individual commands instead of a script to Rscript with the -e argument.
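To make the relationship between the full and trailing argument vectors concrete, the following is a minimal sketch that works on a hard-coded example vector rather than a live session. The vector full_args and the helper names trailing_args() and script_name() are hypothetical, introduced only for illustration; the sketch assumes the layout Rscript produces when given a script file, where the script path appears in a --file= entry and user arguments follow the --args marker.

```r
# Hypothetical example vector, shaped like what commandArgs() returns for
#   Rscript echo_args.R abc 123
full_args <- c("/usr/bin/R", "--no-echo", "--no-restore",
               "--file=echo_args.R", "--args", "abc", "123")

# Everything after the "--args" marker is a user-supplied argument;
# this mirrors what commandArgs(trailingOnly=TRUE) returns.
trailing_args <- function(all_args) {
  marker <- match("--args", all_args)
  if (is.na(marker)) {
    return(character(0))
  }
  all_args[-seq_len(marker)]
}

# The script path is carried in the "--file=" entry, which can be handy
# when building usage messages.
script_name <- function(all_args) {
  file_arg <- grep("^--file=", all_args, value = TRUE)
  if (length(file_arg) == 0) {
    return(NA_character_)
  }
  basename(sub("^--file=", "", file_arg[1]))
}

print(trailing_args(full_args))  # [1] "abc" "123"
print(script_name(full_args))    # [1] "echo_args.R"
```

In a real script, the same helpers could be applied to commandArgs() directly, although trailingOnly=TRUE already covers the most common need.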
The commandArgs() function is all that is needed to toolify an R script. Consider a simple script named inspect_csv.R that loads in any CSV file and summarizes it as a tibble:
args <- commandArgs(trailingOnly=TRUE)
if(length(args) != 1) {
  cat("Usage: inspect_csv.R <csv file>\n") # cat() writes characters to the screen
  cat("Provide exactly one argument that is a CSV filename\n")
  quit(save="no", status=1)
}
fn <- args[1]
library(tidyverse)
read_csv(fn)
We can now run the script with Rscript and give it the filename of a CSV file:
$ cat data.csv
gene,sampleA,sampleB,sampleC
g1,1,35,20
g2,32,56,99
g3,392,583,444
g4,39853,16288,66928
$ Rscript inspect_csv.R data.csv
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.6 ✔ dplyr 1.0.7
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene
dbl (3): sampleA, sampleB, sampleC
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 4
gene sampleA sampleB sampleC
<chr> <dbl> <dbl> <dbl>
1 g1 1 35 20
2 g2 32 56 99
3 g3 392 583 444
4 g4 39853 16288 66928
Note the input handling of the arguments: if exactly one argument was not provided, the usage of the script and a helpful error message are printed before the script quits.
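The same kind of input handling can be extended to catch a missing file before any data are read. The sketch below is one possible arrangement, not the script's actual code: check_csv_args() and report_and_status() are hypothetical helper names, and sending messages to stderr() is an assumption made here so that error text stays out of any output the user redirects to a file.

```r
# Hypothetical helper: returns NULL when the arguments look usable,
# otherwise a character vector of messages to show the user.
check_csv_args <- function(args) {
  usage <- "Usage: inspect_csv.R <csv file>"
  if (length(args) != 1) {
    return(c(usage, "Provide exactly one argument that is a CSV filename"))
  }
  if (!file.exists(args[1])) {
    return(c(usage, paste("File not found:", args[1])))
  }
  NULL
}

# Hypothetical helper: prints any problems and returns an exit status
# (0 for success, 1 for incorrect usage).
report_and_status <- function(args) {
  problems <- check_csv_args(args)
  if (!is.null(problems)) {
    # writing to stderr keeps error text out of redirected output
    writeLines(problems, con = stderr())
    return(1L)
  }
  0L
}
```

In a script, a non-zero status from report_and_status(commandArgs(trailingOnly=TRUE)) would be passed to quit(save="no", status=1), exactly as in inspect_csv.R.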
In the example above, a filename was passed into the script as an argument. A filename is encoded as a character string in this case, and commandArgs() always produces a vector of strings. If we need to pass in arguments to control numerical values, we will need to parse the arguments first. Consider the following R implementation of the Linux command head, which prints only the top \(n\) lines of a file to the screen:
args <- commandArgs(trailingOnly=TRUE)
if(length(args) != 2) {
  cat("Usage: head.R <filename> <N>\n")
  cat("Provide exactly two arguments: a CSV filename and an integer N\n")
  quit(save="no", status=1)
}
# read in arguments
fn <- args[1]
n <- as.integer(args[2])
# the csv output will include a header row, so reduce n by 1
n <- n-1
# suppressMessages() prevents messages like library loading text from being printed to the screen
suppressMessages({
  library(tidyverse, quietly=TRUE)
  read_csv(fn) %>%
    slice_head(n=n) %>%
    write_csv(stdout())
})
We again tested the number of arguments passed in for correct usage, and then assigned the arguments to variables. The n argument is cast from a character string to an integer in the process, enabling its use in the dplyr::slice_head() function. We can print out the first three lines of a file as follows:
$ Rscript head.R data.csv 3
gene,sampleA,sampleB,sampleC
g1,1,35,20
g2,32,56,99
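The cast with as.integer() succeeds only when the argument actually looks like a number; for text such as abc it produces NA with a warning. A minimal sketch of a stricter conversion is shown below. The helper name parse_count() and the rule that N must be at least 1 are assumptions made for illustration, not part of head.R.

```r
# Hypothetical helper: convert a command line string to a positive integer,
# stopping with a readable message when the text is not a usable count.
parse_count <- function(x) {
  # as.integer() yields NA (with a warning) for text such as "abc"
  n <- suppressWarnings(as.integer(x))
  if (is.na(n) || n < 1L) {
    stop("N must be a positive integer, got: ", x, call. = FALSE)
  }
  n
}

print(parse_count("3"))  # [1] 3

# tryCatch() captures the error so this example keeps running
msg <- tryCatch(parse_count("abc"), error = function(e) conditionMessage(e))
print(msg)  # [1] "N must be a positive integer, got: abc"
```

In head.R, such a helper could replace the bare as.integer(args[2]) call, with the error message printed alongside the usage text before quitting.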
Reading command line arguments into variables in a script can become tedious if your script has a large number of arguments. Fortunately, the argparser package can help handle many of the repetitive operations, including specifying arguments, providing default values, automatically casting to appropriate types like numbers, and printing out usage information:
library(argparser, quietly=TRUE)
parser <- arg_parser("R implementation of GNU coreutils head command")
parser <- add_argument(parser, "filename", help="file to print lines from")
parser <- add_argument(parser, "-n", help="number of lines to print", type="numeric", default=10)
parser <- parse_args(parser, c("-n",3,"data.csv"))
print(paste("printing from file:",parser$filename))
## [1] "printing from file: data.csv"
print(paste("printing top n:",parser$n))
## [1] "printing top n: 3"
With this library, we can rewrite our head.R script to be more concise:
library(argparser, quietly=TRUE)
# instantiate parser
parser <- arg_parser("R implementation of GNU coreutils head command")
# add arguments
parser <- add_argument(
  parser,
  arg="filename",
  help="file to print lines from"
)
parser <- add_argument(
  parser,
  arg="--top",
  help="number of lines to print",
  type="numeric",
  default=10,
  short='-n'
)
args <- parse_args(parser)
fn <- args$filename
# the csv output will include a header row, so reduce n by 1
n <- args$top-1
suppressMessages({
  library(tidyverse, quietly=TRUE)
  read_csv(fn) %>%
    slice_head(n=n) %>%
    write_csv(stdout())
})
Note we didn’t have to explicitly convert the top argument to a number because type="numeric" handled this for us. We can print out the top three lines of our file like we did above using the new parser and arguments:
$ Rscript head.R -n 3 data.csv
gene,sampleA,sampleB,sampleC
g1,1,35,20
g2,32,56,99
We can also inspect the usage by passing the -h flag:
$ Rscript head.R -h
usage: head.R [--] [--help] [--opts OPTS] [--top TOP] filename
R implementation of GNU coreutils head command
positional arguments:
filename file to print lines from
flags:
-h, --help show this help message and exit
optional arguments:
-x, --opts RDS file containing argument values
-n, --top number of lines to print [default: 10]
These are the only tools required to toolify our R script.
Note that scripts developed in RStudio can be run on the command line, but scripts written for CLI use with the strategies above cannot be easily run inside RStudio! However, if you followed good practices and implemented your script as a set of functions, you can easily write a CLI wrapper script that calls those functions, thereby enabling you to continue developing your code in RStudio and maintain CLI tool functionality.
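One way to arrange such a split is sketched below, using base R only so that it runs without additional packages. The function names head_lines() and main() are hypothetical, and returning an exit status from main() instead of calling quit() inside it is a design choice assumed here so that the same code can be called safely from an interactive session.

```r
# Analysis code: a plain function that can be developed and tested in RStudio.
# Hypothetical function: return the first n lines of a text file.
head_lines <- function(path, n = 10) {
  lines <- readLines(path)
  lines[seq_len(min(n, length(lines)))]
}

# CLI wrapper: translates command line arguments into a function call and
# returns an exit status (0 for success, 1 for incorrect usage) instead of
# quitting, so it can also be called from an interactive session.
main <- function(args) {
  if (length(args) != 2) {
    writeLines("Usage: head.R <filename> <N>", con = stderr())
    return(1L)
  }
  writeLines(head_lines(args[1], as.integer(args[2])))
  0L
}
```

Under these assumptions, the wrapper script run with Rscript would end with a single line such as quit(save="no", status=main(commandArgs(trailingOnly=TRUE))), while in RStudio the function can be called directly, for example head_lines("data.csv", 3).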