The sequencing of the human genome ushered in the “post-genome” biological revolution
Biology is now a data science
Modern biological data analysis requires skills and knowledge from many domains:
No one person can be expert in all these areas!
Experts create tools and techniques that we can use.
The tidyverse is a set of R packages first developed by Hadley Wickham
Learn R and its related packages to analyze biological data
Communicate results of R analyses with effective visualizations and notebooks
Learn how to use the RStudio development environment
Write correct and reproducible code using formal testing strategies
Share analyses with others using RShiny applications
You are: a practicing biologist wishing to learn how to use R to analyze biological data
We assume a basic working knowledge of:
We endeavor to explain required background whenever possible to relax these assumptions
Weekly lectures
7 assignments, roughly one per week
Final project: RShiny application combining the techniques you learned in the assignments
Grading:
Zoom link will be provided in email / blackboard
Your physical, emotional and mental health
Your family and friends
Policy on absences / missed classes
Detailed instructions are in the book
Assignments hosted on GitHub
We will use GitHub Classroom to make them available
Many tools are required in a complete bioinformatics analysis:
python, R
nextflow/snakemake workflow managers
command line utilities (cat, grep, etc.)
specialized bioinformatics tools (e.g. bowtie, STAR)
bash
Learning which tools are appropriate for which steps in an analysis comes with time!
By default, RStudio preserves your R environment when you shut it down and restores it when you start it again.
This is very bad practice! Turn it off right away!
Open the Tools > Global Options… menu and:
uncheck "Restore .RData into workspace at startup"
set "Save workspace to .RData on exit" to "Never"
R is both a programming language and a computing environment
An R program contains many lines of code that are executed in sequence
An R script is a file that contains lines of R code
Give your R scripts names that are descriptive but concise:
do_it.R
a_script_with_very_cool_analysis_and_plots.R
analyze_gene_expression.R
File -> New File -> R Script
Common workflow:
Run the current line or selection with Control-Enter on Windows or Command-Enter on Mac

# stores the result of 1+1 into a variable named 'a'
a <- 1+1
All research results must be communicated at some point
Results are encoded in text, tables, and plots
Methods are described in text, and (ideally) the code itself
Providing code can make analyses more reproducible
Code notebooks are tools that combine code, results, and text into one place
A markup language annotates and decorates plain text with information about its formatting and structure
Both machine- and human-readable
The same markup text might be converted into different formats, e.g. HTML or PDF
markdown is one such markup language
You can *emphasize* text, or **really emphasize it**.

Lists are pretty easy to read as well:

* item 1
* item 2
* item 3
You can emphasize text, or really emphasize it.
Lists are pretty easy to read as well:
If you need an enumerated list you can do that too:

1. item 1
2. item 2
3. item 3
If you need an enumerated list you can do that too:
You can easily include links to web sites like [Google](http://google.com) and images:

![kitty](https://upload.wikimedia.org/wikipedia/commons/b/bc/Juvenile_Ragdoll.jpg)

You can easily include links to web sites like Google and images:
+------+------+-----+-------+
| some | data | and | stuff |
+======+======+=====+=======+
| A    | 1    | 2   | 3     |
+------+------+-----+-------+
| B    | 4    | 5   | 6     |
+------+------+-----+-------+
| some | data | and | stuff |
|------|------|-----|-------|
| A    | 1    | 2   | 3     |
| B    | 4    | 5   | 6     |
RMarkdown files have the extension .Rmd
Lines starting with ```{r} define a special code block
Example:

```{r}
a <- 1
```
R code in the block will be executed
Code output placed below the block in the report
RStudio shows you the output of the notebook within its interface
Code you write changes over time
We would like to maintain previous versions of code, in case new changes break the code
Simple approach: make copies of scripts and rename them:
my_R_script.R
my_R_script_v2.R
my_R_script_v2_BAD.R
my_R_script_v2_final.R
my_R_script_v2_final_revision.R
A bug is a piece of code that produces an incorrect or undesirable result
Bugs are normal! You’re not a bad programmer if your code has bugs.
There are two types of bugs:
git concepts
A repo is created by running git init in a new directory
git workflow
At the beginning, create a new repo with git init . or with the graphical interface
Add changed files with git add <filename>...
Run git status, verify changes are as intended
Commit with git commit -m "<commit message>"
git + GitHub
The git software only works on your local computer with local repositories
First you must have an account on GitHub
Create a new repo on GitHub
Then: clone your GitHub repo to create a local copy connected to GitHub; the copy on GitHub is called a remote
Now your local repo is connected to the same repo on GitHub
Changes you make to your local files can be sent, or pushed, to the repo on GitHub
git + GitHub workflow
Pull the latest changes from GitHub with git pull
git add <filename>...
Run git status, verify changes are as intended
git commit -m "<commit message>"
Send your commits to GitHub with git push
R (like all programming languages) is basically a fancy calculator:
1 + 2 # addition
[1] 3
3 - 2 # subtraction
[1] 1
4 * 2 # multiplication
[1] 8
4 / 2 # division
[1] 2
The [1] lines above are the output given by R when the preceding expression is run
Any portion of a line starting with a # is a comment and is ignored by R
1.234 + 2.345 - 3.5*4.9 # numbers can have decimals
[1] -13.571
1.234 + (2.345 - 3.5)*4.9 # expressions can contain parentheses
[1] -4.4255
2**2 # exponentiation
[1] 4
4**(1/2) # square root
[1] 2
27**(1/3) # cube root
[1] 3
R assigns values to symbolic placeholders called variables
The result of an expression can be assigned to a named variable using the <- operator:
new_var <- 1 + 2
Variable values are used in later execution:
new_var - 2
[1] 1
another_var <- new_var * 4
“<-” vs “=”
The correct way to assign a value to a variable in R is with the <- syntax
Many other programming languages use =
The = assignment syntax does work in R:
new_var = 2 # works, but is not common convention!
BUT this is considered bad practice and may cause confusion later
You should always use the <- syntax when assigning values to variables!
. has no special meaning in R
. does not have a special meaning like it does in many other languages like python, C, javascript, etc.
new.var is a valid variable name just like new_var
Still, avoid . characters in your variable names to reduce the chances of conflicts and confusion

Values in R have a type:
logical (TRUE/FALSE)
numeric, e.g. 1.0 or 1e-5 for \(10^{-5}\), including both decimal numbers (1.123) and integer numbers (1)
TRUE/FALSE - Logical or boolean values; TRUE is stored as the number 1 and FALSE is stored as the number 0
Inf/-Inf - Infinity - special type that indicates division by 0
NaN - "impossible" value, e.g. for the expression 0/0
complex - complex numbers, which can be created with the complex function
NA - a special value that indicates a value is missing
NULL - similar to NA, NULL indicates that a vector, rather than a value, is missing

NA, NaN, and Inf
R cannot perform computations on NA, NaN, or Inf values
These values have an ‘infectious’ quality to them
When mixed in with other values, the result of the computation reverts to the first of these values encountered:
# this is how to create a vector of 4 values in R
x <- c(1,2,3,NA)
mean(x) # compute the mean of values that includes NA
[1] NA
mean(x, na.rm=TRUE) # remove NA values prior to computing mean
[1] 2
mean(c(1,2,3,NaN))
[1] NaN
mean(c(NA,NaN,1))
[1] NA

character - e.g. "value". R can store character data in the form of strings. Note R does not interpret string values by default, so "1" and 1 are distinct.
Data structures in R (and other languages) are ways of storing and organizing more than one value together
Most basic data structure in R is a one dimensional sequence of values called a vector:
# the c() function creates a vector
x <- c(1,2,3)
x
[1] 1 2 3
The vector in R has a special property that all values contained in the vector must have the same type
When constructing a vector, R will coerce values to the most general type if it encounters values of different types:
c(1,2,"3")
[1] "1" "2" "3"
c(1,2,TRUE,FALSE)
[1] 1 2 1 0
c(1,2,NA) # note missing values stay missing
[1] 1 2 NA
c("1",2,NA,NaN) # NA stays, NaN is cast to a character type
[1] "1" "2" NA "NaN"
vectors also have a length, which is defined as the number of elements in the vector
x <- c(1,2,3)
length(x)
[1] 3
R is much more efficient at operating on vectors than individual elements separately
With numeric vectors, you can perform arithmetic operations on vectors of compatible size just as easily as individual values
c(1,2) + c(3,4)
[1] 4 6
c(1,2) * c(3,4)
[1] 3 8
R multiplies vectors of different length in a strange way:
c(1,2) * c(3,4,5) # lengths 2 and 3 not evenly divisible!
[1] 3 8 5
Warning message:
In c(1, 2) * c(3, 4, 5) :
  longer object length is not a multiple of shorter object length
Cycles through values in each vector until all values are used
Above, c(1,2) * c(3,4,5)
yields: \(1*3\), \(2*4\), \(1*5\), weird!
When the vector lengths are evenly divisible, no warning raised:
c(1,2) * c(3,4,5,6) # yields: 1*3 2*4 1*5 2*6
[1] 3 8 5 12
Factors are objects that R uses to handle categorical variables
We create a factor from a vector of character strings using the factor() function
case_status <- factor(
  c('Disease','Disease','Disease',
    'Control','Control','Control')
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Control Disease
The distinct values in the factor are called levels
Internally, a factor is stored as a vector of integers where each level has the same value:
as.numeric(case_status)
[1] 2 2 2 1 1 1
str(case_status)
Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
By default, R assigns integers to levels in alphanumeric order
"Control" is set to 1, "Disease" is set to 2
You can change which levels are assigned to which number
The integers assigned to each level can be specified explicitly when creating the factor:
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Disease Control
str(case_status)
Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
The base R functions read.csv()/read.table() load columns with character values as factors by default (in R versions prior to 4.0)
You may turn this off by passing stringsAsFactors=FALSE to read.csv()
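A minimal sketch of this behavior; the file metadata.csv and its condition column here are hypothetical:

```r
# metadata.csv is a hypothetical CSV file with a character column 'condition'
metadata <- read.csv("metadata.csv", stringsAsFactors=FALSE)
# with stringsAsFactors=FALSE the column stays character instead of factor
str(metadata$condition)
```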
You may change the level mapping of an existing factor (i.e. relevel it) by passing the factor into the factor() function and specifying new levels:
str(case_status)
Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
str(factor(case_status, levels=c("Control","Disease")))
Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
# create a matrix with two rows and three columns containing integers
A <- matrix(
  c(1,2,3,4,5,6),
  nrow = 2, ncol = 3, byrow=TRUE
)
A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
dim(A) # prints out the dimensions of the matrix, rows first
[1] 2 3
Matrices can be transposed from \(m \times n\) to be \(n \times m\) using the t() function
A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
t(A)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
dim(t(A))
[1] 3 2
Lists can be created using the list() function:
my_list <- list(
  c(1,2,3),
  c("A","B","C","D")
)
my_list
[[1]]
[1] 1 2 3

[[2]]
[1] "A" "B" "C" "D"

my_list[[1]] # access the first item of the list
[1] 1 2 3
my_list[[2]] # access the second item of the list
[1] "A" "B" "C" "D"
Lists can also be defined and indexed by name:
my_list <- list(
  numbers=c(1,2,3),
  categories=c("A","B","C","D")
)
my_list
$numbers
[1] 1 2 3

$categories
[1] "A" "B" "C" "D"

my_list$numbers # access the first item of the list
[1] 1 2 3
my_list$categories # access the second item of the list
[1] "A" "B" "C" "D"
You may create a data frame with data.frame():
my_df <- data.frame( # recall '.' has no special meaning in R
  numbers=c(1,2,3),
  categories=c("A","B","C","D")
)
Error in data.frame(c(1, 2, 3), c("A", "B", "C", "D")) :
  arguments imply differing number of rows: 3, 4
my_df <- data.frame(
  numbers=c(1,2,3),
  categories=c("A","B","C")
)
my_df
  numbers categories
1       1          A
2       2          B
3       3          C
Names can be assigned to the elements of a vector with the names() function:

x <- c(1,2,3)
names(x) # vectors have no names by default
NULL
names(x) <- c("a","b","c")
x
a b c
1 2 3
names(x)
[1] "a" "b" "c"
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
names(l)
[1] "a" "b" "c" "d"
names(l) <- c("f1", "f2", "f3", "f4")
l
$f1
[1] 3.5

$f2
[1] 0.4

$f3
[1] 9.1 5.8

$f4
[1] 7.7
Matrices have both rownames and colnames:

m <- matrix(1:9, nrow=3, byrow=T)
colnames(m) <- c("c1", "c2", "c3")
rownames(m) <- c("r1", "r2", "r3")
m
   c1 c2 c3
r1  1  2  3
r2  4  5  6
r3  7  8  9
m["r1",] # get row 1
c1 c2 c3
 1  2  3
m[, "c2"] # get column 2
r1 r2 r3
 2  5  8
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[4]
[1] 7.7
x[0] # this always returns an empty vector
numeric(0)
x[5] # accessing indices larger than the vector length returns NA
[1] NA
The equivalent python code; note that python lists are indexed from zero:

x = [3.5, 0.4, 9.1, 7.7]
print(x[0])
3.5
print(x[3])
7.7
print(x[4]) # accessing indices larger than the list length raises error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
There are six different ways to subset vectors in R. The following three are the most common: positive integers, negative integers, and logical (TRUE/FALSE) vectors.
Subsetting with positive integers:
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[c(1,3)] # can subset using a vector of integers
[1] 3.5 9.1
x[c(1,1)] # can select the same element multiple times
[1] 3.5 3.5
Subsetting with negative integers excludes the corresponding elements:

x <- c(3.5, 0.4, 9.1, 7.7)
x[-1] # return all but the first element
[1] 0.4 9.1 7.7
x[-c(2,4)] # return the first and third element
[1] 3.5 9.1
Subsetting with a logical (TRUE/FALSE) vector keeps the elements where the value is TRUE:

x <- c(3.5, 0.4, 9.1, 7.7)
x[c(TRUE,FALSE,FALSE,TRUE)]
[1] 3.5 7.7
x[c(FALSE,TRUE,TRUE,FALSE)]
[1] 0.4 9.1
x > 3
[1]  TRUE FALSE  TRUE  TRUE
x[x > 3]
[1] 3.5 9.1 7.7
Matrices are subset with x[<row selectors>, <column selectors>], where the selectors may be any of the methods used to subset a vector:

x <- matrix(1:9, nrow=3, byrow=TRUE)
x
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
x[c(1,2), c(1,2)] # first two rows, first two columns
     [,1] [,2]
[1,]    1    2
[2,]    4    5
x[-c(1,3), ] # leaving a selector blank selects all: second row, all columns
[1] 4 5 6
x[, c(2,3)] # all rows included, last two columns
     [,1] [,2]
[1,]    2    3
[2,]    5    6
[3,]    8    9
x[-c(1,3), c(2)] # select the second row and second column
[1] 5
l <- list(3.5, 0.4, c(9.1, 5.8), 7.7)
l[1]
[[1]]
[1] 3.5

l[c(2,3)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8
l[c(FALSE, TRUE, TRUE, FALSE)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8
Note: when indexing a list with [
, the result returned is always another list; we will discuss this more later.
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
l["a"]
$a
[1] 3.5

l[c("a","c")]
$a
[1] 3.5

$c
[1] 9.1 5.8

l[c("b")]
$b
[1] 0.4
[[ and $
The [[ and $ operators are used to access single items of data structures: they extract an individual value of a list or data frame and return only that value. Use [[ when indexing by integer or name:

l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
l[1] # returns a list with a single value of 3.5
[[1]]
[1] 3.5

l[[1]] # returns 3.5
[1] 3.5
l[[3]] # returns c(9.1, 5.8)
[1] 9.1 5.8
Indexing by name with [[ and $:

l["a"] # returns list(a=3.5)
$a
[1] 3.5

l$a # returns 3.5
[1] 3.5
l[["a"]] # also returns 3.5
[1] 3.5
R recognizes logical values as a distinct type
R provides all the conventional infix logical operators:
1 == 1 # equality
[1] TRUE
1 != 1 # inequality
[1] FALSE
1 < 2 # less than
[1] TRUE
1 > 2 # greater than
[1] FALSE
1 <= 2 # less than or equal to
[1] TRUE
1 >= 2 # greater than or equal to
[1] FALSE
Logical tests are applied to each element of vectors:
x <- c(1,2,3)
x == 2
[1] FALSE  TRUE FALSE
x < 1
[1] FALSE FALSE FALSE
c(1,2) == c(1,3)
[1]  TRUE FALSE
c(1,2) != c(1,3)
[1] FALSE  TRUE
c(1,2) == c(1,2,3)
[1]  TRUE  TRUE FALSE
Warning message:
In c(1, 2) == c(1, 2, 3) :
  longer object length is not a multiple of shorter object length
The all() function returns a single boolean value that is TRUE only when all of its inputs are TRUE:
x <- c(1,2,3)
all(x == 2)
[1] FALSE
all(x > 0)
[1] TRUE
R provides many functions of the form is.X, where X is some type or condition
is.numeric(1) # is the argument numeric?
[1] TRUE
is.character(1) # is the argument a string?
[1] FALSE
is.character("ABC")
[1] TRUE
is.numeric(c(1,2,3)) # recall a vector has exactly one type
[1] TRUE
is.numeric(c(1,2,"3"))
[1] FALSE
is.na(c(1,2,NA))
[1] FALSE FALSE  TRUE
Functions usually accept and execute on different inputs
e.g. the mean function wouldn't be very useful if it didn't accept a value
mean(c(1,2,3))
[1] 2
The function must accept or allow you to pass it arguments
# a basic function signature
function_name(arg1, arg2) # function accepts 2 arguments
arg1 and arg2 are arguments indicating this function accepts two arguments
function_name is the name of the function
arg1 and arg2 are positional arguments (i.e. order matters)
Functions will raise an error if they don't receive arguments they expect:
mean() # compute the arithmetic mean, but of what?
Error in mean.default() : argument "x" is missing, with no default
The specific set of arguments a function accepts is called the function signature
You can find details about a function's signature in its documentation, as described next
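For example, you can open the documentation page containing a function's signature directly from the R console:

```r
?mean       # open the help page for mean
help(mean)  # equivalent to ?mean
```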
These are the function signatures for the mean() function:
mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
This function has two signatures
The names of the arguments in the signature (e.g. x) are the variable names the function uses in its code to refer to the parameters you pass
The named arguments in a function signature are called parameters or formal arguments
mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
The mean function signature introduces two new types of syntax:
trim = 0 - named arguments that have a default value if not provided explicitly
... - this means the mean function can accept arguments that are not explicitly listed in the signature; this syntax is called dynamic dots
All function arguments can be specified by name:
# generate 100 normally distributed samples with mean 0 and
# standard deviation 1
my_vals <- rnorm(100, mean=0, sd=1)
mean(my_vals)
[1] -0.05826857
mean(x=my_vals)
[1] -0.05826857
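As a small illustration of the trim default argument from the signature above, passing a non-default value by name changes the result (expected values shown as comments):

```r
x <- c(1, 2, 3, 100)
mean(x)            # 26.5 - trim defaults to 0, so all values are used
mean(x, trim=0.25) # 2.5  - the lowest and highest 25% of values are dropped first
```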
To borrow from the Zen of Python:
“Explicit is better than implicit.”
Passing arguments with their names can help avoid bugs!
... dynamic dots
The ... argument catchall can be very dangerous.
It allows you to provide arguments to a function that have no meaning…
and R will not raise an error:
# generates 100 normally distributed samples with mean 0 and
# standard deviation 1
my_vals <- rnorm(100, mean=0, sd=1)
mean(x=my_vals, tirm=0.1)
[1] -0.05826857
Did you spot the mistake?
... dynamic dots

# generates 100 normally distributed samples with mean 0 and
# standard deviation 1
my_vals <- rnorm(100, mean=0, sd=1)
mean(x=my_vals, tirm=0.1)
[1] -0.05826857
mean(x=my_vals, trim=0.1)
[1] -0.02139839
Don’t Repeat Yourself (DRY) principle in software engineering
You may find yourself writing the same code multiple times
DRY == Encapsulate common code into your own functions
# 100 normally distributed samples, mean=20, stdev=10
my_vals <- rnorm(100, mean=20, sd=10)
# standardize
my_vals_norm <- (my_vals - mean(my_vals))/sd(my_vals)
mean(my_vals_norm)
[1] 0
sd(my_vals_norm)
[1] 1
If you find yourself copying and pasting code from one part of your script to another, you are repeating yourself!
R allows you to define your own functions
Example that adds two arguments together:
sum_args <- function(arg1, arg2) {
  # code that does something with arg1, arg2, etc
  some_result <- arg1 + arg2
  return(some_result) # explicit return
}

sum_args <- function(arg1, arg2) {
  # code that does something with arg1, arg2, etc
  arg1 + arg2 # implicit return is last expression in function
}
You choose function name, arguments, implementation
Example for a standardize() function:
standardize <- function(x) {
  res <- (x - mean(x))/sd(x)
  return(res)
}
my_vals <- rnorm(100, mean=20, sd=10)
my_vals_std <- standardize(my_vals)
c(mean(my_vals_std), sd(my_vals_std))
[1] 0 1
my_other_vals <- rnorm(100, mean=40, sd=5)
my_other_vals_std <- standardize(my_other_vals)
c(mean(my_other_vals_std), sd(my_other_vals_std))
[1] 0 1
The return() function is not strictly necessary in R
standardize <- function(x) {
  res <- (x - mean(x))/sd(x)
  return(res)
}
The result of the last line of code executed in the body of a function is returned by default
However, to borrow again from the Zen of Python:
“Explicit is better than implicit.”
Being explicit about what a function returns by using the return() function will make your code less error prone and easier to understand
x <- 3
multiply_x_by_two <- function() {
  # x is defined outside the function
  # but is inside this function's scope
  y <- x*2
  return(y)
}
x and multiply_x_by_two have universal scope
y's scope is limited to within the multiply_x_by_two function
The multiply_x_by_two function accesses x from the outer scope:
x <- 3
multiply_x_by_two <- function() {
  # x is defined outside the function
  # but is inside this function's scope
  y <- x*2
  return(y)
}
In general, accessing variables within functions from outside the function’s scope is very bad practice!
Functions should be as self-contained as possible; any values they need should be passed as parameters
This is better:
x <- 3
multiply_by_two <- function(x) {
  # x is now bound to the formal argument to the function
  # not x in the outer scope
  y <- x*2
  y
}
You may be used to writing iteration with for and while loops in python/Java/C
In R, iteration is typically done with the apply() family of functions instead
Note that R does have for and while loop support in the language
However, these loop structures can have poor performance
Preference should generally be given to the functional style of iteration
If you really, really want to learn how to use for loops in R, read this, but don’t say I didn’t warn you when your code slows to a crawl for unknown reasons:
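For reference, a minimal sketch of an explicit for loop in R next to the vectorized form preferred in this course:

```r
x <- c(1, 2, 3, 4, 5)

# explicit for loop: works, but is not idiomatic R
result <- numeric(length(x))  # pre-allocate a numeric vector of the same length
for (i in seq_along(x)) {
  result[i] <- x[i] + 3
}

# vectorized equivalent
result <- x + 3
```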
R knows how to perform many operations on vectors and matrices as well as individual values
x <- c(1,2,3,4,5)
x + 3 # add 3 to every element of vector x
[1] 4 5 6 7 8
Equivalent python:
x = [1, 2, 3, 4, 5]

# list comprehension
new_x = [x_i+3 for x_i in x]

# for loop
new_x = []
for x_i in x:
    new_x.append(x_i+3)
Matrices also support element-wise operations
x_mat <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)
x_mat + 3 # add 3 to every element of matrix x_mat
     [,1] [,2] [,3]
[1,]    4    6    8
[2,]    5    7    9
# the * operator always means element-wise
x_mat * x_mat
     [,1] [,2] [,3]
[1,]    1    9   25
[2,]    4   16   36
R also has syntax for vector-vector, matrix-vector, and matrix-matrix operations
# the %*% operator stands for matrix multiplication
x_mat %*% c(1,2,3) # [ 2x3 ] * [ 3 ]
     [,1]
[1,]   22
[2,]   28
# recall t() is the transpose function, making [ 2x3 ] * [ 3x2 ]
x_mat %*% t(x_mat) # matrix product
     [,1] [,2]
[1,]   35   44
[2,]   44   56
R is optimized for vectorized computation; if you can cast your iteration as a vector or matrix operation, it is a good idea to do so.
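As a rough sketch of why this matters, you can time an element-by-element loop against the built-in vectorized sum() with system.time(); exact timings vary by machine, but the vectorized version is typically much faster:

```r
x <- rnorm(1e6)

# element-by-element loop
system.time({
  total <- 0
  for (x_i in x) {
    total <- total + x_i
  }
})

# vectorized computation
system.time(sum(x))
```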
We can generalize standardization by replacing the mean and standard deviation with a central value function \(t_r\) and a scale function \(s\):

\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]
There are many different ways to define the central value \(t_r(\mathbf{x})\) of a set of numbers:
Similarly for scaling strategies \(s(\mathbf{x})\):
Consider our standardization function from earlier:
standardize <- function(x) {
  res <- (x - mean(x))/sd(x)
  return(res)
}
We have hard coded mean() and sd() as our central tendency and scale functions
We can pass these functions as parameters instead:
# note R already has a built in function named "transform"
my_transform <- function(x, t_r, s) {
  return((x - t_r(x))/s(x))
}
With the my_transform function:
# note R already has a built in function named "transform"
my_transform <- function(x, t_r, s) {
  return((x - t_r(x))/s(x))
}
We can perform Z-score normalization by passing mean and sd as t_r and s, respectively:
x <- rnorm(100, mean=20, sd=10)
x_zscore <- my_transform(x, mean, sd) # functions passed as arguments
mean(x_zscore)
[1] 0
sd(x_zscore)
[1] 1
We can use any functions for t_r and s so long as they accept a vector of numbers as their first argument:
x <- rnorm(100, mean=20, sd=10)
x_transform <- my_transform(x, median, sum)
median(x_transform)
[1] 0
# this quantity does not have an a priori known value
# (or meaning for that matter, it's just an example)
sum(x_transform)
[1] 0.013
We can also write our own functions and pass them to my_transform
The following scales the values of x to have a range of \([0,1]\):
data_range <- function(x) {
  return(max(x) - min(x))
}
# my_transform computes: (x - min(x))/(max(x) - min(x))
x_rescaled <- my_transform(x, min, data_range)
min(x_rescaled)
[1] 0
max(x_rescaled)
[1] 1
apply() and friends
There are several apply()-related functions you should use:
lapply(X, FUN) - for when you want a list returned
vapply(X, FUN, FUN.VALUE) - for when you want a vector returned
apply(X, MARGIN, FUN) - for when X is 2 dimensional (e.g. a matrix)
The sapply() function is also available; it simplifies the result, i.e. it "guesses" what type of output you want
Avoid sapply()!

vapply() for vectors
The vapply() function is used when you want a vector returned, with the following signature:

vapply(X, FUN, FUN.VALUE, ...)
X is a one-dimensional collection of items (i.e. a list or vector)
FUN is the name of a function that can accept any of the items in the list
FUN.VALUE is an example value to indicate the type of the returned vector (recall all items in a vector must have the same type)
vapply() returns a new vector of type typeof(FUN.VALUE), the same length as X, where each item has had the function FUN applied to it
vapply() example
Consider the vectorized addition from above:
x <- c(1,2,3,4,5)
x + 3
[1] 4 5 6 7 8
We can do the equivalent operation with vapply():
x <- c(1,2,3,4,5)
add3 <- function(i) {
  i + 3
}
# below the 0 means we want a numeric vector back
vapply(x, add3, 0)
[1] 4 5 6 7 8
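FUN.VALUE can be any single-element template of the expected return type; a small sketch using integer(1) with the base nchar() function, which returns a single integer per element:

```r
words <- c("a", "bb", "ccc")
# nchar() returns an integer, so the template is integer(1)
vapply(words, nchar, integer(1))
# returns a named integer vector: a=1, bb=2, ccc=3
```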
Recall the Z-score transformation defined earlier:
standardize <- function(x) {
  res <- (x - mean(x))/sd(x)
  return(res)
}
This function operates on a single vector
We sometimes want to transform each row or column of a matrix separately
The apply() function allows a function to be applied to either rows or columns of a 2 dimensional data structure like a matrix or data frame
The apply() function
This is the signature of the apply function, from the RStudio help(apply) page:

apply(X, MARGIN, FUN, ..., simplify = TRUE)
X is a matrix or data frame (i.e. a rectangle of numbers)
MARGIN indicates whether the function should be applied on rows (MARGIN=1) or columns (MARGIN=2)
FUN is the name of a function that accepts a vector and returns either a vector or a scalar value
apply() then executes FUN on each row or column of X and returns the result
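As a quick sketch of the MARGIN argument before the larger example below (expected results shown as comments):

```r
m <- matrix(1:6, nrow=2, byrow=TRUE)
apply(m, 1, sum) # MARGIN=1: sum each row, returns c(6, 15)
apply(m, 2, sum) # MARGIN=2: sum each column, returns c(5, 7, 9)
```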
apply() example

zscore <- function(x) {
  return((x-mean(x))/sd(x))
}
# construct matrix of 50x100 normally distributed samples
x_mat <- matrix(
  rnorm(100*50, mean=20, sd=5),
  nrow=50,
  ncol=100
)
# z-transform the columns so that each column has mean, sd of 0, 1
x_mat_zscore <- apply(x_mat, 2, zscore)
# check columns of x_mat_zscore have mean close to zero with apply
x_mat_zscore_means <- apply(x_mat_zscore, 2, mean)
# note: due to machine precision errors, these results will not be
# exactly zero, but are very close
# note: the all() function returns true if all elements are TRUE
all(x_mat_zscore_means < 1e-15)
[1] TRUE
The lapply() function
lapply() iterates over the elements of X and returns a list with the result:

lapply(X, FUN, ...)
Example:
x <- list(
  feature1=rnorm(100, mean=20, sd=10),
  feature2=rnorm(100, mean=50, sd=5)
)
x_zscore <- lapply(x, zscore)
# check that the means are close to zero
x_zscore_means <- lapply(x_zscore, mean)
all(x_zscore_means < 1e-15)
[1] TRUE
The most basic debugging loop is:
The str() function can be helpful when in an interpreter and not in RStudio
Consider the code:
calcVal <- function(x, a, arg=2) { return(sum(x*a)**2)}
calc_val_2 <- function(x, a, b, arg) {
res <- sum(b+a*x)**arg
return(res)}
This code is inconsistent in several ways:
calcVal uses camel case, calc_val_2 uses snake case
calcVal is all on one line, calc_val_2 is on multiple lines
The calc_val_2 function body is not indented, and the closing curly brace is appended to the last line
The arg argument in calcVal isn't used anywhere in the function
A more consistent version of this code might look like:
exponent_product <- function(x, a, offset=0, arg=2) {
  return(sum(offset+a*x)**arg)
}
This code is much cleaner, more consistent, and easier to read.
x <- x + 1
# add 1 as a pseudocount
x <- x + 1