R Syntax Basics

  • R (like all programming languages) is basically a fancy calculator:

    1 + 2 # addition
    [1] 3
    3 - 2 # subtraction
    [1] 1
    4 * 2 # multiplication
    [1] 8
    4 / 2 # division
    [1] 2
  • The [1] lines above are the output given by R when the preceding expression is run

  • Any portion of a line starting with a # is a comment and ignored by R

R Arithmetic Continued

1.234 + 2.345 - 3.5*4.9 # numbers can have decimals
[1] -13.571
1.234 + (2.345 - 3.5)*4.9 # expressions can contain parentheses
[1] -4.4255
2**2 # exponentiation
[1] 4
4**(1/2) # square root
[1] 2
9**(1/3) # cube root
[1] 3

Reading R

  • R assigns values to symbolic placeholders called variables

  • Expressions can be assigned into a variable with a name using the <- operator:

    new_var <- 1 + 2
  • Variables values are used in later execution:

    new_var - 2
    [1] 1
    another_var <- new_var * 4

Note: “<-” vs “=

  • The correct way to assign a value to a variable in R is with the <- syntax

  • Many other programming languages which use =

  • = assignment syntax does work in R:

    new_var = 2 # works, but is not common convention!
  • BUT this is considered bad practice and may cause confusion later

  • You should always use the <- syntax when assigning values to variables!

Note: . has no special meaning in R

  • the period . does not have a special meaning like it does in many other languages like python, C, javascript, etc.
  • e.g. new.var is a valid variable name just like new_var
  • It is good practice to avoid using . characters in your variable names to reduce the chances of conflicts and confusion

Basic Types of Values

  • The type of a variable refers to the kind of value it holds, e.g.
    • Number
    • Characters (string)
    • Logical (TRUE/FALSE)

Basic Types: Numeric

  • A single number e.g. 1.0 or 1e-5 for \(10^{-5}\)
  • No distinction between fractional (e.g. 1.123) and integer numbers (1)
  • TRUE/FALSE - Logical or boolean values
    • TRUE stored as the number 1
    • FALSE stored as the number 0
    • Non-zero numbers considered “true” in R, zero considered “false”
  • Inf/-Inf - Infinity - special type that indicates division by 0
  • NaN - “impossible” value for the expression 0/0
  • complex numbers - R can store complex numbers using the complex function

NA, NaN, and Inf

  • R cannot perform computations on NA, NaN, or Inf values

  • These values have an ‘infectious’ quality to them

  • When mixed in with other values, the result of the computation reverts to the first of these values encountered:

    # this how to create a vector of 4 values in R
    x <- c(1,2,3,NA)
    mean(x) # compute the mean of values that includes NA
    [1] NA
    mean(x,na.rm=TRUE) # remove NA values prior to computing mean
    [1] 2
    mean(c(1,2,3,NaN))
    [1] NaN
    mean(c(NA,NaN,1))
    [1] NA

Missingness

  • R can handle missing values
  • NA - a special value that indicates a value is missing
  • NULL. Similar to NA, NULL indicates that a vector, rather than a value, is missing
  • More on vectors later

Other Types

  • factors - Factors are a complex type used in statistical models and are covered in greater detail later
  • character data - "value". R can store character data in the form of strings. Note R does not interpret string values by default, so "1" and 1 are distinct.
  • dates and times - basic type to store dates and times (together termed a datetime, which includes both components
    • Internally, R stores datetimes as the fractional number of days since January 1, 1970, using negative numbers for earlier dates.

Data Structures

Vectors

  • Data structures in R (and other languages) are ways of storing and organizing more than one value together

  • Most basic data structure in R is a one dimensional sequence of values called a vector:

    # the c() function creates a vector
    x <- c(1,2,3)
    [1] 1 2 3
  • The vector in R has a special property that all values contained in the vector must have the same type

Vectors continued

  • When constructing a vector, R will coerce values to the most general type if it encounters values of different types:

    c(1,2,"3")
    [1] "1" "2" "3"
    c(1,2,TRUE,FALSE)
    [1] 1 2 1 0
    c(1,2,NA) # note missing values stay missing
    [1] 1 2 NA
    c("1",2,NA,NaN) # NA stays, NaN is cast to a character type
    [1] "1" "2" NA "NaN"

Vectors continued

  • vectors also have a length, which is defined as the number of elements in the vector

    x <- c(1,2,3)
    length(x)
    [1] 3

Vector operations

  • R is much more efficient at operating on vectors than individual elements separately

  • With numeric vectors, you can perform arithmetic operations on vectors of compatible size just as easily as individual values

    c(1,2) + c(3,4)
    [1] 4 6
    c(1,2) * c(3,4)
    [1] 3 8

Vector arithmetic warning!

  • R multiplies vectors of different length in a strange way:

    c(1,2) * c(3,4,5) # lengths 2 and 3 not evenly divisible!
    [1] 3 8 5
    Warning message:
    In c(1, 2) * c(3, 4, 5) :
      longer object length is not a multiple of shorter object length
  • Cycles through values in each vector until all values are used

  • Above, c(1,2) * c(3,4,5) yields: \(1*3\), \(2*4\), \(1*5\), weird!

  • When the vector lengths are evenly divisible, no warning raised:

    c(1,2) * c(3,4,5,6) # yields: 1*3 2*4 1*5 2*6
    [1] 3 8 5 12

Factors

  • Factors are objects that R uses to handle categorical variables

    • i.e. variables that can take one of a distinct set of values for each sample
  • We create a factor from a vector of character strings using the factor() function

    case_status <- factor(
      c('Disease','Disease','Disease',
        'Control','Control','Control'
      )
    )
    case_status
    [1] Disease Disease Disease Control Control Control
    Levels: Control Disease

Factors are numeric vectors

  • The distinct values in the factor are called levels

  • Internally, a factor is stored as a vector of integers where each level has the same value:

    as.numeric(case_status)
    [1] 2 2 2 1 1 1
    str(case_status)
     Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
  • By default, R assigns integers to levels in alphanumeric order

    • e.g. "Control" is set to 1, "Disease" is set to 2

Changing Factor Level Numbers

  • You can change which levels are assigned to which number

  • The integers assigned to each level can be specified explicitly when creating the factor:

    case_status <- factor(
      c('Disease','Disease','Disease','Control','Control','Control'),
      levels=c('Disease','Control')
    )
    case_status
    [1] Disease Disease Disease Control Control Control
    Levels: Control Disease
    str(case_status)
     Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2

Character data are loaded as factors

  • The base R functions read.csv/read.table load columns with character values as factors by default

  • You may turn this off with stringsAsFactors=FALSE to read.csv()

  • You may change the level mapping of an existing factor by releveling it by passing a factor into the factor() function and specifying new levels

    str(case_status)
     Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
    factor(case_status, levels=c("Control","Disease"))
     Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1

Matrices

  • A matrix in R is simply the 2 dimensional version of a vector
  • i.e. a rectangle of values that all have the same type, e.g. number, character, logical, etc.
  • Constructed using the vector notation described above and specifying the number of rows and columns the matrix should have
  • Instead of having a length like a vector, it has \(m \times n\) dimensions

Matrix construction example

# create a matrix with two rows and three columns containing integers
A <- matrix(c(1,2,3,4,5,6)
       nrow = 2, ncol = 3, byrow=1
      )
A
[,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
dim(A) # prints out the dimensions of the matrix, rows first
[1] 2 3

Transposing matrices

  • Matrices can be transposed from \(m \times n\) to be \(n \times m\) using the t() function

    A
    [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    t(A)
    [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    dim(t(A))
    [1] 3 2

Lists and data frames

  • Vectors and matrices have the special property that all items must be of the same type
  • Lists and data frames are data structures that do not have this requirement
  • Lists and data frames are both one dimensional sequences of values

Lists

  • Lists can be created using the list() function:

    my_list <- list(
      c(1,2,3),
      c("A","B","C","D")
    )
    my_list
    [[1]]
    [1] 1 2 3
    
    [[2]]
    [1] "A" "B" "C" "D"
    my_list[[1]] # access the first item of the list
    [1] 1 2 3
    my_list[[2]] # access the second item of the list
    [1] "A" "B" "C" "D"

List entries can have names

  • Lists can also be defined and indexed by name:

    my_list <- list(
      numbers=c(1,2,3),
      categories=c("A","B","C","D")
    )
    my_list
    $numbers
    [1] 1 2 3
    
    $categories
    [1] "A" "B" "C" "D"
    my_list$numbers # access the first item of the list
    [1] 1 2 3
    my_list$categories # access the second item of the list
    [1] "A" "B" "C" "D"

Data frames

  • Lists and data frames are the same underlying data structure
  • The elements of a data frame must all have the same length
  • The elements of a list do not need to have all the same length

Creating data frames

  • You may create a data frame with data.frame():

    my_df <- data.frame( # recall '.' has no special meaning in R
      numbers=c(1,2,3),
      categories=c("A","B","C","D")
    )
    Error in data.frame(c(1, 2, 3), c("A", "B", "C", "D")) :
      arguments imply differing number of rows: 3, 4
    my_df <- data.frame(
      numbers=c(1,2,3),
      categories=c("A","B","C")
    )
    my_df
      numbers categories
    1       1          A
    2       2          B
    3       3          C

Subsetting

Subsetting

  • Each of the data structures in R are collections of objects
  • We often would like to select a subset of elements in a collection that have certain properties
    • e.g. numeric values less than a specific threshold
  • Selecting out a subset of elements in a data structure is called subsetting
  • R provides many different methods to subset a data structure depending on the type of data structure

0- vs 1-based indexing

  • R uses 1-based indexing
    • -> the first item in any data structure is referenced with the index 1:
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[4]
[1] 7.7
x[0] # this always returns an empty vector
[1] numeric(0)
x[5] # accessing indices larger than the vector length returns NA
[1] NA

0- vs 1-based indexing

  • C and python use 0-based indexing
    • the first item in a data structure is referenced with the index 0:
x = [3.5, 0.4, 9.1, 7.7]
print(x[0])
3.5
print(x[3])
7.7
print(x[4]) # accessing indices larger than the list length raises error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Subsetting Vectors

There are six different ways to subset vectors in R. The following three are the most common:

  • Positive integers: return specific elements by integer index.
  • Negative integers: omit specific elements by their negative integer index.
  • Logical vectors: return specific elements where the indexing vector is TRUE

Subsetting Vectors: Positive Integers

  • Positive integers: return specific elements by integer index.
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[c(1,3)] # can subset using a vector of integers
[1] 3.5 9.1
x[c(1,1)] # can select the same element multiple times
[1] 3.5 3.5

Subsetting Vectors: Positive Integers

  • Negative integers: omit specific elements by their negative integer index.
x <- c(3.5, 0.4, 9.1, 7.7)
x[-1] # return all but the first element
[1] 0.4 9.1 7.7
x[-c(2,4)] # return the first and third element
[1] 3.5 9.1

Subsetting Vectors: Logical Vectors

  • Logical vectors: return specific elements where the indexing vector is TRUE
x <- c(3.5, 0.4, 9.1, 7.7)
x[c(TRUE,FALSE,FALSE,TRUE)]
[1] 3.5
x[c(FALSE,TRUE,TRUE,FALSE)]
[1] 0.4 9.1
x > 3
[1] TRUE FALSE TRUE TRUE
x[x > 3]
[1] 3.5 9.1 7.7

Subsetting Matrices

  • Matrices may be subset using the same methods as vectors
  • Also, because matrices also have a notion of rows and columns, they may also be subset by pairs of vectors that select either rows or columns
  • Syntax: x[<row selectors>, <column selectors>]
  • selectors may be any of the methods used to subset a vector

Subsetting Matrices

x <- matrix(1:9, nrow=3, byrow=TRUE)
x
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
x[c(1,2), c(1,2)] # first two rows, first two columns
     [,1] [,2]
[1,]    1    2
[2,]    4    5

Subsetting Matrices

x[-c(1,3), ] # leaving a selector blank selects all, second row and all columns
[1] 4 5 6
x[, c(2,3)] # all rows included, last two columns
     [,1] [,2]
[1,]    2    3
[2,]    5    6
[3,]    8    9
x[-c(1,3), c(2)] # select the second row and second column
[1] 5

Subsetting Lists

  • A list can be subset with all the same methods as a vector
l <- list(3.5, 0.4, c(9.1, 5.8), 7.7))
l[1]
[[1]]
[1] 3.5
l[c(2,3)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8

Subsetting Lists

l[c(FALSE, TRUE, TRUE, FALSE)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8

Note: when indexing a list with [, the result returned is always another list; we will discuss this more later.

Subsetting Lists by Name

  • Named lists may be accessed by name
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
l["a"]
$a
[1] 3.5
l[c("a","c")]
$a
[1] 3.5

$c
[1] 9.1 5.8
l[c("b")]

Data frames

  • Data frames may be accessed like both vectors (and therefore lists)
  • Also like matrices, since they are rectangular by construction

[[ and $

  • [[ and $ operators are used to access single items of data structures individual value of a list or data frame and return only that value,
  • We use [[ when indexing by integer or name:
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)

l[1] # returns a list with a single value of 3.5
[[1]]
[1] 3.5
l[[1]] # returns 3.5
[1] 3.5
l[[3]] # returns c(9.1, 5.8)
[1] 9.1 5.8

[[ and $

Naming Data Structures

  • Vectors, matrices, lists, and data frames can have names assigned to their indices
  • For vectors and lists, these names are one-dimensional vectors of characters
  • Assignable and accessible using the names() function:
x <- c(1,2,3)
names(x) # vectors have no names by default
NULL
names(x) <- c("a","b","c")
x
a b c
1 2 3
names(x)
[1] "a" "b" "c"

Naming Lists

  • Lists names are accessible and assignable similarly to vectors:
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
names(l)
[1] "a" "b" "c" "d"
names(l) <- c("f1", "f2", "f3", "f4")
l
$f1
[1] 3.5

$f2
[1] 0.4

$f3
[1] 9.1 5.8

$f4
[1] 7.7

Naming Matrices and Data frames

  • Matrices and data frames also have both rows and columns
  • They can have rownames and colnames
m <- matrix(1:9, nrow=3, byrow=T)
colnames(m) <- c("c1", "c2", "c3")
rownames(m) <- c("r1", "r2", "r3")
m
   c1 c2 c3
r1  1  2  3
r2  4  5  6
r3  7  8  9
m["r1",] # get row 1
c1 c2 c3 
 1  2  3 
m[, "c2"] # get column 2
r1 r2 r3 
 2  5  8 

Logical Tests and Comparators

Logical Tests and Comparators

  • R recognizes logical values as a distinct type

  • R provides all the conventional infix logical operators:

    1 == 1 # equality
    [1] TRUE
    1 != 1 # inequality
    [1] FALSE
    1 < 2 # less than
    [1] TRUE
    1 > 2 # greater than
    [1] FALSE
    1 <= 2 # less than or equal to
    [1] TRUE
    1 >= 2 # greater than or equal to

Logical Tests on Vectors

  • Logical tests are applied to each element of vectors:

    x <- c(1,2,3)
    x == 2
    [1] FALSE TRUE FALSE
    x < 1
    [1] FALSE FALSE FALSE
    c(1,2) == c(1,3)
    [1] TRUE FALSE
    c(1,2) != c(1,3)
    [1] FALSE TRUE
    c(1,2) == c(1,2,3)
    [1] TRUE TRUE FALSE
    Warning message:
    In c(1, 2, 3) == c(1, 2) :
      longer object length is not a multiple of shorter object length
    

Logical Tests on Vectors

  • all() function returns a single boolean value when all results are true:

    x <- c(1,2,3)
    all(x == 2)
    [1] FALSE
    all(x > 0)
    [1] TRUE

Testing the type of a variable

  • R provides many functions of the form is.X where X is some type or condition

    is.numeric(1) # is the argument numeric?
    [1] TRUE
    is.character(1) # is the argument a string?
    [1] FALSE
    is.character("ABC")
    [1] TRUE
    is.numeric(c(1,2,3)) # recall a vector has exactly one type
    [1] TRUE
    is.numeric(c(1,2,"3"))
    [1] FALSE
    is.na(c(1,2,NA))
    [1] FALSE FALSE TRUE

Functions

Functions Intro

  • A function is a symbolic representation of code
  • R provides very large number of functions for common operations
  • You can (will, and should) write your own functions
  • Functions are useful for:
    1. Making your code more concise and readable
    2. Allow you to avoid writing the same code over and over (i.e. reuse it)
    3. Allow you to systematically test pieces of your code
    4. Allow you to share your code easily with others
    5. Program using a functional programming style

Functional Programming

  • R is a functional programming language
  • Emphasizes using functions
  • Advantages of programs written in functional programming languages
    • Concise
    • Predictable
    • Provably correct
    • Performant (e.g. easily parallelizable)

Function Definitions

  • Functions usually accept and execute on different inputs

  • e.g. the mean function wouldn’t be very useful if it didn’t accept a value

    mean(c(1,2,3))
    [1] 2
  • The function must accept or allow you to pass it arguments

    # a basic function signature
    function_name(arg1, arg2) # function accepts 2 arguments

Function Terminology

# a basic function signature
function_name(arg1, arg2) # function accepts 2 arguments
  • arg1 and arg2 are arguments indicating this function accepts two arguments
  • function_name is the name of the function
  • The pattern of arguments it accepts is called the function’s signature.
  • Every function has at least one signature
  • arg1 and arg2 are positional arguments (i.e. order matters)

Most Functions Require Passed Arguments

  • Functions will raise an error if they don’t receive arguments they expect:

    mean() # compute the arithmetic mean, but of what?
    Error in mean.default() : argument "x" is missing, with no default
  • The specific arguments a function accepts is called the function signature

  • You can find details about a function signatures using the documentation, as described next

R Function Documentation

R Function Signatures

  • These are the function signatures for the mean() function:

    mean(x, ...)
    
    ## Default S3 method:
    mean(x, trim = 0, na.rm = FALSE, ...)
  • This function has two signatures

  • The names of the arguments in the signature (e.g. x) are the variable names the function uses in its code to refer to the parameters you pass

  • The named arguments in a function signature are called parameters or formal arguments

R Function Signatures Continued

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
  • The second signature of the mean function introduces two new types of syntax:
    • Default argument values - e.g. trim = 0 named arguments that have a default value if not provided explicitly.
    • Variable arguments - .... This means the mean function can accept arguments that are not explicitly listed in the signature. This syntax is called dynamic dots.

R Function Arguments

Specifying Arguments By Name

  • All function arguments can be specified by name

    # generate 100 normally distributed samples with mean 0 and
    # standard deviation 1
    my_vals <- rnorm(100,mean=0,sd=1)
    mean(my_vals)
    [1] -0.05826857
    mean(x=my_vals)
    [1] -0.05826857
  • To borrow from the Zen of Python:

  • Passing arguments with their names can help avoid bugs!

Beware of ... dynamic dots

  • The ... argument catchall can be very dangerous.

  • It allows you to provide arguments to a function that have no meaning…

  • and R will not raise an error:

    # generates 100 normally distributed samples with mean 0 and
    # standard deviation 1
    my_vals <- rnorm(100,mean=0,sd=1)
    mean(x=my_vals,tirm=0.1)
    [1] -0.05826857
  • Did you spot the mistake?

Beware of ... dynamic dots

# generates 100 normally distributed samples with mean 0 and
# standard deviation 1
my_vals <- rnorm(100,mean=0,sd=1)
mean(x=my_vals,tirm=0.1)
[1] -0.05826857
mean(x=my_vals,trim=0.1)
[1]-0.02139839

DRY: Don’t Repeat Yourself

  • Don’t Repeat Yourself (DRY) is a principle in software engineering

  • You may find yourself writing the same code multiple times

  • DRY == Encapsulate common code into your own functions

    # 100 normally distributed samples with mean 20 and standard deviation 10
    my_vals <- rnorm(100,mean=20,sd=10)
    # standardize
    my_vals_norm <- (my_vals - mean(my_vals))/sd(my_vals)
    mean(my_vals_norm)
    [1] 0
    sd(my_vals_norm)
    [1] 1
  • If you find yourself copying and pasting code from one part of your script to another, you are repeating yourself!

Writing your own functions

  • R allows you to define your own functions

  • Example that adds two arguments together:

    sum_args <- function(arg1, arg2) {
      # code that does something with arg1, arg2, etc
      some_result <- arg1 + arg2
      return(some_result) # explicit return
    }
    
    sum_args <- function(arg1, arg2) {
      # code that does something with arg1, arg2, etc
      arg1 + arg2 # implicit return is last expression in function
    }
  • You choose function name, arguments, implementation

Custom function definition example

  • Example for standardize() function:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
    
    my_vals <- rnorm(100,mean=20,sd=10)
    my_vals_std <- standardize(my_vals)
    c(mean(my_vals_std), sd(my_vals_std))
    [1] 0 1
    
    my_other_vals <- rnorm(100,mean=40,sd=5)
    my_other_vals_std <- standardize(my_other_vals)
    c(mean(my_other_vals_std), sd(my_other_vals_std))
    [1] 0 1

A Note On Returning Values

  • The return() function is not strictly necessary in R

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • The result of the last line of code executed in the body of a function is returned by default

  • However, to again to borrow from the Zen of Python:

    “Explicit is better than implicit.”

  • Being explicit about what a function returns by using the return() function will make your code less error prone and easier to understand

Scope

  • In programming, every variable you define has a scope
  • A variable’s scope defines which parts of the code have access to that variable
  • A variable with universal or top-level scope can be accessed anywhere in a program
  • Variables defined inside functions can only be accessed from within that function

Scope continued

x <- 3
multiply_x_by_two <- function() {
  # x is defined outside the function
  # but is inside this function's scope
  y <- x*2
  return(y)
}
  • x and multiply_x_by_two have universal scope
  • y scope is limited to within the multiply_x_by_two function

Scope continued

  • The multiply_x_by_two function accesses x from the outer scope

    x <- 3
    multiply_x_by_two <- function() {
      # x is defined outside the function
      # but is inside this function's scope
      y <- x*2
      return(y)
    }
  • In general, accessing variables within functions from outside the function’s scope is very bad practice!

Scope continued

  • Functions should be as self contained as possible, any values they need should be passed as parameters

  • This is better:

    x <- 3
    multiply_by_two <- function(x) {
      # x is now bound to the formal argument to the function
      # not x in the outer scope
      y <- x*2
      y
    }

Iteration

  • Iteration refers to stepping sequentially through a set or collection of objects
  • In non-functional languages like python, C, etc. there are particular control structures that implement iteration, commonly called loops
  • E.g. for and while loops in python/Java/C
  • R has these features, but is designed to iterate in a functional way
  • Iteration in R should be done in two ways:
    • vectorized operations
    • functional programming with apply()

Warning about loops in R

  • Note that R does have for and while loop support in the language

  • However, these loop structures can have poor performance

  • Preference should generally be given to the functional style of iteration

    How To Avoid For Loops in R

  • If you really, really want to learn how to use for loops in R, read this, but don’t say I didn’t warn you when your code slows to a crawl for unknown reasons:

    R for Data Science - for loops

Vectorized operations

  • R knows how to perform many operations on vectors and matrices as well as individual values

    x <- c(1,2,3,4,5)
    x + 3 # add 3 to every element of vector x
    [1] 4 5 6 7 8
  • Equivalent python:

    x = [1, 2, 3, 4, 5]
    # list comprehension
    new_x = [x_i+3 for x_i in x]
    # for loop
    new_x = []
    for x_i in x:
      new_x.append(x_i+3)

Vectorized operations on matrices

  • Matrices also support element-wise operations

    x_mat <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
    x_mat + 3 # add 3 to every element of matrix x_mat
    [,1] [,2] [,3]
    [1,]    4    6    8
    [2,]    5    7    9
    # the * operator always means element-wise
    x_mat * x_mat
         [,1] [,2] [,3]
    [1,]    1    9   25
    [2,]    4   16   36

Linear algebra operations

  • R also has syntax for vector-vector, matrix-vector, and matrix-matrix operations

    # the %*% operator stands for matrix multiplication
    x_mat %*% c(1,2,3) # [ 2x3 ] * [ 3 ]
         [,1]
    [1,]   22
    [2,]   28
    # recall t() is the transpose function, making [ 2x3 ] * [ 3x2 ]
    x_mat %*% t(x_mat) # dot product
         [,1] [,2]
    [1,]   35   44
    [2,]   44   56
  • R is optimized for vectorized computation, if you can cast your iteration into a vector or matrix multiplication, it is a good idea to do so.

Functional programming

  • R is a functional programming language, designed around the use of functions
  • Every function can be passed as a variable just like those bound to values
  • This means functions can be passed to other functions

Mathematical example

  • Consider a general formulation of vector transformation:

\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]

  • \(\mathbf{x}\) is a vector of real numbers
  • \(t_r(\mathbf{x})\) is a function that takes \(\mathbf{x}\) and returns a scalar (e.g. a central tendency like an arithmetic mean)
  • \(s(\mathbf{x})\) is a function that takes \(\mathbf{x}\) and computes a scalar scaling factor
  • \(\bar{\mathbf{x}}\) is defined as a vector of the same length where each value has had some average

Mathematical example continued

\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]

  • There are many different ways to define the central value \(t_r(\mathbf{x})\) of a set of numbers:

  • Similarly for scaling strategies \(s(\mathbf{x})\):

    • standard deviation, rescaling factor (e.g. set data range to be between -1 and 1), scaling to unit length (all values sum to 1), and others

Passing functions as functions

  • Consider our standardization function from earlier:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • We have hard coded mean() and sd() as our central tendency and scale functions

  • We can pass these functions as parameters instead:

    # note R already has a built in function named "transform"
    my_transform <- function(x, t_r, s) {
      return((x - t_r(x))/s(x))
    }

Passing functions as functions

  • With the my_transform function:

    # note R already has a built in function named "transform"
    my_transform <- function(x, t_r, s) {
      return((x - t_r(x))/s(x))
    }
  • We can perform Z-score normalization by passing mean and sd as t_r and s, respectively:

    x <- rnorm(100,mean=20,sd=10)
    x_zscore <- my_transform(x, mean, sd) # functions passed as arguments
    mean(x_zscore)
    [1] 0
    sd(x_zscore)
    [1] 1

Generalized Transformation Function

  • Can use any functions for t_r and s so long as they accept a vector of numbers as first argument:

    x <- rnorm(100,mean=20,sd=10)
    x_transform <- my_transform(x, median, sum)
    median(x_transform)
    [1] 0
    # this quantity does not have an a priori known value
    # (or meaning for that matter, it's just an example)
    sum(x_transform)
    [1] 0.013

Custom Transformations

  • We can also write our own functions and pass them to my_transform

  • The following scales the values of x to have a range of \([0,1]\):

    data_range <- function(x) {
      return(max(x) - min(x))
    }
    # my_transform computes: (x - min(x))/(max(x) - min(x))
    x_rescaled <- my_transform(x, min, data_range)
    min(x_rescaled)
    [1] 0
    max(x_rescaled)
    [1] 1

apply() and friends

  • Passing functions as arguments allows us to iterate over collections
  • There are three apply() related functions you should use:
    • lapply(X, FUN) - for when you want a list returned
    • vapply(X, FUN, FUN.VALUE) - for when you want a vector returned
    • apply(X, FUN, MARGIN) - for when X is 2 dimensional (e.g. a matrix)
  • Note:
    • sapply() is also available
    • this function automatically simplifies the result, i.e. it “guesses” what type of output you want
    • Can make your code unpredictable!
    • Recommend against using sapply()!

vapply() for vectors

  • The vapply() function is used for this, with the following signature:

    vapply(X, FUN, FUN.VALUE, ...)
  • X is one-dimensional collection of items (i.e. a list or vector)

  • FUN is the name of a function that can accept any of the items in the list

  • FUN.VALUE is an example value to indicate the type of the returned vector (recall all items in a vector must have the same type)

  • vapply() returns a new vector of type typeof(FUN.VALUE) the same length as X where each item has had the function FUN applied to it

vapply() example

  • Consider the vectorized addition from above:

    x <- c(1,2,3,4,5)
    x + 3
    [1] 4 5 6 7 8
  • We can do the equivalent operation with vapply

    x <- c(1,2,3,4,5)
    add3 <- function(i) {
      i + 3
    }
    # below the 0 means we want a numeric vector back
    vapply(x, add3, 0)
    [1] 4 5 6 7 8

Functional operations on 2 dimensions

  • Recall the Z-score transformation defined earlier:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • This function operates on a single vector

  • We sometimes want to transform each row or column of a matrix separately

  • The apply() function allows a function to be applied to either rows or columns of a 2 dimensional data structure like a matrix or data frame

The apply() function

  • This is the signature of the apply function, from the RStudio help(apply) page:

    apply(X, MARGIN, FUN, ..., simplify = TRUE)
  • X is a matrix or data frame (i.e. a rectangle of numbers)

  • MARGIN indicates whether function should be applied on rows (MARGIN=1) or columns (MARGIN=2)

  • FUN is the name of a function that accepts a vector and returns either a vector or a scalar value

  • apply() then executes FUN on each row or column of X and returns the result

apply() example

zscore <- function(x) {
  return((x-mean(x))/sd(x))
}
# construct matrix of 50x100 normally distributed samples
x_mat <- matrix( rnorm(100*50, mean=20, sd=5),
  nrow=50,
  ncol=100
)
# z-transform the rows so that each column has mean,sd of 0,1
x_mat_zscore  <- apply(x_mat, 2, zscore)
# check columns of x_mat_zscore have mean close to zero with apply
x_mat_zscore_means <- apply(x_mat_zscore, 2, mean)
# note: due to machine precision errors, these results will not be
# exactly zero, but are very close
# note: the all() function returns true if all elements are TRUE
all(x_mat_zscore_means<1e-15)
[1] TRUE

lapply() Function

  • lapply() iterates over the elements of X and returns a list with the result

    lapply(X, FUN, ...)
  • Example:

    x <- list(
      feature1=rnorm(100,mean=20,sd=10),
      feature2=rnorm(100,mean=50,sd=5)
    )
    x_zscore <- lapply(x, zscore)
    # check that the means are close to zero
    x_zscore_means <- lapply(x_zscore, mean)
    all(x_zscore_means < 1e-15)
    [1] TRUE

Installing Packages

  • Advanced functionality in R is provided through packages written and supported by R community members

  • With the exception of bioconductor packages, all R packages are hosted on the Comprehensive R Archive Network (CRAN) web site

  • There are more than 18,000 packages hosted on CRAN

  • To install a package from CRAN, use the install.packages function in the R console:

    # install one package
    install.packages("tidyverse")
    # install multiple packages
    install.packages(c("readr","dplyr"))

Saving and Loading R Data

  • Matrices and data frames can be written to file tabular form (e.g. CSV files)
  • Sometimes it is convenient to save complicated R objects and data structures like lists to a file that can be read back into R easily
  • saveRDS() and readRDS() functions do this

saveRDS() and readRDS example

a_complicated_list <- list(
    categories = c("A","B","C"),
    data_matrix = matrix(c(1,2,3,4,5,6),nrows=2,ncols=3),
    nested_list = list(
      a = c(1,2,3),
      b = c(4,5,6)
    )
)
saveRDS(a_complicated_list, "a_complicated_list.rda")

# later, possibly in a different script
a_complicated_list <- readRDS("a_complicated_list.rda")

Troubleshooting and Debugging

Troubleshooting and Debugging

  • Bugs in code are normal
  • Two kinds of bugs:
    • Syntax errors - code will not run, R will tell you about it
    • Logic errors - code does run, but produces incorrect results
  • You are not a bad programmer if your code has bugs
  • Some bugs can be very difficult to fix, and some are even difficult to find
  • You will spend a substantial amount of time debugging your code in R, especially as you are learning
  • Be patient with yourself and others

Finding questions and answers

  • “It’s always ok to ask for help, but it’s always to your advantage to figure it out yourself.”
  • You will encounter R error and warning messages routinely during development, and not all of them are straightforward to understand.
  • It is important that you learn how to seek the answers to the problems R reports on your own
  • Your colleagues (and instructors!) will thank you for it.

Debugging Strategies

Debugging strategy 1: Google!

  • Copy and paste the error into google and see what comes back
  • Especially when starting out, the errors you receive have been encountered countless times by others before you
  • Solutions/explanations of them are already out there
  • If you aren’t already familiar with Stack Overflow, you will be very soon

Debugging strategy 2: Make it repeatable

  • When you encounter an error, don’t change anything in your code right away
  • Try again to make sure you get the same error again
  • This may require you to isolate the code with the error in a different setting to make it easier to run
  • If you do, this means the error is repeatable, or replicable, and you can now try modifying the code in question to see if and how the error changes.

Debugging strategy 3: Where is the bug?

  • Finding out where the bug is can be hard!
  • Most bugs involve multiple lines of code,
  • Only a subset of which contains the actual error
  • Sometimes the exact line where the error occurs is obvious, but other times the error is a consequence of a mistake assumption made earlier in the code.

Debugging strategy 4: Fix it, test it

  • When you have identified the specific issue causing the bug, modify the code so it produces the correct result
  • Then rigorously test your fix to make sure it is correct
  • Sometimes making one change to code causes side effects elsewhere in your code in ways that are difficult to predict
  • Ideally, you have already written unit tests that explicitly test parts of your code
  • If not you will need to use other means of convincing yourself that your fix worked.

Debugging tools in RStudio

  • The most basic debugging loop is:

    1. Write code
    2. Run code
    3. Print out results
    4. Compare to expected result
    5. Go to 1
  • RStudio, the Environment Inspector in the top right of the interface makes inspecting the current values of your variables very easy

  • You can also easily execute lines of code from your script in the interpreter at the bottom right

  • The str() function can be helpful when in an interpreter and not in RStudio

  • RStudio has many more debugging tools you can use

  • Check out the section on Debugging in Hadley Wickam’s Advanced R book

Coding Style and Conventions

  • Common worries I get from students:
    • “Is my code terrible?” *“How do I write good code?”
  • There is no gold standard for what makes code “good”
  • BUT there are some questions you can ask of your code as a guide

Is my code correct?

  • Does it produce the desired output?
  • It can be harder to be sure of this than you might think, especially as your code becomes more complicated
  • Simple trial and error is an effective first approach
  • A more reliable albeit time- and thought-intensive strategy is to write explicit tests for your code and run them regularly
  • The homework assignments use explicit tests

Does my code follow the DRY principle

  • Don’t Repeat Yourself (DRY) is a powerful and helpful strategy to make your code more reliable
  • This typically involves identifying common patterns in your code and moving them to functions or objects

Did I choose concise but descriptive variable and function names?

  • Variable and function names should be descriptive when necessary and not too long
  • Try to put yourself in the shoes of someone who is reading your code for the first time
  • Can you can figure out what it does?
  • Better yet, offer to buy a friend food/a beverage in return for them looking at it!

Coding Convention Consistency

  • Did I use indentation and naming conventions consistently throughout my code?
  • Consistently formatted code is much easier to read (and possibly understand) than inconsistent code

Poor Consistency Example

  • Consider the code:

    calcVal <- function(x, a, arg=2) { return(sum(x*a)**2)}
    calc_val_2 <- function(x, a, b, arg) {
    res <- sum(b+a*x)**arg
    return(res)}

Issues With This Code

This code is inconsistent in several ways:

  • naming conventions - calcVal camel case, calc_val_2 snake case
  • new lines and whitespace - calcVal is all on one line, calc_val_2 is on multiple lines
  • unhelpful indentation - calc_val_2 function body not indented, close curly brace is appended to the last line
  • unhelpful function and argument names - the function/parameter names don’t describe what they do/mean
  • unused function arguments - the arg argument in calcVal isn’t used anywhere in the function
  • the two functions appear to do something very similar and could be made simpler using a default argument

Better Consistency Example

A more consistent version of this code might look like:

exponent_product <- function(x, a, offset=0, arg=2) {
  return(sum(offset+a*x)**arg)
}

This code is much cleaner, more consistent, and easier to read.

Did I write comments, especially when what the code does is not obvious?

  • Sometimes what a piece of code does is obvious from looking at it:

    x <- x + 1
  • However, it may not be obvious why a piece of code does what it does

  • Consider recording your thinking about a line of code as a comment:

# add 1 as a pseudocount
x <- x + 1
  • Then when you or someone else reads the code, it will be obvious what you were thinking when you wrote it
  • You will encounter situations where you need to figure out what you yourself were thinking when you wrote a piece of code
  • Endeavor to make future you proud of current you!

How easy would it be for someone else to understand my code?

  • If someone else who has never seen my code before is asked to run and understand it…
  • How easy would it be for them to do so?

Is my code easy to maintain/change?

  • Well written code is easier to understand
  • Code that is easy to understand is easier to modify than abstruse code
  • You will gain a sense for this over time

The styler package

  • Consistently formatted code is generally much easier to read than inconsistently formatted code

  • Consistent formatting may also allow you to identify syntax and logic errors much more easily than it might be otherwise

  • The styler package is an R package that can automatically format your code to make it consistent in a number of ways

  • When you install styler with install.packages("styler") in RStudio, a new entry is available in the Addins menu:

  • These Addins allow you to let styler format your code for you according to some reasonable (albeit arbitrary) conventions.