Course Introduction

Biology as Data Science

The sequencing of the human genome ushered in the “post-genome” biological revolution

Biology is now a data science

Domains of Biological Data Analysis

Modern biological data analysis requires skills and knowledge from many domains:

  • molecular biology, genetics, biochemistry
  • statistics, mathematics
  • computer science
  • programming & software engineering
  • data visualization
  • high performance & cloud computing

No one person can be expert in all these areas!

Experts create tools and techniques that we can use.

The R Programming Language

  • R is a statistical programming language
  • https://www.r-project.org/
  • Designed to conduct statistical analyses and visualize data
  • NOT a general purpose programming language!

tidyverse

Book & Course Objectives

Learn R and its related packages to analyze biological data

Communicate results of R analyses with effective visualizations and notebooks

Learn how to use the RStudio development environment

Write correct and reproducible code using formal testing strategies

Share analyses with others using RShiny applications

Course Topics

Who This Book Is For

You are: a practicing biologist wishing to learn how to use R to analyze biological data

We assume a basic working knowledge of:

  • genetics
  • genomics
  • molecular biology
  • biochemistry
  • statistics

We endeavor to explain required background whenever possible to relax these assumptions

Sources and References

R Materials

Sources and References

Data visualization

Course Structure & Assignments

Course Structure

  • Weekly lectures

  • 7 assignments, roughly one per week

  • Final project: RShiny application combining the techniques you learned in the assignments

  • Grading:

    • Assignments 5% each (35% total)
    • Final project 60%
    • Class attendance/participation 5%
  • Zoom link will be provided in email / blackboard

Things that are more important than this course

  • Your physical, emotional and mental health

  • Your family and friends

  • Policy on absences / missed classes

    • You never need to disclose anything private to me, just let me know if you will be absent for an extended period of time or need an extension, and I will work with you to accommodate your situation.

Assignments

  • Detailed instructions are in the book

  • Assignments hosted on GitHub

  • We will use GitHub Classroom to make them available

  • More on this later

Data in Biology

Introduction

  • Molecular biology became a data science in 1953 with the determination of the DNA chemical structure
  • Digital computer technology and data storage technologies developed around the same time
  • Amino acid sequences were the first to be determined, followed by nucleotide sequences
  • Bioinformatics was defined by Paulien Hogeweg and Ben Hesper in 1970: “the study of informatic processes in biotic systems”
  • Expanded in early 1980s as sequence data volume grew
  • Human genome project was formally launched in 1990

Biological Data Timeline

Human Genome Era

The Biologist’s Tools

Preliminaries

The R Language

  • R is a free, open source programming language and computing environment
  • R is a statistical programming language
    • not a general purpose programming language (like python)!
  • Designed for data manipulation, analysis, and visualization
  • Download for free from Comprehensive R Archive Network (CRAN)

Effective bioinformatics

  • Many tools are required in a complete bioinformatics analysis

    • General purpose languages e.g. python
    • Purpose specific languages e.g. R, nextflow/snakemake
    • File/text manipulation tools e.g. cat, grep, etc
    • Bioinformatics specific tools e.g. aligners (bowtie, STAR)
    • Scripting languages a.k.a. “glue code” e.g. bash
  • Which tools are appropriate for which steps in an analysis is learned with time!

RStudio

RStudio Basic Interface

Turn Off Environment Restore in RStudio!

  • By default, RStudio preserves your R environment when you shut it down and restores it when you start it again.

  • This is very bad practice! Turn it off right away!

  • Open the Tools > Global Options… menu and:

    1. Uncheck “Restore .RData into workspace at startup”
    2. Set “Save workspace to .RData on exit:” to “Never”

Turn Off Environment Restore!

The R Script

  • R is both a programming language and a computing environment

  • An R program contains many lines of code that are executed in sequence

  • An R script is a file that contains lines of R code

  • Give your R scripts names that are descriptive but concise:

    • Bad: do_it.R
    • Better but still bad: a_script_with_very_cool_analysis_and_plots.R
    • Better: analyze_gene_expression.R

Create a new R Script in RStudio

  • File -> New File -> R Script

The Scripting Workflow

  • Common workflow:

    1. Write and save code into your R script
    2. Execute the lines of code you are working on with:
      • Control-Enter on Windows
      • Command-Enter on Mac
    3. Code will execute in Console window, where you may execute other commands to examine results
    4. You may inspect the variables you have defined in the Environment pane
    5. Make adjustments to your R script if needed
    6. Go to step 1

Example script

# stores the result of 1+1 into a variable named 'a'
a <- 1+1

Rmarkdown & knitr

Communicating with R

  • All research results must be communicated at some point

  • Results are encoded in text, tables, and plots

  • Methods are described in text, and (ideally) the code itself

    • Many journals now require code to be made available upon publication
  • Providing code can make analyses more reproducible

  • Code notebooks are tools that combine code, results, and text into one place

Markup languages & markdown

  • A markup language annotates and decorates plain text with information about its formatting and structure

  • Both machine- and human-readable

  • The same markup text might be converted into different formats, e.g. HTML or PDF

  • markdown is one such markup language

Markdown Examples

You can *emphasize* text, or **really emphasize it**.

Lists are pretty easy to read as well:

* item 1
* item 2
* item 3

You can emphasize text, or really emphasize it.

Lists are pretty easy to read as well:

  • item 1
  • item 2
  • item 3

Markdown Examples

If you need an enumerated list you can do that too:

1. item 1
2. item 2
3. item 3

If you need an enumerated list you can do that too:

  1. item 1
  2. item 2
  3. item 3

Markdown Examples

You can easily include links to web sites like
[Google](http://google.com) and images:
![kitty](https://upload.wikimedia.org/wikipedia/commons/b/bc/Juvenile_Ragdoll.jpg)

You can easily include links to web sites like Google and images:

Markdown Tables

+------+------+-----+-------+
| some | data | and | stuff |
+======+======+=====+=======+
|   A  |   1  |  2  |   3   |
+------+------+-----+-------+
|   B  |   4  |  5  |   6   |
+------+------+-----+-------+

some  data  and  stuff
A     1     2    3
B     4     5    6

RMarkdown

  • RMarkdown is an extension of markdown that works in R
  • Can include code blocks in R in between markdown formatted text that execute
  • Files with RMarkdown are called RMarkdown notebooks
  • RMarkdown files typically end with .Rmd
  • RStudio has full RMarkdown integration to make writing RMarkdown notebooks very easy

R code blocks in RMarkdown

  • Lines starting with ```{r} define a special code block

  • Example:

    
    ```{r}
    a <- 1
    ```
    
  • R code in the block will be executed

  • Code output placed below the block in the report

  • RStudio shows you the output of the notebook within its interface

RMarkdown in RStudio

R code block output

knitr

  • The knitr R package turns RMarkdown into a report
  • The process is called “knitting”
  • Same report can be knitted into many different formats
    • HTML
    • PDF
    • Microsoft Word
    • Slides (like these ones)

git + github

git motivation

  • Code you write changes over time

  • We would like to maintain previous versions of code, in case new changes break the code

  • Simple approach: make copies of scripts and rename them:

    • Original script: my_R_script.R
    • New analysis is required: my_R_script_v2.R
    • Found bug in new analysis: my_R_script_v2_BAD.R
    • Script with fixed bug: my_R_script_v2_final.R
    • Changes after manuscript revision: my_R_script_v2_final_revision.R
    • And so on…

A note on bugs…

  • A bug is a piece of code that produces an incorrect or undesirable result

  • Bugs are normal! You’re not a bad programmer if your code has bugs.

  • There are two types of bugs:

    1. Syntax bugs: bugs due to incorrect language usage, which R will tell you about and can (usually) be easily identified and fixed
    2. Logic bugs: the code you write is syntactically correct, but does something other than what you intend

Enter git

  • We want to preserve old versions of code
  • We don’t want to clutter our directories with outdated files
  • git is a free, open source version control software program.
  • Version control software is used to track and record changes to code over time
  • Many developers may work on the same software project concurrently from different parts of the world
  • Can be used on the command line, or with graphical user interface applications for popular operating systems.

git concepts

  • repository (or repo): a collection of files in a directory that you have asked git to track (run git init in a new directory)
  • Each file you wish to track must be explicitly added to the repo
  • When you modify a tracked file, git will notice those differences
  • Track changes by making a commit
    • A commit takes a snapshot of all the tracked files in the repo at the time the commit is made (with a commit message that describes what was done)
  • Each commit has a date and time associated with it. The files in the repo can be reset to exactly the state they were in at any commit, thus preserving all previous versions of code.

Basic git workflow

At the beginning, create a new repo with git init . or with a graphical interface

  1. Create/modify files
  2. Add changes with git add <filename>...
  3. Check changes with git status, verify changes are as intended
  4. Commit changes with git commit -m "<commit message>"
  5. Go to 1

Git hosting platforms (GitHub)

  • git software only works on your local computer with local repositories
  • To share this code with others, and receive others’ contributions, repo must be made available in a centralized location
  • github.com is a free web application that hosts git repos
  • Note: There is no formal relationship between git and GitHub - the only connection between GitHub and git is that GitHub hosts git repos.

GitHub Basics

  • First you must have an account on GitHub

  • Create a new repo on GitHub

  • Then:

    • If you do not already have a local git repo: clone your GitHub repo to create a local copy connected to GitHub
    • If you already have a local git repo: follow the instructions on GitHub to connect your local repo to GitHub as a remote
  • Now your local repo is connected to the same repo on GitHub

  • Changes you make to your local files can be sent, or pushed to the repo on GitHub

git+GitHub Workflow

  1. Get (pull) any changes from GitHub with git pull
  2. Create/modify files
  3. Add changes with git add <filename>...
  4. Check changes with git status, verify changes are as intended
  5. Commit changes with git commit -m "<commit message>"
  6. Send (push) changes to GitHub with git push
  7. Go to 1

R Programming

R Syntax Basics

  • R (like all programming languages) is basically a fancy calculator:

    1 + 2 # addition
    [1] 3
    3 - 2 # subtraction
    [1] 1
    4 * 2 # multiplication
    [1] 8
    4 / 2 # division
    [1] 2
  • The [1] lines above are the output given by R when the preceding expression is run

  • Any portion of a line starting with a # is a comment and ignored by R

R Arithmetic Continued

1.234 + 2.345 - 3.5*4.9 # numbers can have decimals
[1] -13.571
1.234 + (2.345 - 3.5)*4.9 # expressions can contain parentheses
[1] -4.4255
2**2 # exponentiation
[1] 4
4**(1/2) # square root
[1] 2
27**(1/3) # cube root
[1] 3

Reading R

  • R assigns values to symbolic placeholders called variables

  • Expressions can be assigned into a variable with a name using the <- operator:

    new_var <- 1 + 2
  • Variables values are used in later execution:

    new_var - 2
    [1] 1
    another_var <- new_var * 4

Note: “<-” vs “=”

  • The correct way to assign a value to a variable in R is with the <- syntax

  • Many other programming languages use =

  • = assignment syntax does work in R:

    new_var = 2 # works, but is not common convention!
  • BUT this is considered bad practice and may cause confusion later

  • You should always use the <- syntax when assigning values to variables!

Note: . has no special meaning in R

  • the period . does not have a special meaning like it does in many other languages like python, C, javascript, etc.
  • e.g. new.var is a valid variable name just like new_var
  • It is good practice to avoid using . characters in your variable names to reduce the chances of conflicts and confusion

Basic Types of Values

  • The type of a variable refers to the kind of value it holds, e.g.
    • Number
    • Characters (string)
    • Logical (TRUE/FALSE)

Basic Types: Numeric

  • A single number e.g. 1.0 or 1e-5 for \(10^{-5}\)
  • By default, no distinction between fractional (e.g. 1.123) and integer numbers (1); both are stored as double-precision numbers
  • TRUE/FALSE - Logical or boolean values
    • TRUE stored as the number 1
    • FALSE stored as the number 0
    • Non-zero numbers considered “true” in R, zero considered “false”
  • Inf/-Inf - Infinity - special values that result from, e.g., dividing a non-zero number by 0
  • NaN - “impossible” value for the expression 0/0
  • complex numbers - R can store complex numbers using the complex function
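
The special numeric values above can be produced directly at the console. This is a minimal sketch using only base R; the variable names are illustrative and not part of the course materials.

```r
# Dividing a non-zero number by zero gives Inf or -Inf; 0/0 gives NaN
pos_inf <- 1 / 0
neg_inf <- -1 / 0
not_a_number <- 0 / 0

# Logical values act as 1 (TRUE) and 0 (FALSE) in arithmetic
n_true <- TRUE + TRUE + FALSE

# Numbers convert to logicals: zero is FALSE, anything non-zero is TRUE
zero_is_false <- as.logical(0)
five_is_true <- as.logical(5)

# complex() builds a complex number from its real and imaginary parts
z <- complex(real = 1, imaginary = 2)

print(c(pos_inf, neg_inf, not_a_number))
print(n_true)
print(z * z)
```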

Missingness

  • R can handle missing values
  • NA - a special value that indicates a value is missing
  • NULL. Similar to NA, NULL indicates that a vector, rather than a value, is missing
  • More on vectors later
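
The difference between NA and NULL shows up when they are placed inside a vector. A small sketch using base R functions (is.na(), is.null(), length()); the variable names are only for illustration.

```r
# NA is a placeholder for a missing value and still occupies a position
with_na <- c(1, NA, 3)
print(is.na(with_na))   # which positions are missing?
print(length(with_na))  # NA counts toward the length

# NULL means "no vector at all" and disappears when combined with values
with_null <- c(1, NULL, 3)
print(length(with_null))
print(is.null(NULL))
```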

NA, NaN, and Inf

  • R cannot compute meaningful results from NA or NaN values, and Inf values propagate through most computations

  • These values have an ‘infectious’ quality to them

  • When mixed in with other values, the result of the computation reverts to the first of these values encountered:

    # this is how to create a vector of 4 values in R
    x <- c(1,2,3,NA)
    mean(x) # compute the mean of values that includes NA
    [1] NA
    mean(x,na.rm=TRUE) # remove NA values prior to computing mean
    [1] 2
    mean(c(1,2,3,NaN))
    [1] NaN
    mean(c(NA,NaN,1))
    [1] NA

Other Types

  • factors - Factors are a complex type used in statistical models and are covered in greater detail later
  • character data - "value". R can store character data in the form of strings. Note R does not interpret string values by default, so "1" and 1 are distinct.
  • dates and times - basic type to store dates and times (together termed a datetime, which includes both components)
    • Internally, R stores dates as the number of days since January 1, 1970, and datetimes as the number of seconds since that date, using negative numbers for earlier dates.
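
One way to see the internal representation is to convert date values to plain numbers. This sketch assumes base R's Date class (which counts days) and POSIXct class (which counts seconds); the UTC time zone is set explicitly so the result does not depend on the local machine.

```r
# A Date is stored as the number of days since 1970-01-01
one_day_after <- as.Date("1970-01-02")
one_day_before <- as.Date("1969-12-31")
print(as.numeric(one_day_after))
print(as.numeric(one_day_before))  # earlier dates are negative

# A POSIXct datetime is stored as the number of seconds since 1970-01-01 00:00:00 UTC
one_minute_after <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
print(as.numeric(one_minute_after))
```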

Data Structures

Vectors

  • Data structures in R (and other languages) are ways of storing and organizing more than one value together

  • Most basic data structure in R is a one dimensional sequence of values called a vector:

    # the c() function creates a vector
    x <- c(1,2,3)
    x
    [1] 1 2 3
  • The vector in R has a special property that all values contained in the vector must have the same type

Vectors continued

  • When constructing a vector, R will coerce values to the most general type if it encounters values of different types:

    c(1,2,"3")
    [1] "1" "2" "3"
    c(1,2,TRUE,FALSE)
    [1] 1 2 1 0
    c(1,2,NA) # note missing values stay missing
    [1] 1 2 NA
    c("1",2,NA,NaN) # NA stays, NaN is cast to a character type
    [1] "1" "2" NA "NaN"
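
The class() function reports which type R settled on after coercion. A brief sketch with base R only; the variable names are illustrative.

```r
# Mixing numbers and strings coerces everything to character
mixed_chr <- c(1, 2, "3")
print(class(mixed_chr))

# Mixing numbers and logicals coerces the logicals to numbers
mixed_num <- c(1, 2, TRUE)
print(class(mixed_num))

# NA takes on the type of the vector it is in
only_lgl <- c(TRUE, FALSE, NA)
print(class(only_lgl))

# Strings that look like numbers can be converted back explicitly
back_to_numbers <- as.numeric(c("1", "2", "3"))
print(back_to_numbers)
```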

Vectors continued

  • vectors also have a length, which is defined as the number of elements in the vector

    x <- c(1,2,3)
    length(x)
    [1] 3

Vector operations

  • R is much more efficient at operating on vectors than individual elements separately

  • With numeric vectors, you can perform arithmetic operations on vectors of compatible size just as easily as individual values

    c(1,2) + c(3,4)
    [1] 4 6
    c(1,2) * c(3,4)
    [1] 3 8

Vector arithmetic warning!

  • R multiplies vectors of different length in a strange way:

    c(1,2) * c(3,4,5) # lengths 2 and 3 not evenly divisible!
    [1] 3 8 5
    Warning message:
    In c(1, 2) * c(3, 4, 5) :
      longer object length is not a multiple of shorter object length
  • R recycles the values of the shorter vector until all values of the longer vector are used

  • Above, c(1,2) * c(3,4,5) yields: \(1*3\), \(2*4\), \(1*5\), weird!

  • When the vector lengths are evenly divisible, no warning raised:

    c(1,2) * c(3,4,5,6) # yields: 1*3 2*4 1*5 2*6
    [1] 3 8 5 12
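
The cycling behavior can be made explicit with rep_len(), which repeats a vector out to a given length. The sketch below also includes a hypothetical helper, safe_multiply(), that refuses to multiply vectors of different lengths; it is an illustration, not a base R function.

```r
short <- c(1, 2)
long <- c(3, 4, 5, 6)

# Repeat the shorter vector until it is as long as the longer one
recycled <- rep_len(short, length(long))
print(recycled)

# Multiplying with the explicitly repeated vector gives the same answer
print(identical(short * long, recycled * long))

# Hypothetical helper: raise an error instead of cycling silently
safe_multiply <- function(a, b) {
  stopifnot(length(a) == length(b))
  a * b
}
print(safe_multiply(recycled, long))
```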

Factors

  • Factors are objects that R uses to handle categorical variables

    • i.e. variables that can take one of a distinct set of values for each sample
  • We create a factor from a vector of character strings using the factor() function

    case_status <- factor(
      c('Disease','Disease','Disease',
        'Control','Control','Control'
      )
    )
    case_status
    [1] Disease Disease Disease Control Control Control
    Levels: Control Disease

Factors are numeric vectors

  • The distinct values in the factor are called levels

  • Internally, a factor is stored as a vector of integers where all values of the same level share the same integer:

    as.numeric(case_status)
    [1] 2 2 2 1 1 1
    str(case_status)
     Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
  • By default, R assigns integers to levels in alphanumeric order

    • e.g. "Control" is set to 1, "Disease" is set to 2
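
The link between levels and integers can be inspected with levels() and as.integer(). A minimal sketch in base R; it recreates the case_status factor so the block runs on its own.

```r
case_status <- factor(
  c("Disease", "Disease", "Disease", "Control", "Control", "Control")
)

# The levels are the distinct values, in the order R assigned them
print(levels(case_status))

# The integer codes index into the levels
codes <- as.integer(case_status)
print(codes)

# Looking the codes up in the levels recovers the original labels
labels_again <- levels(case_status)[codes]
print(labels_again)
```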

Changing Factor Level Numbers

  • You can change which levels are assigned to which number

  • The integers assigned to each level can be specified explicitly when creating the factor:

    case_status <- factor(
      c('Disease','Disease','Disease','Control','Control','Control'),
      levels=c('Disease','Control')
    )
    case_status
    [1] Disease Disease Disease Control Control Control
    Levels: Disease Control
    str(case_status)
     Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2

Character data are loaded as factors

  • In R versions before 4.0.0, the base R functions read.csv/read.table load columns with character values as factors by default (since R 4.0.0 the default is stringsAsFactors=FALSE)

  • You may turn this off with stringsAsFactors=FALSE to read.csv()

  • You may change the level mapping of an existing factor by releveling it by passing a factor into the factor() function and specifying new levels

    str(case_status)
     Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
    str(factor(case_status, levels=c("Control","Disease")))
     Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
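
When only the first (reference) level needs to change, the relevel() function from the stats package offers an alternative to respecifying every level. This is a sketch of that alternative, not the approach shown above; the values in the factor stay the same and only the level order changes.

```r
case_status <- factor(
  c("Disease", "Disease", "Disease", "Control", "Control", "Control"),
  levels = c("Disease", "Control")
)

# Move "Control" to the front so it becomes level 1
releveled <- relevel(case_status, ref = "Control")
print(levels(releveled))
print(as.integer(releveled))
```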

Matrices

  • A matrix in R is simply the 2 dimensional version of a vector
  • i.e. a rectangle of values that all have the same type, e.g. number, character, logical, etc.
  • Constructed using the vector notation described above and specifying the number of rows and columns the matrix should have
  • Instead of having a length like a vector, it has \(m \times n\) dimensions

Matrix construction example

# create a matrix with two rows and three columns containing integers
A <- matrix(c(1,2,3,4,5,6),
       nrow = 2, ncol = 3, byrow = TRUE
      )
A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
dim(A) # prints out the dimensions of the matrix, rows first
[1] 2 3
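
The byrow argument controls the order in which the values fill the matrix. A short sketch comparing both settings; matrix() fills column by column unless byrow = TRUE is given.

```r
# Fill row by row: the first row receives 1, 2, 3
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Default fill is column by column: the first column receives 1, 2
by_col <- matrix(1:6, nrow = 2, ncol = 3)

print(by_row)
print(by_col)
print(by_row[1, ])  # first row when filled by row
print(by_col[1, ])  # first row when filled by column
```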

Transposing matrices

  • Matrices can be transposed from \(m \times n\) to be \(n \times m\) using the t() function

    A
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    t(A)
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    dim(t(A))
    [1] 3 2

Lists and data frames

  • Vectors and matrices have the special property that all items must be of the same type
  • Lists and data frames are data structures that do not have this requirement
  • Lists and data frames are both one dimensional sequences of values

Lists

  • Lists can be created using the list() function:

    my_list <- list(
      c(1,2,3),
      c("A","B","C","D")
    )
    my_list
    [[1]]
    [1] 1 2 3
    
    [[2]]
    [1] "A" "B" "C" "D"
    my_list[[1]] # access the first item of the list
    [1] 1 2 3
    my_list[[2]] # access the second item of the list
    [1] "A" "B" "C" "D"

List entries can have names

  • Lists can also be defined and indexed by name:

    my_list <- list(
      numbers=c(1,2,3),
      categories=c("A","B","C","D")
    )
    my_list
    $numbers
    [1] 1 2 3
    
    $categories
    [1] "A" "B" "C" "D"
    my_list$numbers # access the first item of the list
    [1] 1 2 3
    my_list$categories # access the second item of the list
    [1] "A" "B" "C" "D"

Data frames

  • Lists and data frames are the same underlying data structure
  • The elements of a data frame must all have the same length
  • The elements of a list do not need to have all the same length
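
That a data frame is a list of equal-length columns can be checked directly. A minimal sketch in base R; stringsAsFactors = FALSE is passed explicitly so the block behaves the same across R versions.

```r
my_df <- data.frame(
  numbers = c(1, 2, 3),
  categories = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

print(is.list(my_df))  # a data frame is a list underneath
print(length(my_df))   # its length is the number of columns
print(names(my_df))    # element names are the column names
print(my_df[[1]])      # each element is a vector holding one column
```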

Creating data frames

  • You may create a data frame with data.frame():

    my_df <- data.frame( # recall '.' has no special meaning in R
      numbers=c(1,2,3),
      categories=c("A","B","C","D")
    )
    Error in data.frame(c(1, 2, 3), c("A", "B", "C", "D")) :
      arguments imply differing number of rows: 3, 4
    my_df <- data.frame(
      numbers=c(1,2,3),
      categories=c("A","B","C")
    )
    my_df
      numbers categories
    1       1          A
    2       2          B
    3       3          C

Naming Data Structures

Naming Data Structures

  • Vectors, matrices, lists, and data frames can have names assigned to their indices
  • For vectors and lists, these names are one-dimensional vectors of characters
  • Assignable and accessible using the names() function:
x <- c(1,2,3)
names(x) # vectors have no names by default
NULL
names(x) <- c("a","b","c")
x
a b c
1 2 3
names(x)
[1] "a" "b" "c"

Naming Lists

  • Lists names are accessible and assignable similarly to vectors:
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
names(l)
[1] "a" "b" "c" "d"
names(l) <- c("f1", "f2", "f3", "f4")
l
$f1
[1] 3.5

$f2
[1] 0.4

$f3
[1] 9.1 5.8

$f4
[1] 7.7

Naming Matrices and Data frames

  • Matrices and data frames also have both rows and columns
  • They can have rownames and colnames
m <- matrix(1:9, nrow=3, byrow=TRUE)
colnames(m) <- c("c1", "c2", "c3")
rownames(m) <- c("r1", "r2", "r3")
m
   c1 c2 c3
r1  1  2  3
r2  4  5  6
r3  7  8  9
m["r1",] # get row 1
c1 c2 c3 
 1  2  3 
m[, "c2"] # get column 2
r1 r2 r3 
 2  5  8 

Subsetting

Subsetting

  • Each of the data structures in R is a collection of objects
  • We often would like to select a subset of elements in a collection that have certain properties
    • e.g. numeric values less than a specific threshold
  • Selecting out a subset of elements in a data structure is called subsetting
  • R provides many different methods to subset a data structure depending on the type of data structure

0- vs 1-based indexing

  • R uses 1-based indexing
  • First item in any data structure is referenced with the index 1:
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[4]
[1] 7.7
x[0] # this always returns an empty vector
numeric(0)
x[5] # accessing indices larger than the vector length returns NA
[1] NA

0- vs 1-based indexing

  • C and python use 0-based indexing
  • First item in a data structure is referenced with the index 0:
x = [3.5, 0.4, 9.1, 7.7]
print(x[0])
3.5
print(x[3])
7.7
print(x[4]) # accessing indices larger than the list length raises error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Subsetting Vectors

There are six different ways to subset vectors in R. The following three are the most common:

  • Positive integers: return specific elements by integer index.
  • Negative integers: omit specific elements by their negative integer index.
  • Logical vectors: return specific elements where the indexing vector is TRUE

Subsetting Vectors: Positive Integers

  • Positive integers: return specific elements by integer index.
x <- c(3.5, 0.4, 9.1, 7.7)
x[1]
[1] 3.5
x[c(1,3)] # can subset using a vector of integers
[1] 3.5 9.1
x[c(1,1)] # can select the same element multiple times
[1] 3.5 3.5

Subsetting Vectors: Negative Integers

  • Negative integers: omit specific elements by their negative integer index.
x <- c(3.5, 0.4, 9.1, 7.7)
x[-1] # return all but the first element
[1] 0.4 9.1 7.7
x[-c(2,4)] # return the first and third element
[1] 3.5 9.1

Subsetting Vectors: Logical Vectors

  • Logical vectors: return specific elements where the indexing vector is TRUE
x <- c(3.5, 0.4, 9.1, 7.7)
x[c(TRUE,FALSE,FALSE,TRUE)]
[1] 3.5 7.7
x[c(FALSE,TRUE,TRUE,FALSE)]
[1] 0.4 9.1
x > 3
[1] TRUE FALSE TRUE TRUE
x[x > 3]
[1] 3.5 9.1 7.7
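
Logical tests can be combined with & (and) and | (or) before subsetting, and which() converts a logical vector into integer positions. A small sketch using the same x as above.

```r
x <- c(3.5, 0.4, 9.1, 7.7)

# Keep values that satisfy both conditions
between <- x[x > 3 & x < 9]
print(between)

# Keep values that satisfy either condition
extremes <- x[x < 1 | x > 9]
print(extremes)

# which() gives the positions where the test is TRUE
positions <- which(x > 3)
print(positions)

# Summing a logical vector counts the TRUE values
n_above <- sum(x > 3)
print(n_above)
```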

Subsetting Matrices

  • Matrices may be subset using the same methods as vectors
  • Also, because matrices also have a notion of rows and columns, they may also be subset by pairs of vectors that select either rows or columns
  • Syntax: x[<row selectors>, <column selectors>]
  • selectors may be any of the methods used to subset a vector

Subsetting Matrices

x <- matrix(1:9, nrow=3, byrow=TRUE)
x
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
x[c(1,2), c(1,2)] # first two rows, first two columns
     [,1] [,2]
[1,]    1    2
[2,]    4    5

Subsetting Matrices

x[-c(1,3), ] # leaving a selector blank selects all, second row and all columns
[1] 4 5 6
x[, c(2,3)] # all rows included, last two columns
     [,1] [,2]
[1,]    2    3
[2,]    5    6
[3,]    8    9
x[-c(1,3), c(2)] # select the second row and second column
[1] 5

Subsetting Lists

  • A list can be subset with all the same methods as a vector
l <- list(3.5, 0.4, c(9.1, 5.8), 7.7)
l[1]
[[1]]
[1] 3.5
l[c(2,3)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8

Subsetting Lists

l[c(FALSE, TRUE, TRUE, FALSE)]
[[1]]
[1] 0.4

[[2]]
[1] 9.1 5.8

Note: when indexing a list with [, the result returned is always another list; we will discuss this more later.

Subsetting Lists by Name

  • Named lists may be accessed by name
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)
l["a"]
$a
[1] 3.5
l[c("a","c")]
$a
[1] 3.5

$c
[1] 9.1 5.8
l[c("b")]
$b
[1] 0.4

Data frames

  • Data frames may be subset like lists (and therefore like vectors)
  • Also like matrices, since they are rectangular by construction

[[ and $

  • The [[ and $ operators access a single item of a list or data frame and return only that value
  • We use [[ when indexing by integer or name:
l <- list(a=3.5, b=0.4, c=c(9.1, 5.8), d=7.7)

l[1] # returns a list with a single value of 3.5
[[1]]
[1] 3.5
l[[1]] # returns 3.5
[1] 3.5
l[[3]] # returns c(9.1, 5.8)
[1] 9.1 5.8

[[ and $

l["a"] # returns list(a=3.5)
$a
[1] 3.5
l$a # returns 3.5
[1] 3.5
l[["a"]] # also returns 3.5
[1] 3.5
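
The difference between [ and [[ is easiest to see by checking the class of what each returns. A minimal sketch using the same list l as above.

```r
l <- list(a = 3.5, b = 0.4, c = c(9.1, 5.8), d = 7.7)

# [ keeps the result wrapped in a list
print(class(l["a"]))

# [[ and $ unwrap the item itself
print(class(l[["a"]]))
print(class(l$a))

# Once unwrapped, the item can be subset like any vector
second_of_c <- l[["c"]][2]
print(second_of_c)
```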

Logical Tests and Comparators

Logical Tests and Comparators

  • R recognizes logical values as a distinct type

  • R provides all the conventional infix logical operators:

    1 == 1 # equality
    [1] TRUE
    1 != 1 # inequality
    [1] FALSE
    1 < 2 # less than
    [1] TRUE
    1 > 2 # greater than
    [1] FALSE
    1 <= 2 # less than or equal to
    [1] TRUE
    1 >= 2 # greater than or equal to
    [1] FALSE

Logical Tests on Vectors

  • Logical tests are applied to each element of vectors:

    x <- c(1,2,3)
    x == 2
    [1] FALSE TRUE FALSE
    x < 1
    [1] FALSE FALSE FALSE
    c(1,2) == c(1,3)
    [1] TRUE FALSE
    c(1,2) != c(1,3)
    [1] FALSE TRUE
    c(1,2) == c(1,2,3)
    [1] TRUE TRUE FALSE
    Warning message:
    In c(1, 2) == c(1, 2, 3) :
      longer object length is not a multiple of shorter object length
    

Logical Tests on Vectors

  • all() function returns a single boolean value that is TRUE only when all results are true:

    x <- c(1,2,3)
    all(x == 2)
    [1] FALSE
    all(x > 0)
    [1] TRUE
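
The companion function any() returns TRUE when at least one result is true. The sketch below also shows how a missing value affects all(); both functions are part of base R and accept na.rm.

```r
x <- c(1, 2, 3)

at_least_one <- any(x == 2)  # one element equals 2
none_match <- any(x > 3)     # no element is greater than 3
print(at_least_one)
print(none_match)

# A missing value makes the answer unknown unless it is removed
unknown <- all(c(TRUE, NA))
known <- all(c(TRUE, NA), na.rm = TRUE)
print(unknown)
print(known)
```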

Testing the type of a variable

  • R provides many functions of the form is.X where X is some type or condition

    is.numeric(1) # is the argument numeric?
    [1] TRUE
    is.character(1) # is the argument a string?
    [1] FALSE
    is.character("ABC")
    [1] TRUE
    is.numeric(c(1,2,3)) # recall a vector has exactly one type
    [1] TRUE
    is.numeric(c(1,2,"3"))
    [1] FALSE
    is.na(c(1,2,NA))
    [1] FALSE FALSE TRUE

Functions

Functions Intro

  • A function is a symbolic representation of code
  • R provides very large number of functions for common operations
  • You can (will, and should) write your own functions
  • Functions are useful for:
    1. Making your code more concise and readable
    2. Avoiding writing the same code over and over (i.e. reusing it)
    3. Systematically testing pieces of your code
    4. Sharing your code easily with others
    5. Programming in a functional programming style

Functional Programming

  • R is a functional programming language
  • Emphasizes using functions
  • Advantages of programs written in functional programming languages
    • Concise
    • Predictable
    • Provably correct
    • Performant (e.g. easily parallelizable)

Function Definitions

  • Functions usually accept and execute on different inputs

  • e.g. the mean function wouldn’t be very useful if it didn’t accept a value

    mean(c(1,2,3))
    [1] 2
  • The function must accept or allow you to pass it arguments

    # a basic function signature
    function_name(arg1, arg2) # function accepts 2 arguments

Function Terminology

# a basic function signature
function_name(arg1, arg2) # function accepts 2 arguments
  • arg1 and arg2 are arguments indicating this function accepts two arguments
  • function_name is the name of the function
  • The pattern of arguments it accepts is called the function’s signature.
  • Every function has at least one signature
  • arg1 and arg2 are positional arguments (i.e. order matters)

Most Functions Require Passed Arguments

  • Functions will raise an error if they don’t receive arguments they expect:

    mean() # compute the arithmetic mean, but of what?
    Error in mean.default() : argument "x" is missing, with no default
  • The specific arguments a function accepts is called the function signature

  • You can find details about a function's signature using the documentation, as described next

R Function Documentation

R Function Signatures

  • These are the function signatures for the mean() function:

    mean(x, ...)
    
    ## Default S3 method:
    mean(x, trim = 0, na.rm = FALSE, ...)
  • This function has two signatures

  • The names of the arguments in the signature (e.g. x) are the variable names the function uses in its code to refer to the parameters you pass

  • The named arguments in a function signature are called parameters or formal arguments

R Function Signatures Continued

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
  • The second signature of the mean function introduces two new types of syntax:
    • Default argument values - e.g. trim = 0 named arguments that have a default value if not provided explicitly.
    • Variable arguments - .... This means the mean function can accept arguments that are not explicitly listed in the signature. This syntax is called dynamic dots.
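
Default values and ... can be used in your own functions as well. The function below, scale_values(), is a hypothetical example written for illustration; formals() is the base R function that lists a function's parameters.

```r
# Hypothetical function: multiplies x by factor, which defaults to 2.
# Any extra arguments are captured by ... and only counted here.
scale_values <- function(x, factor = 2, ...) {
  n_extra <- length(list(...))
  if (n_extra > 0) {
    message("ignoring ", n_extra, " extra argument(s)")
  }
  x * factor
}

doubled <- scale_values(c(1, 2, 3))                  # uses the default factor
times_ten <- scale_values(c(1, 2, 3), factor = 10)   # overrides the default
print(doubled)
print(times_ten)

# The parameter names in the signature, including the dots
print(names(formals(scale_values)))
```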

R Function Arguments

Specifying Arguments By Name

  • All function arguments can be specified by name

    # generate 100 normally distributed samples with mean 0 and
    # standard deviation 1
    my_vals <- rnorm(100,mean=0,sd=1)
    mean(my_vals)
    [1] -0.05826857
    mean(x=my_vals)
    [1] -0.05826857
  • To borrow from the Zen of Python:

    “Explicit is better than implicit.”

  • Passing arguments with their names can help avoid bugs!

Beware of ... dynamic dots

  • The ... argument catchall can be very dangerous.

  • It allows you to provide arguments to a function that have no meaning…

  • and R will not raise an error:

    # generates 100 normally distributed samples with mean 0 and
    # standard deviation 1
    my_vals <- rnorm(100,mean=0,sd=1)
    mean(x=my_vals,tirm=0.1)
    [1] -0.05826857
  • Did you spot the mistake?

Beware of ... dynamic dots

# generates 100 normally distributed samples with mean 0 and
# standard deviation 1
my_vals <- rnorm(100,mean=0,sd=1)
mean(x=my_vals,tirm=0.1)
[1] -0.05826857
mean(x=my_vals,trim=0.1)
[1] -0.02139839
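
One possible guard against this kind of typo, sketched under the assumption that you control the function being called: a wrapper without ... in its signature makes R reject argument names it does not recognize. The name strict_mean is hypothetical.

```r
# strict_mean is a hypothetical wrapper; it has no ... in its signature,
# so R rejects any argument name it does not recognize
strict_mean <- function(x, trim = 0, na.rm = FALSE) {
  return(mean(x, trim = trim, na.rm = na.rm))
}

vals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# the misspelled argument now raises an "unused argument" error
# tryCatch() captures the error message so the script keeps running
typo_result <- tryCatch(
  strict_mean(x = vals, tirm = 0.1),
  error = function(e) conditionMessage(e)
)

# the correctly spelled argument works as expected
ok_result <- strict_mean(x = vals, trim = 0.1)
```

The trade-off is that the wrapper can no longer pass extra arguments through to mean().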

DRY: Don’t Repeat Yourself

  • Don’t Repeat Yourself (DRY) principle in software engineering

  • You may find yourself writing the same code multiple times

  • DRY == Encapsulate common code into your own functions

    # 100 normally distributed samples, mean=20, stdev=10
    my_vals <- rnorm(100,mean=20,sd=10)
    # standardize
    my_vals_norm <- (my_vals - mean(my_vals))/sd(my_vals)
    mean(my_vals_norm)
    [1] 0
    sd(my_vals_norm)
    [1] 1
  • If you find yourself copying and pasting code from one part of your script to another, you are repeating yourself!

Writing your own functions

  • R allows you to define your own functions

  • Example that adds two arguments together:

    sum_args <- function(arg1, arg2) {
      # code that does something with arg1, arg2, etc
      some_result <- arg1 + arg2
      return(some_result) # explicit return
    }
    
    sum_args <- function(arg1, arg2) {
      # code that does something with arg1, arg2, etc
      arg1 + arg2 # implicit return is last expression in function
    }
  • You choose function name, arguments, implementation

Custom function definition example

  • Example for standardize() function:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
    
    my_vals <- rnorm(100,mean=20,sd=10)
    my_vals_std <- standardize(my_vals)
    c(mean(my_vals_std), sd(my_vals_std))
    [1] 0 1
    
    my_other_vals <- rnorm(100,mean=40,sd=5)
    my_other_vals_std <- standardize(my_other_vals)
    c(mean(my_other_vals_std), sd(my_other_vals_std))
    [1] 0 1

A Note On Returning Values

  • The return() function is not strictly necessary in R

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • The result of the last line of code executed in the body of a function is returned by default

  • However, to borrow again from the Zen of Python:

    “Explicit is better than implicit.”

  • Being explicit about what a function returns by using the return() function will make your code less error prone and easier to understand
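
An explicit return() is also the way to leave a function early. The sketch below uses a hypothetical variant of standardize() named safe_standardize, and assumes x has at least two values and no missing values.

```r
# safe_standardize is a hypothetical variant of standardize() for illustration
safe_standardize <- function(x) {
  x_sd <- sd(x)
  # early return: avoid dividing by zero when all values are identical
  if (x_sd == 0) {
    return(x - mean(x))
  }
  res <- (x - mean(x)) / x_sd
  return(res)
}

constant_vals <- safe_standardize(c(5, 5, 5))  # takes the early return
varied_vals <- safe_standardize(c(1, 2, 3))    # takes the normal path
```

Both exit points are marked with return(), so a reader can see every place the function can hand back a value.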

Scope

  • In programming, every variable you define has a scope
  • A variable’s scope defines which parts of the code have access to that variable
  • A variable with universal or top-level scope can be accessed anywhere in a program
  • Variables defined inside functions can only be accessed from within that function

Scope continued

x <- 3
multiply_x_by_two <- function() {
  # x is defined outside the function
  # but is inside this function's scope
  y <- x*2
  return(y)
}
  • x and multiply_x_by_two have universal scope
  • y’s scope is limited to within the multiply_x_by_two function

Scope continued

  • The multiply_x_by_two function accesses x from the outer scope

    x <- 3
    multiply_x_by_two <- function() {
      # x is defined outside the function
      # but is inside this function's scope
      y <- x*2
      return(y)
    }
  • In general, having a function access variables defined outside its own scope is very bad practice!

Scope continued

  • Functions should be as self-contained as possible; any values they need should be passed as parameters

  • This is better:

    x <- 3
    multiply_by_two <- function(x) {
      # x is now bound to the formal argument to the function
      # not x in the outer scope
      y <- x*2
      y
    }
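
A short sketch of what this separation means in practice: the x inside the function is the parameter, so changing it leaves the outer x untouched. The input value 10 is arbitrary.

```r
x <- 3
multiply_by_two <- function(x) {
  # this x is the function's parameter; changing it does not
  # affect the x defined outside the function
  x <- x * 2
  return(x)
}

result <- multiply_by_two(10)
result  # 20
x       # still 3
```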

Iteration

  • Iteration refers to stepping sequentially through a set or collection of objects
  • In non-functional languages like Python, C, etc., there are particular control structures that implement iteration, commonly called loops
  • E.g. for and while loops in Python/Java/C
  • R has these features, but is designed to iterate in a functional way
  • Iteration in R should be done in two ways:
    • vectorized operations
    • functional programming with apply()

Warning about loops in R

  • Note that R does have for and while loop support in the language

  • However, these loop structures can have poor performance

  • Preference should generally be given to the functional style of iteration

    How To Avoid For Loops in R

  • If you really, really want to learn how to use for loops in R, read this, but don’t say I didn’t warn you when your code slows to a crawl for unknown reasons:

    R for Data Science - for loops

Vectorized operations

  • R knows how to perform many operations on vectors and matrices as well as individual values

    x <- c(1,2,3,4,5)
    x + 3 # add 3 to every element of vector x
    [1] 4 5 6 7 8
  • Equivalent Python:

    x = [1, 2, 3, 4, 5]
    # list comprehension
    new_x = [x_i+3 for x_i in x]
    # for loop
    new_x = []
    for x_i in x:
      new_x.append(x_i+3)
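
For comparison with the Python loop, here is a minimal sketch of the same operation written as an explicit R for loop, next to the vectorized form. The variable names loop_result and vectorized_result are illustrative.

```r
x <- c(1, 2, 3, 4, 5)

# explicit loop: preallocate the result, then fill it element by element
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- x[i] + 3
}

# vectorized form: one expression, no explicit loop
vectorized_result <- x + 3

identical(loop_result, vectorized_result)
```

Both forms give the same values; the vectorized form is shorter and leaves the iteration to R.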

Vectorized operations on matrices

  • Matrices also support element-wise operations

    x_mat <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
    x_mat + 3 # add 3 to every element of matrix x_mat
         [,1] [,2] [,3]
    [1,]    4    6    8
    [2,]    5    7    9
    # the * operator always means element-wise
    x_mat * x_mat
         [,1] [,2] [,3]
    [1,]    1    9   25
    [2,]    4   16   36

Linear algebra operations

  • R also has syntax for vector-vector, matrix-vector, and matrix-matrix operations

    # the %*% operator stands for matrix multiplication
    x_mat %*% c(1,2,3) # [ 2x3 ] * [ 3 ]
         [,1]
    [1,]   22
    [2,]   28
    # recall t() is the transpose function, making [ 2x3 ] * [ 3x2 ]
    x_mat %*% t(x_mat) # dot product of every pair of rows
         [,1] [,2]
    [1,]   35   44
    [2,]   44   56
  • R is optimized for vectorized computation. If you can cast your iteration as a vector or matrix operation, it is a good idea to do so.
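
As a sketch of what casting an iteration into a matrix multiplication can look like, the example below computes a weighted sum of each row twice: once with a loop and once with %*%. The weights vector is a hypothetical example value.

```r
x_mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
weights <- c(1, 2, 3)  # hypothetical weights, one per column

# iterative version: weighted sum of each row, one row at a time
loop_sums <- numeric(nrow(x_mat))
for (i in seq_len(nrow(x_mat))) {
  loop_sums[i] <- sum(x_mat[i, ] * weights)
}

# matrix version: the same weighted sums in a single operation
# %*% returns a 2x1 matrix, so drop() converts it to a plain vector
matrix_sums <- drop(x_mat %*% weights)
```

Both versions give 22 and 28, matching the matrix-vector product shown above.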

Functional programming

  • R is a functional programming language, designed around the use of functions
  • Every function can be passed as a variable just like those bound to values
  • This means functions can be passed to other functions

Mathematical example

  • Consider a general formulation of vector transformation:

\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]

  • \(\mathbf{x}\) is a vector of real numbers
  • \(t_r(\mathbf{x})\) is a function that takes \(\mathbf{x}\) and returns a scalar (e.g. a central tendency like an arithmetic mean)
  • \(s(\mathbf{x})\) is a function that takes \(\mathbf{x}\) and computes a scalar scaling factor
  • \(\bar{\mathbf{x}}\) is defined as a vector of the same length where each value has had the central value \(t_r(\mathbf{x})\) subtracted and has been divided by the scaling factor \(s(\mathbf{x})\)

Mathematical example continued

\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]

  • There are many different ways to define the central value \(t_r(\mathbf{x})\) of a set of numbers:

  • Similarly for scaling strategies \(s(\mathbf{x})\):

    • standard deviation, rescaling factor (e.g. set data range to be between -1 and 1), scaling to unit length (all values sum to 1), and others

Passing functions as arguments

  • Consider our standardization function from earlier:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • We have hard coded mean() and sd() as our central tendency and scale functions

  • We can pass these functions as parameters instead:

    # note R already has a built in function named "transform"
    my_transform <- function(x, t_r, s) {
      return((x - t_r(x))/s(x))
    }

Passing functions as arguments

  • With the my_transform function:

    # note R already has a built in function named "transform"
    my_transform <- function(x, t_r, s) {
      return((x - t_r(x))/s(x))
    }
  • We can perform Z-score normalization by passing mean and sd as t_r and s, respectively:

    x <- rnorm(100,mean=20,sd=10)
    x_zscore <- my_transform(x, mean, sd) # functions passed as arguments
    mean(x_zscore)
    [1] 0
    sd(x_zscore)
    [1] 1

Generalized Transformation Function

  • Can use any functions for t_r and s so long as they accept a vector of numbers as first argument:

    x <- rnorm(100,mean=20,sd=10)
    x_transform <- my_transform(x, median, sum)
    median(x_transform)
    [1] 0
    # this quantity does not have an a priori known value
    # (or meaning for that matter, it's just an example)
    sum(x_transform)
    [1] 0.013

Custom Transformations

  • We can also write our own functions and pass them to my_transform

  • The following scales the values of x to have a range of \([0,1]\):

    data_range <- function(x) {
      return(max(x) - min(x))
    }
    # my_transform computes: (x - min(x))/(max(x) - min(x))
    x_rescaled <- my_transform(x, min, data_range)
    min(x_rescaled)
    [1] 0
    max(x_rescaled)
    [1] 1
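
A small worked example makes the rescaling easy to verify by hand. It repeats the definitions of my_transform and data_range so that it runs on its own; the input vector small_x is a hypothetical example.

```r
my_transform <- function(x, t_r, s) {
  return((x - t_r(x)) / s(x))
}
data_range <- function(x) {
  return(max(x) - min(x))
}

small_x <- c(2, 4, 6, 10)
# min is 2 and the range is 8, so each value becomes (value - 2) / 8
small_rescaled <- my_transform(small_x, min, data_range)
small_rescaled
```

The result is 0, 0.25, 0.5 and 1: the smallest value maps to 0 and the largest to 1.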

apply() and friends

  • Passing functions as arguments allows us to iterate over collections
  • There are three apply() related functions you should use:
    • lapply(X, FUN) - for when you want a list returned
    • vapply(X, FUN, FUN.VALUE) - for when you want a vector returned
    • apply(X, MARGIN, FUN) - for when X is 2 dimensional (e.g. a matrix)

Note of warning: sapply()

  • sapply() is also available
  • This function automatically simplifies the result, i.e. it “guesses” what type of output you want
  • Can make your code unpredictable!
  • Recommend against using sapply()!
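
A minimal sketch of one way this unpredictability can show up: the return type of sapply() changes when the input happens to be empty, while vapply() always returns the type you asked for. The helper add3 is illustrative.

```r
add3 <- function(i) {
  i + 3
}

# with ordinary input, sapply() returns a numeric vector
sapply_full <- sapply(c(1, 2, 3), add3)

# with empty input, sapply() returns an empty list instead
sapply_empty <- sapply(numeric(0), add3)

# vapply() returns a numeric vector in both cases
vapply_full <- vapply(c(1, 2, 3), add3, numeric(1))
vapply_empty <- vapply(numeric(0), add3, numeric(1))

is.list(sapply_empty)     # TRUE: the return type changed
is.numeric(vapply_empty)  # TRUE: the return type is stable
```

Code that runs after the sapply() call may work on typical data and then fail on an empty subset.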

vapply() for vectors

  • The vapply() function applies a function to each item of a vector or list and returns a vector. It has the following signature:

    vapply(X, FUN, FUN.VALUE, ...)
  • X is a one-dimensional collection of items (i.e. a list or vector)

  • FUN is the name of a function that can accept any of the items in the list

  • FUN.VALUE is an example value to indicate the type of the returned vector (recall all items in a vector must have the same type)

  • vapply() returns a new vector of type typeof(FUN.VALUE) the same length as X where each item has had the function FUN applied to it

vapply() example

  • Consider the vectorized addition from above:

    x <- c(1,2,3,4,5)
    x + 3
    [1] 4 5 6 7 8
  • We can do the equivalent operation with vapply

    x <- c(1,2,3,4,5)
    add3 <- function(i) {
      i + 3
    }
    # below the 0 means we want a numeric vector back
    vapply(x, add3, 0)
    [1] 4 5 6 7 8
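
FUN.VALUE also acts as a check on what FUN returns. The sketch below uses hypothetical gene names as input and shows both a matching and a mismatching FUN.VALUE.

```r
gene_names <- c("TP53", "BRCA1", "MYC")  # hypothetical example values

# nchar() returns an integer, so FUN.VALUE is integer(1)
name_lengths <- vapply(gene_names, nchar, integer(1))

# if FUN returns a different type than FUN.VALUE, vapply() raises an error
# tryCatch() captures the error message so the script keeps running
mismatch <- tryCatch(
  vapply(gene_names, nchar, character(1)),
  error = function(e) conditionMessage(e)
)
```

Because the input is a character vector, the returned vector is named after its elements, so name_lengths can be indexed by gene name.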

Functional operations on 2 dimensions

  • Recall the Z-score transformation defined earlier:

    standardize <- function(x) {
      res <- (x - mean(x))/sd(x)
      return(res)
    }
  • This function operates on a single vector

  • We sometimes want to transform each row or column of a matrix separately

  • The apply() function allows a function to be applied to either rows or columns of a 2 dimensional data structure like a matrix or data frame

The apply() function

  • This is the signature of the apply function, from the RStudio help(apply) page:

    apply(X, MARGIN, FUN, ..., simplify = TRUE)
  • X is a matrix or data frame (i.e. a rectangle of numbers)

  • MARGIN indicates whether the function should be applied on rows (MARGIN=1) or columns (MARGIN=2)

  • FUN is the name of a function that accepts a vector and returns either a vector or a scalar value

  • apply() then executes FUN on each row or column of X and returns the result

apply() example

zscore <- function(x) {
  return((x-mean(x))/sd(x))
}
# construct matrix of 50x100 normally distributed samples
x_mat <- matrix( rnorm(100*50, mean=20, sd=5),
  nrow=50,
  ncol=100
)
# z-transform the columns (MARGIN=2) so that each column has mean,sd of 0,1
x_mat_zscore  <- apply(x_mat, 2, zscore)
# check columns of x_mat_zscore have mean close to zero with apply
x_mat_zscore_means <- apply(x_mat_zscore, 2, mean)
# note: due to machine precision errors, these results will not be
# exactly zero, but are very close
# note: the all() function returns TRUE if all elements are TRUE
# abs() makes sure a large negative mean would also fail the check
all(abs(x_mat_zscore_means)<1e-15)
[1] TRUE
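
A small matrix makes the effect of MARGIN easier to see. This is a sketch with hypothetical names (small_mat, double_values); it also shows that when FUN returns a vector, apply() places each result in a column of the output.

```r
small_mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

row_sums <- apply(small_mat, 1, sum)  # one value per row
col_sums <- apply(small_mat, 2, sum)  # one value per column

double_values <- function(v) {
  return(v * 2)
}

# when FUN returns a vector, each result becomes a COLUMN of the output,
# so applying over rows (MARGIN=1) gives a transposed 3x2 matrix
doubled_rows <- apply(small_mat, 1, double_values)
dim(doubled_rows)

# t() restores the original 2x3 orientation
dim(t(doubled_rows))
```

This is why the z-score example above applies over columns: the output keeps the same 50x100 shape as the input without a transpose.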

lapply() Function

  • lapply() iterates over the elements of X and returns a list with the result

    lapply(X, FUN, ...)
  • Example:

    x <- list(
      feature1=rnorm(100,mean=20,sd=10),
      feature2=rnorm(100,mean=50,sd=5)
    )
    x_zscore <- lapply(x, zscore)
    # check that the means are close to zero
    # lapply() returns a list, so unlist() converts it to a numeric vector
    x_zscore_means <- unlist(lapply(x_zscore, mean))
    all(abs(x_zscore_means) < 1e-15)
    [1] TRUE
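
A deterministic sketch of the same pattern, using hypothetical sample names and values, shows that lapply() keeps the names of the input list.

```r
counts <- list(
  sample1 = c(1, 2, 3),
  sample2 = c(10, 20, 30, 40)
)  # hypothetical example values

# lapply() returns a list with one element per input element,
# keeping the names of the input list
count_means <- lapply(counts, mean)

# results are accessed by name like any other list
count_means$sample1
count_means$sample2

# when every result is a single number, vapply() gives a named vector instead
count_means_vec <- vapply(counts, mean, numeric(1))
```

A list is the right container when the results differ in length or type; a vector is more convenient when each result is a single number.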

Troubleshooting and Debugging

Troubleshooting and Debugging

  • Bugs in code are normal
  • Two kinds of bugs:
    • Syntax errors - code will not run, R will tell you about it
    • Logic errors - code does run, but produces incorrect results
  • You are not a bad programmer if your code has bugs
  • Some bugs can be very difficult to fix, and some are even difficult to find
  • You will spend a substantial amount of time debugging your code in R, especially as you are learning
  • Be patient with yourself and others

Finding questions and answers

  • “It’s always ok to ask for help, but it’s always to your advantage to figure it out yourself.”
  • You will encounter R error and warning messages routinely during development, and not all of them are straightforward to understand.
  • It is important that you learn how to seek the answers to the problems R reports on your own
  • Your colleagues (and instructors!) will thank you for it.

Debugging Strategies

Debugging strategy 1: Google!

  • Copy and paste the error into Google and see what comes back
  • Especially when starting out, the errors you receive have been encountered countless times by others before you
  • Solutions/explanations of them are already out there
  • If you aren’t already familiar with Stack Overflow, you will be very soon

Debugging strategy 2: Make it repeatable

  • When you encounter an error, don’t change anything in your code right away
  • Run the code again to check that you get the same error
  • This may require you to isolate the code with the error in a different setting to make it easier to run
  • If you do get the same error, it is repeatable, or replicable, and you can now try modifying the code in question to see if and how the error changes.
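
Code that uses random numbers, like the rnorm() examples in these slides, gives different values on every run, which can make an error hard to repeat. One common approach, sketched here, is to fix the random seed with set.seed(); the seed value 42 is arbitrary.

```r
# fixing the random seed makes rnorm() return the same values every run
# the seed value 42 is arbitrary
set.seed(42)
first_run <- rnorm(5, mean = 20, sd = 10)

set.seed(42)
second_run <- rnorm(5, mean = 20, sd = 10)

identical(first_run, second_run)
```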

Debugging strategy 3: Where is the bug?

  • Finding out where the bug is can be hard!
  • Most bugs involve multiple lines of code, only a subset of which contains the actual error
  • Sometimes the exact line where the error occurs is obvious, but other times the error is a consequence of a mistaken assumption made earlier in the code.

Debugging strategy 4: Fix it, test it

  • When you have identified the specific issue causing the bug, modify the code so it produces the correct result
  • Then rigorously test your fix to make sure it is correct
  • Sometimes making one change to code causes side effects elsewhere in your code in ways that are difficult to predict
  • Ideally, you have already written unit tests that explicitly test parts of your code
  • If not, you will need to use other means of convincing yourself that your fix worked.
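
A minimal sketch of what an explicit test could look like, using only base R's stopifnot(). The function test_standardize is hypothetical, and a dedicated testing framework would offer more than this.

```r
standardize <- function(x) {
  res <- (x - mean(x)) / sd(x)
  return(res)
}

# each check stops with an error if its condition is not TRUE
test_standardize <- function() {
  res <- standardize(c(1, 2, 3, 4, 5))
  stopifnot(length(res) == 5)          # output has the same length as input
  stopifnot(abs(mean(res)) < 1e-12)    # mean is zero up to rounding error
  stopifnot(abs(sd(res) - 1) < 1e-12)  # standard deviation is one
  return(TRUE)
}

test_result <- test_standardize()
```

Rerunning test_standardize() after every change to standardize() is a quick way to catch side effects of a fix.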

The Debugging Loop

  • The most basic debugging loop is:

    1. Write code
    2. Run code
    3. Print out results
    4. Compare to expected result
    5. Go to 1

Debugging tools in RStudio

  • In RStudio, the Environment Inspector in the top right of the interface makes inspecting the current values of your variables very easy
  • You can also easily execute lines of code from your script in the console (bottom left in the default layout)
  • The str() function can be helpful when in an interpreter and not in RStudio
  • RStudio has many more debugging tools you can use
  • Check out the section on Debugging in Hadley Wickham’s Advanced R book
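
As a short sketch of inspecting an object without the Environment Inspector, str() prints a compact summary of its structure. The experiment list below holds hypothetical example values.

```r
experiment <- list(
  gene = "TP53",
  counts = c(10, 25, 3),
  significant = TRUE
)  # hypothetical example values

# str() prints a compact summary: the type and first values of each element
str(experiment)

# class() and length() answer narrower questions about an object
class(experiment$counts)
length(experiment$counts)
```

Checking the type and shape of a variable just before the line that fails is often enough to reveal a mistaken assumption.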

Coding Style and Conventions

  • Common worries I get from students:
    • “Is my code terrible?”
    • “How do I write good code?”
  • There is no gold standard for what makes code “good”
  • BUT there are some questions you can ask of your code as a guide

Is my code correct?

  • Does it produce the desired output?
  • It can be harder to be sure of this than you might think, especially as your code becomes more complicated
  • Simple trial and error is an effective first approach
  • A more reliable albeit time- and thought-intensive strategy is to write explicit tests for your code and run them regularly
  • The homework assignments use explicit tests

Does my code follow the DRY principle?

  • Don’t Repeat Yourself (DRY) is a powerful and helpful strategy to make your code more reliable
  • This typically involves identifying common patterns in your code and moving them to functions or objects

Did I choose concise but descriptive variable and function names?

  • Variable and function names should be descriptive when necessary and not too long
  • Try to put yourself in the shoes of someone who is reading your code for the first time
  • Can you figure out what it does?
  • Better yet, offer to buy a friend food/a beverage in return for them looking at it!

Coding Convention Consistency

  • Did I use indentation and naming conventions consistently throughout my code?
  • Consistently formatted code is much easier to read (and possibly understand) than inconsistent code

Poor Consistency Example

  • Consider the code:

    calcVal <- function(x, a, arg=2) { return(sum(x*a)**2)}
    calc_val_2 <- function(x, a, b, arg) {
    res <- sum(b+a*x)**arg
    return(res)}

Issues With This Code

This code is inconsistent in several ways:

  • naming conventions - calcVal camel case, calc_val_2 snake case
  • new lines and whitespace - calcVal is all on one line, calc_val_2 is on multiple lines
  • unhelpful indentation - calc_val_2 function body not indented, close curly brace is appended to the last line
  • unhelpful function and argument names - the function/parameter names don’t describe what they do/mean
  • unused function arguments - the arg argument in calcVal isn’t used anywhere in the function
  • the two functions appear to do something very similar and could be made simpler using a default argument

Better Consistency Example

A more consistent version of this code might look like:

exponent_product <- function(x, a, offset=0, arg=2) {
  return(sum(offset+a*x)**arg)
}

This code is much cleaner, more consistent, and easier to read.
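
A quick check, sketched with arbitrary input values, that the consolidated function reproduces both of the original functions:

```r
# the two original functions
calcVal <- function(x, a, arg=2) { return(sum(x*a)**2)}
calc_val_2 <- function(x, a, b, arg) {
res <- sum(b+a*x)**arg
return(res)}

# the consolidated function
exponent_product <- function(x, a, offset=0, arg=2) {
  return(sum(offset+a*x)**arg)
}

x <- c(1, 2, 3)
# the defaults reproduce calcVal
same_as_calcVal <- exponent_product(x, 2) == calcVal(x, 2)
# an explicit offset and arg reproduce calc_val_2
same_as_calc_val_2 <- exponent_product(x, 2, offset=1, arg=3) == calc_val_2(x, 2, 1, 3)
```

Both comparisons are TRUE, so the default arguments let one function cover both use cases.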

Did I write comments, especially when what the code does is not obvious?

  • Sometimes what a piece of code does is obvious from looking at it:

    x <- x + 1
  • Consider explaining why a piece of code does what it does:

    # add 1 as a pseudocount
    x <- x + 1

How easy would it be for someone else to understand my code?

  • If someone else who has never seen my code before is asked to run and understand it…
  • How easy would it be for them to do so?
  • You will encounter situations where you need to figure out what you yourself were thinking when you wrote a piece of code
  • Endeavor to make future you proud of current you!

Is my code easy to maintain/change?

  • Well written code is easier to understand
  • Code that is easy to understand is easier to modify than abstruse code
  • You will gain a sense for this over time