A.2 Week 2

  • Assignment 1
  • Data Wrangle: Regular Expressions
  • Bioinfo: R in Biology
  • Bioinfo: Types of Biological Data
  • Bioinfo: Bioconductor
  • Bioinfo: Gene Identifiers
  • Bioinfo: Mapping Between Identifier Systems
  • Bioinfo: Mapping Homologs
  • Data Wrangle: Relational Data
  • Data Viz: Grammar of Graphics
  • Data Viz: Plotting One Dimension
  • Data Viz: Visualizing Distributions
  • R Prog: Unit Testing