Week 1: RNAseq

Read the paper

This analysis is based on the data published in this study: O’Meara et al. Transcriptional Reversion of Cardiac Myocyte Fate During Mammalian Cardiac Regeneration. Circ Res. Feb 2015. PMID: 25477501

Please read the paper to get a sense of the overall hypotheses, goals, and conclusions presented. Keep in mind that we are only analyzing the data from the first figure, and only the dataset involving the samples from p0, p4, p7, and adult mouse heart ventricle cells.

We will also be deviating from the published methods in favor of using more up-to-date strategies and tools. However, the workflow you will develop should still capture and recreate the same core biological signals observed in the original work.

Create a working directory for project 1

Please sign on to the SCC and navigate to our class project directory located at: /projectnb/bf528/students/your_username/

Accept the github classroom link and clone the assignment to your working directory - https://classroom.github.com/a/WXj2OrjA

The structure and setup of the project

After you have accepted the github classroom link, clone the repo into your student directory (/projectnb/bf528/students/your_username/).

Within /projectnb/bf528/students/your_username/project_1/, you should see the following structure (directories are indicated by names ending with “/”):

|-- results/
    |-- Any processed data output, downloaded file, etc. should go in here, 
        remember to provide this path as an output for any tasks
    
|-- samples/
    |-- Copy the provided fastq files here, remember to provide this path 
        when referring to your input files
    
|-- docs/
    |-- This will contain a copy of each week's instructions and any
       other documents necessary for the required tasks
       
|-- week1.snake
    |-- We have provided you a template for the first week's snakefile
    
|-- week2.snake
    |-- A template for the week 2 snakefile
    
|-- week3.snake
    |-- A template for the week 3 snakefile

|-- week(1, 2, 3).Rmd
    |-- We have provided you with a Rmd notebook for you to quickly
        write your preliminary methods and results for each week
        
|-- differential_expression.Rmd
    |-- For the end of week 3 and the tasks for week 4, you will need
        to perform a series of operations in R. Please use this provided 
        notebook to do these. 

Create a conda environment for this project

One of the core advantages of conda is that it allows you to create multiple, independent computing environments, each with its own set of installed packages, while conda handles the dependencies. It is generally good practice to create a distinct, self-contained conda environment for each individual dataset / experiment. If you did not go through assignment 0, please follow the instructions here to set up conda: https://www.bu.edu/tech/support/research/software-and-programming/common-languages/python/python-software/miniconda-modules/

1. Please create a conda environment for this project. You may name it
whatever is easiest for you to remember.

Take care to activate your conda environment before working on any aspect of the project. The tools installed in an environment are only accessible while that environment is active.

2. Using the conda install command, please install
the following packages: snakemake, pandas, fastqc, star, samtools, multiqc,
verse, bioconductor-deseq2, bioconductor-fgsea, jupyterlab

Locating the data

For this first project, we have provided you with small subsets of the original data files. You should copy these files to the samples/ folder in the working directory you cloned. For your convenience, we have renamed the files to be descriptive of the samples they represent, following the pattern: {timepoint}{rep}subsample_{readpair}.fastq.gz (e.g. ADrep1subsample_R1.fastq.gz)

The files are located here: /projectnb/bf528/materials/project_1_rnaseq/subsampled_files and there are 16 total files (8 samples, paired end reads).

1. Using `cp`, make all of these files available in your samples/ directory       
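
After copying, it can be worth confirming that every sample arrived with both of its read files. The following is a minimal sketch, not a required part of the assignment. It assumes the read pair labels are R1 and R2 and that replicate labels look like rep1, rep2 (only ADrep1 and R1 appear in the example above); the function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Pattern from the instructions: {timepoint}{rep}subsample_{readpair}.fastq.gz
# Assumption: replicates are written "rep<digit>" and read pairs are R1 / R2.
FASTQ_RE = re.compile(r"^(?P<sample>.+rep\d+)subsample_(?P<read>R[12])\.fastq\.gz$")


def group_by_sample(samples_dir):
    """Map each sample name (e.g. 'ADrep1') to the set of read pairs found for it."""
    reads = defaultdict(set)
    for path in Path(samples_dir).glob("*.fastq.gz"):
        match = FASTQ_RE.match(path.name)
        if match:
            reads[match["sample"]].add(match["read"])
    return dict(reads)


def check_samples(samples_dir, n_samples=8):
    """Return a list of readable problems; an empty list means nothing was flagged."""
    reads = group_by_sample(samples_dir)
    problems = [
        f"{sample}: found only {sorted(found)}"
        for sample, found in sorted(reads.items())
        if found != {"R1", "R2"}
    ]
    if len(reads) != n_samples:
        problems.append(f"expected {n_samples} samples, found {len(reads)}")
    return problems
```

Calling check_samples("samples") from the project root should return an empty list once all 16 files (8 samples, two reads each) are in place.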

Performing Quality Control

At this point, you should have:

  • Set up a directory for this project by accepting the github classroom link
  • Copied the files to your working directory
  • Created a new conda environment containing the requested packages
  • Activated your environment (do this before doing any work)

We will begin by performing quality control on the FASTQ files generated from the experiment. FastQC is a bioinformatics software tool that calculates various quality metrics from a FASTQ file and generates descriptive graphics for them. We will use this tool to quickly check the basic quality of the sequencing in this experiment.

Your first task is to develop the provided snakemake workflow (week1.snake) so that it runs FastQC on each of the 16 project files. You will have to make appropriate use of the expand() function to construct a working snakefile that performs these steps.

You may find it helpful to utilize the following command: snakemake -s {insert_your_snakefile_name} --dryrun -p

This command will instruct snakemake to print out a summary of what it plans to do based on how your snakefile is written. This command is useful for quickly troubleshooting if you have any simple errors in syntax, wildcard matching, etc. Please note that it will not actually run any of the code in your rule directives and thus will not allow you to determine if there are any issues or errors in your actual shell or python arguments / commands.

1. Set up your snakemake workflow to run FastQC on each of the 16 FASTQ
files (8 samples)
  - Remember all produced files / data should be output to your results/
    directory
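
To make the role of expand() concrete, here is a minimal sketch of what it does by default: it fills a pattern with every combination of the wildcard values it is given. The function expand_like is a hypothetical plain-Python stand-in written so that it runs without snakemake; in a snakefile you would call expand(pattern, sample=..., read=...) directly. The sample and read labels below are assumptions based on the single example filename given earlier.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    A plain-Python stand-in for the default behaviour of snakemake's expand().
    """
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[name] for name in names))
    ]


# Assumed labels: only "ADrep1" and "R1" appear in the instructions.
SAMPLES = ["ADrep1", "ADrep2"]
READS = ["R1", "R2"]

targets = expand_like(
    "samples/{sample}subsample_{read}.fastq.gz", sample=SAMPLES, read=READS
)
for target in targets:
    print(target)
```

Two samples and two reads give four paths. With all 8 samples the same call yields the 16 input files, and the same idea with a results/ pattern gives the list of files to request in rule all.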

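By default, FastQC will create a .html and a .zip file for each FASTQ file. It names these files after the input file with the .fastq.gz extension removed, following the pattern: {original_filename}_fastqc.html and {original_filename}_fastqc.zip (e.g. ADrep1subsample_R1.fastq.gz produces ADrep1subsample_R1_fastqc.html and ADrep1subsample_R1_fastqc.zip)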
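
Since snakemake needs to know the exact output paths of each rule, it helps to work out these names ahead of time. Below is a minimal sketch, assuming FastQC drops the .gz and .fastq (or .fq) extensions before appending _fastqc, and assuming the reports are written to results/ (for example with FastQC's -o option). The helper name is hypothetical; please check the result against the files FastQC actually produces.

```python
from pathlib import Path


def fastqc_outputs(fastq, outdir="results"):
    """Return the (.html, .zip) paths FastQC is expected to write for one FASTQ file."""
    name = Path(fastq).name
    # Assumption: the compression and FASTQ extensions are removed, in that order
    for suffix in (".gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    base = Path(outdir) / f"{name}_fastqc"
    return f"{base}.html", f"{base}.zip"


html, archive = fastqc_outputs("samples/ADrep1subsample_R1.fastq.gz")
print(html)
print(archive)
```

Note that FastQC expects the output directory to exist already, so results/ should be present before the rule runs.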

You may have noticed that FastQC produces separate outputs for each file. Many other QC programs also operate at the file level and analyze a single file at a time. This dataset is on the smaller side, but larger experiments can encompass hundreds of samples. Individually inspecting outputs from all of these utilities would quickly become cumbersome in such situations.

We are going to utilize the MultiQC utility, which recognizes the outputs from many standard bioinformatics tools and aggregates them into a single, well-formatted and visually attractive report.

1. Read the MultiQC manual to learn more about its operation and create a
snakemake rule that runs MultiQC.
  - Remember to point MultiQC at the directory containing your FastQC
    outputs (your results/ directory, if you followed the step above) and
    to output its report to the results/ directory
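
MultiQC should only run once every FastQC report exists, which in snakemake is expressed by listing the FastQC outputs as the input of the MultiQC rule. The sketch below illustrates that dependency in plain Python. It assumes the FastQC reports were written to results/, so that is the directory MultiQC scans; the helper names and the list of expected archives are hypothetical.

```python
import shlex
from pathlib import Path


def multiqc_command(scan_dir="results", outdir="results"):
    """Build the MultiQC call: scan `scan_dir` for tool outputs, write the report to `outdir`."""
    return ["multiqc", scan_dir, "--outdir", outdir]


def missing_reports(expected_zips):
    """Return the FastQC archives that do not exist yet."""
    return [z for z in expected_zips if not Path(z).is_file()]


# Hypothetical list; in the workflow this would be the FastQC rule's outputs.
expected = [
    "results/ADrep1subsample_R1_fastqc.zip",
    "results/ADrep1subsample_R2_fastqc.zip",
]

waiting = missing_reports(expected)
if waiting:
    print("still waiting on:", *waiting)
else:
    print("ready to run:", shlex.join(multiqc_command()))
```

In the snakefile itself you do not need to write this check: declaring the FastQC outputs as inputs to the MultiQC rule makes snakemake schedule the rules in the right order.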

You should have three rules: your rule all, the rule that runs FastQC on each of the files, and the rule that runs MultiQC.

At the end of the first week, you should make sure to have done the following:

  • Read the paper and understand the goals and hypotheses of the original study. Focus in particular on the analyses pertaining to the dataset we are examining and make note of the interpretations and conclusions drawn by the authors.
  • Accept the github classroom link to set up the project in your working directory
  • Make a new conda environment with the specified packages for this experiment
  • Copy the subset files to your samples/ directory
  • Complete the provided snakemake workflow (week1.snake) that runs FastQC on all 16 files
  • Generate a snakemake rule that runs MultiQC after all FastQC reports have been completed
  • Write a brief methods and results section for this week’s tasks in the provided markdown file (week1.Rmd)
  • Answer any conceptual questions (also contained in the Rmd) - many of these questions do not have a single right answer and are designed to allow you to make hypotheses and think creatively about research questions.

CHALLENGE

If you are already familiar with high performance computing environments, you no doubt understand that you should not run any intensive computational tasks on the head node.

Please only attempt this section if you are proficient and knowledgeable in the use of qsub to properly submit batch jobs with appropriate resources requested.

The original data files are located in our single project directory, /projectnb/bf528/materials/project_1_rnaseq/full_files/

Reformat your snakefile to operate on the full data files; their names will not include “subsample” but otherwise follow exactly the same naming conventions mentioned previously.

  1. Using your snakemake workflow, run FastQC on all 16 original data files
  2. Add a rule that also runs MultiQC after completion of all FastQC analyses

The snakemake command you use to run the workflow must include the --cluster option and appropriate qsub arguments to avoid running these jobs on the head node. We will be discussing the appropriate usage of qsub and --cluster in the upcoming weeks.