Week 2: ChIPseq

Aligning reads to the reference genome

Now that we have performed basic quality control on our reads, we are going to align them to the human reference genome using bowtie2. Using the index you generated last week, align each of the FASTQ files to the human reference genome. By default, bowtie2 outputs in SAM format and typically, it is more convenient and friendly to your disk space to work with BAM files as they are compressed. Nearly all utilities that operate on SAM files can also interchangeably work with BAM files

1. Align your reads to the human reference genome using the index you
  created last week 
    - Leave all other bowtie2 parameters at default
    - In the same command, automatically convert the bowtie2 output to a BAM
      file using samtools

Sorting and indexing our alignment files

For many downstream applications, most notably visualization, the alignments in BAM files need to be sorted and indexed, in order to allow for fast and random access without searching the entire file. We will now sort, and index all of our generated BAM files using samtools. Read the manual pages for the samtools sort and samtools index commands for how to use them.

1. Generate a snakemake rule that sorts each of your BAM files 
2. Generate another snakemake rule that creates a BAM index for your sorted 
  files

Quality control with flagstats and MultiQC

As before, we will perform a quick quality check on our alignment using samtools and then collect all of the QC information we have generated so far using MultiQC.

1. Create a snakemake rule that runs samtools flagstat on all of your
  aligned files
2. Create a snakemake rule that runs MultiQC on your project directory

Generating bigwig files and comparing “similarity” between samples

Now that we have performed basic quality control, we are going to convert our aligned BAM files to a more tractable format for analysis and visualization. We will be using the deeptools package to convert the information in our BAM files to a bigWig format, which is a compressed binary indexed file that is typically used to store quantitative data associated with genomic regions. In essence, we can divide every chromosome into bins of equal and known size and count the number of alignments falling into each region. This enables a simple but intuitive plot where we can visualize the entire length of any arbitrary region in the genome and the number of alignments found in each region. This enables us to look at the “signal” coming from various regions of the genome.

This is often used in ChIPseq experiments to visualize where our factor of interest is binding. Regions with higher alignment counts, if the underlying molecular biology experiment worked, should represent regions where our factor is most likely binding. Understanding where in the genome a factor is binding is what enables the determination of what genes that factor is regulating, and the mechanisms by which it functions.

We can also use this information to make some general quantitative comparisons between our samples since they are all unified in format (i.e. each bigWig representation of a sample consists of the entire genome split into the same sized bins and with an associated count).

We will use this property to perform a Pearson correlation on the signal generated from each of the samples. In an ideal world, if the experiment worked exactly as intended, and both biological replicates were correctly capturing Runx1 binding sites, their signal represented in bigWig format should be exactly the same or at the very least, highly correlated. Given our knowledge of what the input control represents, we could also assume that our inputs should be substantially different than our IP samples.

1. Using the bamcoverage utility in deeptools, generate a bigWig file for
  each sorted BAM
    - Leave all parameters at default
  
2. Once you have produced all of the bigwig files, use the
  multiBigWigSummary utility and the plotCorrelation utility to produce a
  clustered heatmap of the Pearson correlation values between all the samples.
  Refer to their respective manual pages here for help:
  https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html

Week 2 Tasks

  • Align your reads using bowtie2 to your previously created bowtie2 genome index for the full human reference genome. Make sure you are automatically converting the output of bowtie2 to a BAM file using | and samtools.

  • Create a snakemake rule that sorts each of your BAM files

  • Create a snakemake rule that generates a BAM index file for your sorted BAM files

  • Create a snakemake rule that runs samtools flagstat on each of your samples

  • Create a snakemake rule that runs MultiQC to aggregate all of the outputs from QC utilities

  • Create a snakemake rule that generates a bigWig file from each of your sorted BAM files

  • Create a snakemake rule that runs multiBigwig summary to generate a single matrix containing all of the information from your bigwigs

  • Create a snakemake rule that runs plotCorrelation to generate a figure displaying the pearson correlation values between all of our samples