Week 3: ChIPseq

Peak calling

Now that we have our aligned reads and performed basic quality control, we are going to proceed to peak calling using HOMER. This utility requires us to first convert the BAM files to a HOMER specific format, which is required prior to using HOMER to run downstream utilities.

1. Generate a snakemake rule that runs the makeTagDir command in HOMER for 
  each sample. You will have a separate tag directory for each. 
    - Leave every option at default parameters.

As denoted by the name, this utility creates a directory containing a multitude of output files. Please refer to the snakemake documentation on how to properly handle using directories as expected outputs.

We are now ready to perform peak calling, which will look for areas of significant enrichment in the localization of our alignments. We will be performing peak calling using the input as our background enrichment control. Please read the documentation for the findPeaks command (http://homer.ucsd.edu/homer/ngs/peaks.html) and perform peak calling on both of our replicate experiments.

1. Generate a snakemake rule that runs findPeaks on your samples
    - You will need this rule to run twice on each set of paired replicates
    - Remember that Runx1 is a transcription factor that binds in relatively
    “narrow” patterns.

Converting homer-specific output to BED format

The output of findPeaks is in a homer-specific format, and for ease of use, we will convert this .txt homer peak file to a standard BED formatted file.

1. Generate a snakemake rule that use the pos2bed utility to convert the
  output to a BED format

Determining a set of reproducible peaks using bedtools

Each paired experiment (IP and INP) from each replicate will have generated a set of peaks that represent Runx1 binding locations in the genome. In general, there are two main ways that “reproducible” peaks are chosen for ChIP-seq experiments: 1. Irreproducible Discovery Rate (IDR), and 2. Intersections / Unions of peaks. You will use the latter strategy and Bedtools to produce your own set of “reproducible peaks” from your two peak files.

1. Using only bedtools, generate a single list of “reproducible peaks”. You
  may use whatever strategy you like and we will ask you to justify your
  choice later.

Removing signal-artifact regions from our peak lists

Experimentally, it has been shown that there are regions of the genome that are nearly always present in ChIPseq experiments at high levels. These regions are considered to be noise and signal-artifacts that are not representative of the target enrichment. It is generally advised to remove any peaks that are found in these regions as they likely represent erroneous signal.

1. Create a snakemake rule that uses bedtools to filter out any of your
  reproducible peaks that fall into these blacklisted regions and output the
  remaining peaks to a new BED file

Annotating our peaks to their nearest genomic feature

With our now filtered list of reproducible peaks, we are going to perform a simple annotation that assigns these peaks to their nearest genomic feature. This will allow us to glean some information of which genomic regions they fall into (promoters, introns, etc.) and what genes they are in close proximity to.

Please note that you may need to specify an older version of the human reference in the options for this tool (hg18/hg19). If you want to use the latest version (hg38), you may need to have homer install it directly. There are some instructions here on how to install hg38. (http://homer.ucsd.edu/homer/introduction/configure.html) or if you try to use this option, run annotatePeaks.pl, and it returns an error, it will include the command to install genomes in the error message.

Because of our conda setup, the paths you need to install new genomes into HOMER is specific to each of your environments.

1. Use the annotatePeaks.pl utility in HOMER to annotate your list of
  reproducible peaks. 
    - Provide the decompressed GTF file. 

Performing motif finding to look for enrichment of known binding sequences

Using this same set of reproducible peaks, we are going to perform motif finding to look for enriched motifs found in our peak locations. For simplicity, we will be performing this step using HOMER. As we are using the hg39 build of the human reference genome, you should supply the path to the FASTA as specified in the instructions for running findMotifsGenome.pl.

1. Run findMotifsGenome.pl on your list of reproducible peaks to generate a
  motif enrichment analysis
    - Do not use any of the provided HOMER options for the genome. You must
      first decompress the reference genome fasta and provide this in the
      HOMER command. 
      

Week 3 Tasks

  • Generated a snakemake rule that runs the makeTagDir utility on each of your sorted BAM files

  • Create a snakemake rule that runs HOMER findPeaks on your samples using the tag directories generated previously. This should run twice, once for each set of paired experiments (rep1 and rep2)

  • Created a snakemake rule that converts both of your findPeaks output files to BED format

  • Created a snakemake rule that uses only bedtools to generate a new BED file containing the “reproducible” peaks

  • Create a snakemake rule that uses bedtools to filter out any “reproducible” peaks falling into blacklisted regions

  • Create a snakemake rule that annotates the “reproducible” peaks to their nearest genomic feature

  • Create a snakemake rule that performs motif finding on your list of filtered, reproducible peaks

CHALLENGE

The authors of the original paper use the MEME suite of tools to perform motif finding. However, they do not specify the parameters used in this analysis and the input to MEME is not a list of peaks (which is our current output). For this optional challenge, instead of performing motif finding in HOMER, use MEME-ChIP as they did in the paper to discover enriched motifs.

In general, you will have to figure out how to do the following:

  • Install the Meme suite into your conda environment

  • Extract chromosome sizes from the human gencode reference

  • Determine the average size of your peaks

  • Use your chromosome sizes, the average size of your peaks, and BedTools to make a BED file containing all of your peaks expanded to around ~500bp in size.

  • Extract the DNA sequences from the regions in your BED file from a bgzipped genomic FASTA file for the gencode reference.

  • Run MEME-ChIP using your FASTA file of DNA sequences covered by your peaks