Week 3: ChIPseq
Section Links
Create and use a single Jupyter notebook for any images, reports or written discussion
Download a HG38 gene BED from UCSC table browser
Week 3 Overview
In week 3, you will be using the UCSC table browser to obtain a BED file containing the start and end positions of every gene in the HG38 human reference genome. This will enable you to plot the signal coverage from your samples in relation to genic structure (Transcription Start Site and Transcription Termination Site). You will also be performing motif enrichment to determine what motifs appear to be enriched in the binding sites detected in your peaks.
Objectives
Extract the TSS and TTS for every gene in the hg38 reference in BED format using the UCSC table browser
Use the deeptools utilities computeMatrix and plotProfile, the UCSC BED, and your IP sample bigwigs to create a signal intensity plot
Perform motif enrichment on your reproducible and filtered peaks using HOMER
Containers for Project 2
FastQC: ghcr.io/bf528/fastqc:latest
multiQC: ghcr.io/bf528/multiqc:latest
bowtie2: ghcr.io/bf528/bowtie2:latest
deeptools: ghcr.io/bf528/deeptools:latest
trimmomatic: ghcr.io/bf528/trimmomatic:latest
samtools: ghcr.io/bf528/samtools:latest
macs3: ghcr.io/bf528/macs3:latest
bedtools: ghcr.io/bf528/bedtools:latest
homer: ghcr.io/bf528/homer:latest
homer/samtools: ghcr.io/bf528/homer_samtools:latest
Create and use a single Jupyter notebook for any images, reports or written discussion
Please create a single jupyter notebook (.ipynb) that contains all of the
requested figures, images or discussion requested. As in Project 0, please
create a dedicated .yml file that specifies any needed packages (including
an up-to-date installation of ipykernel
). You may refer back to project 0
for how this setup works: https://bu-bioinfo.github.io/bf528/week-4-genome-analytics.html#make-a-conda-environment-with-pycirclize-and-ipykernel
This will enable your notebook to utilize the conda environment described in that yml file and ensure that your analysis is done in a reproducible and potentially portable manner.
Download a HG38 gene BED from UCSC table browser
We will be creating a plot which will provide a quick visualization of the average signal across the gene body of all genes. We will scale every gene to a uniform size and display the counts of alignments falling in the annotated regions of the gene. This will allow us to quickly visualize at a very high-level where we see the majority of binding for our factor of interest.
To do this, we have already generated our bigWig files, but we will require the genomic coordinates of all of the genes in the reference genome. We will be using the UCSC table browser to extract out this information.
Navigate to the UCSC Table Browser, use the following settings to extract a BED file listing the TSS/TTS locations for every gene in the reference genome:

On the following page, do not change any options and you will be prompted to download a BED file containing the requested information.
1. Put this BED file into your `refs/` working directory on SCC.
I have also provided this bed file in the materials/ directory for project 2.
This is a simple use case, but the UCSC table browser and UCSC genome browser are incredibly powerful tools and repositories for genome-wide sequencing data.
Generating a signal intensity plot for all human genes using computeMatrix and plotProfile for IP samples
Now that we have our bigWig files (count of reads falling into bins across the genome) and the BED file of the start and end position of all of the genes in the hg38 reference, we will calculate and visualize the signal falling into these annotated regions.
Use the
computeMatrix
utility in deeptools, your bigWig files, and the BED file you downloaded to generate a matrix file containing the counts of reads falling into the regions in the bed files.Ensure that you use the scale-regions mode, and you specify the options to add 2000bp of padding to both the start and end site.
We are not interested in visualizing the input samples (which should represent random background noise), use an appropriate nextflow operator to ensure this is only done for the IP samples.
Use the outputs of
computeMatrix
for theplotProfile
function and generate a simple visualization of the read counts from the IP samples across the body of genes from the hg38 reference.In your created notebook, please write a short paragraph describing the figure including details regarding how it was generated and what it appears to indicate about your data.
Finding enriched motifs in ChIP-seq peaks
We will be using the single set of reproducible and filtered peaks from last week to search for enriched motifs in our peaks. Many DNA binding proteins have been found to have higher affinity for specific DNA binding sites with recurring sequence and pattern. These motifs may reveal key information about gene regulation by allowing for determination of what factors are binding in peaks. Remember that many DNA binding proteins bind as part of much larger multi-protein complexes that work in tandem to regulate gene expression. We will be using the HOMER tool to perform motif enrichment, you may find the manual here
Use the
findMotifsGenome.pl
utility in homer to perform motif enrichment analysis on your set of reproducible and filtered peaks.In the notebook you create, please make a table or take a screenshot of the top ten enriched motifs that are found from the motif analysis.
Write a short paragraph discussing the top results you found and what you believe the results indicate.
Week 3 Tasks Summary
Use the UCSC table browser to generate a BED file containing the TSS and TTS positions of every gene in the HG38 reference
-
Create nextflow modules and update your script to perform the following tasks:
Runs computeMatrix to generate a matrix containing the read coverage relative to the gene positions in the BED file
Uses plotProfile to visualize the results generated by computeMatrix for each of the two IP samples
Utilizes HOMER to perform basic motif finding on your reproducible and filtered peaks