Week 1: ChIPseq
Repo Setup
I have included a .gitignore file with every repo that instructs git to
ignore any non-code files. Do not alter or delete this file. As a reminder,
you should not push any data to GitHub, and you should always avoid adding
any large files. Only push code and other simple text files. It's tempting to
simply perform a `git add *`, but I highly encourage you to be more granular
and intentional by explicitly listing the files in your `git add` commands.
The GitHub Classroom link is here: https://classroom.github.com/a/wWocEgqL
The structure of your templated directory should resemble the following:
|-- results/
|-- Any processed data output, downloaded file, etc. should go in here.
Remember to provide this path as an output for any tasks. I suggest
you employ a "flat" directory strategy and do not create any nested
directories inside results/, but you may do whatever works best for
you.
|-- samples/
|-- The provided or downloaded fastq files should go here; remember to
provide this path when referring to your input files. You should
have both the subsampled and full data in this directory.
|-- envs/
|-- Contained within are the .yml files that describe the conda
environments which will now be created per rule. You should not at
any time alter or edit these files for any reason.
|-- refs/
|-- week(1, 2, 3, 4)_dag|rulegraph.png and complete_dag|rulegraph.png
|-- We've discussed that workflows can be represented visually in the
form of a DAG (directed acyclic graph). These diagrams represent the
tasks and dependencies your completed workflow for each week should
perform and follow. Use these as a reference to visually remind
yourself of the goals of each week's workflow. We will discuss in
class how you can automatically generate these from your own
snakefiles; a sketch of the command appears after this listing.
|-- docs/
|-- week(1, 2, 3, 4)_methods.Rmd
|-- We have provided an Rmd notebook for you to quickly write your
preliminary methods for each week and display any preliminary
results generated.
|-- week1.snake
|-- We have provided you a template for the first week's snakefile. Do
not alter the `conda` directive in each rule at any time.
|-- week2.snake
|-- A template for the week 2 snakefile. Do not alter the `conda`
directive in each rule at any time.
|-- week3.snake
|-- A template for the week 3 snakefile. Do not alter the `conda`
directive in each rule at any time.
|-- week4.snake
|-- A template for the week 4 snakefile. Do not alter the `conda`
directive in each rule at any time.
|-- base_env.yml
|-- This is the yml file you should use to create the conda environment
for project 2.
|-- sample_sheet.csv
|-- A CSV file containing information regarding the samples / data files
for this experiment. There is a column for FTP links, which will be
used to download both the subsampled and full files.
|-- .gitignore
|-- A file that instructs git which files not to track. I have
configured it so that only snakefiles and other plain-text files
are tracked. Do not alter or delete this file at any time.
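As a preview of the DAG generation we will cover in class, snakemake can emit
these graphs itself. A minimal sketch, assuming your snakefile defines a final
target (e.g. a rule all) and that Graphviz's dot is available in your
environment:

```
# one node per job: the full DAG for the week 1 workflow
snakemake -s week1.snake --dag | dot -Tpng > week1_dag.png

# one node per rule: the more compact rulegraph
snakemake -s week1.snake --rulegraph | dot -Tpng > week1_rulegraph.png
```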
Initial Setup
Please thoroughly read the landing page, Project 2 - ChIPseq.
I have made a large structural change to how we will be utilizing conda environments with our workflows.
An alternative to defining lists of samples in each rule
For the first project, we manually created python lists containing the name
patterns used to drive each snakemake workflow. For this project, we will
employ an alternative method that stores this information in a CSV file.
Since we are developing this workflow across separate snakefiles, we will
still need to read in this information each time, but later, when we connect
our separate workflows into one, we will only need to do this once.
Removing the sample information from our snakefiles / workflows also helps to generalize our workflows and would theoretically enable us to run the same workflow on a different set of samples by simply swapping in a different CSV.
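As a minimal sketch, reading the sample sheet at the top of a snakefile might
look like the following. `condition` is a real column in the provided CSV;
`replicate` is an assumed column name you should check against the actual
headers.

```python
import pandas as pd

# load the sample sheet once at the top of the snakefile
samples = pd.read_csv("sample_sheet.csv")

# the values that will drive the workflow's wildcards; `condition` is a real
# column in the provided CSV, `replicate` is an assumed column name
CONDITIONS = sorted(set(samples["condition"]))
REPS = sorted(set(samples["replicate"].astype(str)))
```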
Downloading the data
Last project, you gained some experience downloading files. For convenience,
I have provided in materials/project_2_chipseq/ the most up-to-date GENCODE
human primary assembly reference genome and its matching GTF. Simply copy
these files to the results/ directory in your working directory.
We will, however, be incorporating the downloading of the actual data files
into our workflows. Take a look at the provided `sample_sheet.csv` in the
repo template you cloned. This contains all of the information about our
samples; the `condition` column indicates which samples represent our IP and
which are the input control. You'll also notice the `ftp_link` column and an
associated address where each file can be downloaded. These are small
subsampled files that I created and uploaded to Zenodo (an open-source
data-sharing platform for science). You can use these small files to quickly
test your code, and they should run and finish quickly.
You'll notice that in the first week's snakefile, I have provided you a rule
that automatically downloads these files. At the same time, it also renames
them to make them more amenable to working with wildcards in snakemake. We
will discuss roughly what this rule, `wget_files`, is doing, and how it uses
a `lambda` expression to define a function on the fly. Do not alter this
rule, as it is responsible for setting up your files with consistent naming
patterns.
N.B.
You'll notice that the output of the rule that downloads your files sets up
one way to work with wildcards for this project. You may use this pattern,
`{condition}_{rep}`, along with the values from the CSV loaded in at the top
of the snakefile (the python variables CONDITIONS and REPS), to set up your
wildcards for this snakefile.
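To make the `lambda` idea concrete, a download rule of this shape might look
roughly like the sketch below. This is illustrative only: your template
already contains the real `wget_files` rule, and details such as the output
file extension and the `replicate` column name are assumptions here.

```python
import pandas as pd

samples = pd.read_csv("sample_sheet.csv")

# Illustrative only -- do not replace the provided wget_files rule with this.
rule wget_files:
    output:
        "samples/{condition}_{rep}.fastq.gz"
    params:
        # the lambda is evaluated once per job, looking up the FTP link whose
        # row matches this job's wildcard values (`replicate` name assumed)
        link = lambda wildcards: samples.loc[
            (samples["condition"] == wildcards.condition)
            & (samples["replicate"].astype(str) == wildcards.rep),
            "ftp_link",
        ].values[0],
    shell:
        # -O downloads and renames the file in a single step
        "wget -O {output} {params.link}"
```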
Listed below are the tasks you are responsible for accomplishing this week:
Quality Control
Similar to the last project, we are going to check some of the main quality metrics for NGS data using FastQC.
1. Create a snakemake rule that runs fastqc on the 4 samples and outputs
its results to the results/ directory
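A minimal sketch of such a rule, assuming gzipped FASTQs named with the
`{condition}_{rep}` pattern and a hypothetical env file name:

```python
rule fastqc:
    input:
        fastq = "samples/{condition}_{rep}.fastq.gz"
    output:
        # fastqc names its outputs after the input file's basename
        html = "results/{condition}_{rep}_fastqc.html",
        zip = "results/{condition}_{rep}_fastqc.zip"
    conda:
        "envs/fastqc_env.yml"  # hypothetical filename; use the provided env file
    shell:
        "fastqc {input.fastq} -o results/"
```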
Trimming
As we discussed in lecture, there are occasions where our sequencing reads are “contaminated” with portions of the sequencing adapter or the quality of the base calls is poor. We did not perform adapter or quality trimming on our reads in the last project since STAR performs local alignment and will automatically “trim” adapters by soft clipping reads (aligning only the portions that align well without enforcing end-to-end alignment). For this dataset, we will perform adapter trimming with trimmomatic, which will also trim low quality bases.
1. Create a snakemake rule that runs trimmomatic on the FASTQ files
- Use the provided adapter file in materials/project_2_chipseq/
- Include the following command and parameters:
- ILLUMINACLIP:{adapters}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
The official trimmomatic webpage will tell you to call the .jar file to run
trimmomatic (i.e. `java -jar trimmomatic.jar`). Instead, you can simply call
it as such:

trimmomatic SE <...your other flags and parameters...>

It will also automatically detect the proper encoding, so you do not need to
specify -phred33. Otherwise, as mentioned, you may use the set of arguments
listed above and the syntax in the official documentation to structure the
correct command.
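Putting that together, a single-end trimming rule might look like the sketch
below. The adapter and env file names are placeholders; point them at the
adapter file provided in materials/project_2_chipseq/ and the provided yml.

```python
rule trimmomatic:
    input:
        fastq = "samples/{condition}_{rep}.fastq.gz",
        adapters = "results/adapters.fa"  # hypothetical name for the adapter file
    output:
        trimmed = "results/{condition}_{rep}_trimmed.fastq.gz"
    conda:
        "envs/trimmomatic_env.yml"  # hypothetical filename; use the provided env file
    shell:
        "trimmomatic SE {input.fastq} {output.trimmed} "
        "ILLUMINACLIP:{input.adapters}:2:30:10 "
        "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15"
```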
Building a genome index
So far, I have provided you with pre-built genome indexes. If you recall, these indexes are tool-specific files that allow for more efficient searching of alignments in large reference sequences. For this week, I will ask you to build your own complete genome index in bowtie2 for the provided human reference genome. For convenience, I have provided you with a FASTA containing just the sequence of chromosome 21, which happens to be the smallest human chromosome. You may use this to quickly confirm that your bowtie2-build command is working as expected. Keep in mind that although this is the smallest chromosome, it is still 45 million base pairs long. Even when building this smaller index, you should run it on a compute node, not locally on the head node. It should still finish quickly enough to be of use in troubleshooting.
Keep the bowtie2 index generated from just chromosome 21, as you can use it next week when testing your alignment command.
1. Create a snakemake rule that runs bowtie2-build and generates an index
for the entire human reference genome
- Leave every parameter at default
bowtie2-build will create a series of six output files. You may track these
explicitly as your snakemake outputs or consider making use of `touch()`.
Just remember the advantages and disadvantages of using `touch()`.
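For example, a sketch using a `touch()` sentinel rather than listing all six
index files (the genome filename, index prefix, and env file name are
assumptions):

```python
rule bowtie2_build:
    input:
        genome = "results/GRCh38.primary_assembly.genome.fa"  # hypothetical filename
    output:
        # sentinel written only after bowtie2-build exits successfully; downstream
        # rules can depend on this file instead of the six .bt2 files
        touch("results/bowtie2_index.done")
    params:
        prefix = "results/GRCh38_index"  # hypothetical index prefix
    conda:
        "envs/bowtie2_env.yml"  # hypothetical filename; use the provided env file
    shell:
        "bowtie2-build {input.genome} {params.prefix}"
```

The trade-off is the one mentioned above: the sentinel keeps the rule tidy,
but snakemake can no longer detect if one of the actual index files is later
deleted or corrupted.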
Swap to the full data
We will talk more about the various bioinformatics databases that store sequencing data, but we will be using EMBL-ENA, as it stores FTP links to the full FASTQ files for most data deposited into GEO.
1. Navigate to the EMBL-ENA website and search for the GEO accession listed
   in the paper.
2. Go to the "Project" page for this accession and look under the "Read Files"
   section.
3. Match the SRR names found in the `sample_sheet.csv` with the files listed
   there.
4. Swap the `ftp_link` values in the CSV with the FTP links for the full files
   from EMBL-ENA.
5. You will need to change some other columns in the `sample_sheet.csv` to
switch it from working on the subsampled files to full files. Full files
should not have the string "subsampled" in them.
6. Re-run your snakemake workflow
Week 1 Tasks
Read the paper and familiarize yourself with their main goals and the general results they published
Understand the new strategy we will be employing to better integrate conda environments and have snakemake utilize specific environments for specific rules
Write a snakemake rule that runs fastqc as specified
Write a snakemake rule that runs trimmomatic as specified
Write a snakemake rule that runs bowtie2-build and generates a full genome index for the human reference genome
Swap your FTP links in the sample_sheet.csv with the links to the full data
you find on the EMBL-ENA website. You will need to change some other columns
in the sample_sheet.csv to switch it from working on the subsampled files to
the full files. Full files should not have the string "subsampled" in them.
You should now have both the subsampled data and the full data processed by
your week 1 snakemake file. Going forward, you can use the subsampled data to
troubleshoot your code before swapping your snakefile's logic to run on the
full data. In order to swap between your subsampled and full files, you will
need to edit the columns in the provided sample_sheet.csv appropriately.