Week 1: RNAseq

Week 1 Overview

A basic RNAseq analysis consists of sample quality control, alignment, quantification and differential expression analysis. This week, we will be performing quality control analysis on the sequencing reads, generating a genome index for alignment, and making a mapping of human ensembl IDs to gene names.

Objectives

  • Perform basic quality control by running FASTQC on the sequencing reads

  • Generate a delimited file containing the mapping of human ensembl ids to gene names

  • Use STAR to create a genome index for the human reference genome

Create a working directory for project 1

Accept the github classroom link and clone the assignment to your student directory in /projectnb/bf528/students/your_username/. This link will be posted on the blackboard site for our class.

Locating the data

For this first project, we have provided you with small subsets of the original data files. You should copy these files to the samples/ folder in your working directory that you cloned. For your convenience, we have renamed the files to be descriptive of the samples they represent following the pattern: {condition}{rep}{readpair}.subset.fastq.gz (i.e. control_rep1_R1.subset.fastq.gz)

The files are located here: /projectnb/bf528/materials/project-1-rnaseq/subsampled_files and there are 12 total files. Remember that paired end reads are typically used together as both the _R1 and _R2 files represent sequencing reads from the same fragments. This means that we have 6 biological samples, and 12 actual files representing those samples.

  1. Create a new directory in samples/ called subsampled_files/

  2. Using cp, make all of these files available in your subsampled_files/ directory

  3. In this same directory, /projectnb/bf528/materials/project-1-rnaseq/, you will also find a refs directory. Copy the files contained within to your refs/ directory. Encode their paths in separate params in your nextflow.config.

Changes to our environment management strategy and workflow

As we’ve discussed, conda environments are one solution to ensuring that your analyses are run in a reproducible and portable manner. Containers are an alternative technology that have a number of advantages over conda environments. Going forward, we will be incorporating containers specifying our computational environments into our pipeline.

One of the advantages of container technologies is that they usually offer a robust ecosystem of shared containers (images) that are available in public repositories for anyone to reuse. We have developed a set of containers for each piece of software that we will be using in this course.

For all future process scripts that you develop in nextflow, instead of specifying the conda environment to execute the task with, you will instead specify container and the image (a container built with a certain specification) location. For example:

process FASTQC {
  container 'ghcr.io/bf528/fastqc:latest'
  ...

}

For this course, these containers were all pre-built and their specifications kept in this public github repository: https://github.com/BF528/pipeline_containers. Feel free to look into the repo for how the containers were built. We follow a similar pattern where we specify the exact environment we would like created in a YAML file and then generate the environment using micromamba installed in a container. We will get some experience later in the semester with building your own containers from scratch.

In general, the containers will be named following the same pattern: ghcr.io/bf528/<name-of-tool>:latest or ghcr.io/bf528/fastqc:latest.

New command for running your workflow

You will need to incorporate one more change to make your pipeline use these containers. If you look in the nextflow.config, you’ll see a profile labeled singularity, which encodes some singularity options for nextflow to use.

When running your pipeline, you will now use the -profile singularity,local option, which will have Nextflow execute tasks with the specified container image listed in the module. The config contains some common options that nextflow will automatically add to the singularity command when run with the specified profile.

nextflow run main.nf -profile singularity,local

As always, remember to activate your conda environment containing nextflow and nf-test before doing any work.

Docker images for your pipeline

FastQC: ghcr.io/bf528/fastqc:latest

multiQC: ghcr.io/bf528/multiqc:latest

VERSE: ghcr.io/bf528/verse:latest

STAR: ghcr.io/bf528/star:latest

Pandas: ghcr.io/bf528/pandas:latest

Generating our input channels for nextflow

This RNAseq pipeline will be driven by two channels that contain the starting FASTQ files for each of the samples in the experiment. Nextflow has a built-in function to simplify the generation of channels for samples from paired-end sequencing experiments. The Channel.fromFilePairs channel function allows you to detect paired end fastq files for each sample using similar pattern matching in bash through the use of wildcard expansion (*).

  1. In the nextflow.config, specify a parameter called reads that encodes the path to your fastq files and uses * to flexibly detect the sample name associated with both paired files. Refer to the nextflow documentation here

  2. Change explanation of where to encode path with *

  3. In your workflow main.nf, use the Channel.fromFilePairs function and the param you created in step 1 to create a channel called align_ch. You’ll notice that this function creates a channel with a structure we’ve seen before: a tuple containing the base name of the file and a list containing the R1 and R2 file associated with that sample.

align_ch

[sample1, [sample1_R1.fastq.gz, sample1_R2.fastq.gz]]
  1. In your workflow main.nf create another channel using the exact logic from above but add an additional operation to create a channel that has as many elements in the channel as there are actual files (16). You may find information on common nextflow operators that will enable this here

Name this channel fastqc_channel and it should be a list tuples where the first value is the name of the sample and the second is the path to the associated file. This will look something like below:

fastqc_channel

[sample1, sample1_R1.fastq.gz]
[sample1, sample1_R2.fastq.gz]

Performing Quality Control

At this point, you should have:

  • Setup a directory for this project by accepting the github classroom link

  • Copied the subsampled files to your working directory, samples/

  • Familiarized yourself with the changes to environment management and how to run nextflow using containers

  • Generate two channels in your main.nf with the size and elements specified above

We will begin by performing quality control on the FASTQ files generated from the experiment. fastQC is a bioinformatics software tool that calculates and generates descriptive graphics of the various quality metrics encoded in a FASTQ file. We will use this tool to quickly check the basic quality of the sequencing in this experiment.

  1. In your envs/ directory, make a YML specification file to create a conda environment with the latest version of FASTQC installed. Use the same command as last week to generate a new environment with just fastqc installed:
conda env create -f envs/test_env.yml
  1. Activate this new environment after it’s created, and navigate to the temp/ directory. You can view the help information for FASTQC by calling the program in a terminal with fastqc -h or for some programs fastqc --help. Try to run FASTQC on the provided fastq file.

By default, FASTQC will create two output files from a single input named as seen below:

fastqc sample1.fastq.gz

Outputs:
sample1_fastqc.html
sample1_fastqc.zip
  1. Make a new process for fastqc in the modules/ directory. Be sure to specify the following in your module:
label 'process_low'
container 'ghcr.io/bf528/fastqc:latest'
publishDir <a param containing the path to results/>
  1. Specify the inputs of the process to match the structure of the fastqc_channel we just generated.For this module, please list two named outputs, which will later allow us to use the individual outputs separately:
output:
tuple val(name), path('*.zip'), emit: zip
tuple val(name), path('*.html'), emit: html

This output definition will instruct nextflow that both of these files should exist after FASTQC has run successfully. It will capture the html report and zip file separately and allow you to pass these different values through channels separately (i.e. FASTQC.out.zip would refer specifically to the zip file created by the FASTQC task)

Remember you can use wildcard expansions in the path output to flexibly detect files with certain extensions without specifying their full filename.

  1. The shell command should be the successful fastqc command you ran earlier. Ensure that the following argument is included in your actual fastQC command:
-t $task.cpus
  1. Incorporate the FASTQC process into your workflow main.nf and provide it the proper channel.

Generate a file containing the gene IDs and their corresponding human gene symbols

As we’ve discussed, it’s often more intuitive for us to use gene names rather than their IDs. While there are external services that can perform this for us (biomaRt), it is often better to extract this information from the GTF file associated with the exact version of the reference genome we are using. This will ensure that we are as consistent as possible with our labeling.

Whenever we need to perform operations using custom code, we are going to use the conventions established in project 0. We will place this script in the bin/ directory and make it executable. We will then create a nextflow module that will provide the appropriate command line arguments to the script.

  1. Generate a python script that parses the GTF file you were provided and creates a delimited file containing the ensembl mouse ID and its corresponding gene name. Please copy and modify the argparse code used in the project 0 script to allow the specification of command line arguments. The script should take the GTF as input and output a single text file containing the requested information.

  2. Create a nextflow module that calls this script and provides the appropriate command line arguments necessary to run it.

  3. You may use the biopython (ghcr.io/bf528/biopython:latest) or the pandas (ghcr.io/bf528/pandas:latest) container to run this task as both of these contain a python installation.

  4. Incorporate this module to parse the GTF into your workflow main.nf and pass it the appropriate GTF input encoded as a param.

0.6 Creating your own process labels

As we discussed in class, you can request various ranges of computational resources for your jobs / processes to use through the qsub command and its accompanying options.

Please refer to the following page for common combinations of options to request specific amounts of resources from nodes on the SCC.

  1. By now you have noticed that some modules have a label and you can see in the nextflow.config in the profiles section the exact qsub command and options each label requests.

  2. Create two new labels named process_medium and process_high. For process_medium, use the correct options to request 8 cores and <= 32GB ram For process_high, set the options to request 16 cores and <= 128GB ram.

You will need to add a clusterOptions line to the process_high to specify the additional flags.

Generate a genome index

Most alignment algorithms require an index to be generated to make the alignment process more efficient and expedient. These index files are both specific to the tool and the reference genome used to build them. We will be using the STAR aligner, one of the most commonly used alignment tools for RNAseq.

  1. Read the beginning of the documentation for how to generate a STAR index.

  2. Generate a module and process that creates a STAR index for our reference genome. This process will require two inputs, the reference genome and the associated GTF file. Ensure that the following are specified in your module:

label 'process_high'
container 'ghcr.io/bf528/star:latest'
  1. The outputs of this process will be a directory of multiple files that all comprise the index. You will have to create this directory prior to running the STAR command. Use all of the basic options specified in the manual and leave them at their default values. Be sure to also include the following argument in your STAR command:
--runThreadN $task.cpus
  1. Incorporate this process into your workflow and pass it the appropriate inputs from your params encoding the path to the reference genome fasta and GTF file.

Week 1 Tasks Summary

  1. Clone the github classroom link for this project

  2. Copy the necessary files to your working directory. These files will include the following:

  • Gencode hg38 reference genome for the primary assembly
  • Gencode GTF for the primary assembly
  • The subsampled FASTQ files
  1. Generate a nextflow channel called align_ch that has 8 total elements where each element is a tuple containing two elements: the name of the sample, and a list of both of the associated paired end files.

  2. Generate a nextflow channel called fastqc_channel that has 16 total elements where each element is a tuple containing two elements: the name of the sample, and one of the FASTQ files.

  3. Generate a module that successfully runs FASTQC using the fastqc_channel

  4. Develop an external script that parses the GTF and writes a delimited file where one column represents the ensembl mouse IDs and the value in the other column is the associated MGI symbol.

  5. Generate a module that successfully creates a STAR index using the params containing the path to your reference genome assembly and GTF