Lab 3 - Using the SCC and basic nextflow
Today we are going to extend the workflow we built last time by adjusting our qsub script to request a specific number of cores and amount of memory. We will then begin converting our basic workflow into a proper Nextflow pipeline. I have provided the scripts we used last time in the bin/ directory.
Lecture on SCC Resources (15 minutes)
Iteration 1 (15 minutes)
Don’t repeat yourself (DRY) - using third-party tools (15 minutes)
Often, a workflow needs a basic task that someone else has already implemented. DRY (“Don’t Repeat Yourself”) is a software development principle meant to reduce the repetition of code and make it easier to maintain. We can apply this same idea to basic tasks in bioinformatics.
As it happens, Biopython is a well-supported and actively maintained package that provides a large number of tools for computational molecular biology. In particular, it has functions for parsing bioinformatics data files and performing common operations on them, such as calculating the length of a sequence.
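To make the DRY idea concrete, here is a minimal sketch that computes sequence lengths twice: once with a hand-rolled FASTA parser and once with Biopython's `SeqIO.parse`. The toy FASTA text and the function names are assumptions made up for illustration; this is not the lab's script, and the Biopython half only runs when the package is installed (for example, inside a conda environment that provides it).

```python
from io import StringIO

try:
    from Bio import SeqIO  # Biopython's sequence file parser
except ImportError:
    SeqIO = None  # Biopython is not installed in this environment

# Toy FASTA data, invented for illustration (not the lab genome).
FASTA_TEXT = """>seq1 toy record
ACGTACGT
ACGT
>seq2 another toy record
GGCC
"""


def lengths_by_hand(handle):
    """Hand-rolled FASTA parsing: the kind of code DRY says not to rewrite."""
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:].split()[0]  # first word of the header
            lengths[current_id] = 0
        elif current_id is not None:
            lengths[current_id] += len(line)
    return lengths


def lengths_with_biopython(handle):
    """Same result, with SeqIO doing the parsing for us."""
    return {record.id: len(record.seq) for record in SeqIO.parse(handle, "fasta")}


by_hand = lengths_by_hand(StringIO(FASTA_TEXT))
print(by_hand)

if SeqIO is not None:
    # Both approaches should agree on the toy data.
    assert lengths_with_biopython(StringIO(FASTA_TEXT)) == by_hand
```

The Biopython version is a single expression, and it relies on a parser that is already maintained and tested by others, which is the point of the principle.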
Your turn
- Adjust the calc_length.py script to use Biopython to calculate the length of the sequence in the FASTA file.
Resources for Computational Environments
Class Lecture
Textbook Guide
- Create a conda environment file called biopython_env.yml in the envs/ directory
Add a process for running your python script
Now we are going to add a process that runs the Python script on the downloaded file. Add the following lines to your script, directly beneath the DOWNLOAD process and above the workflow block:
process CALCULATE_LENGTH {
    conda 'envs/biopython_env.yml'

    input:
    path(fasta)

    output:
    path("length.txt")

    script:
    """
    calculate_length.py -i $fasta -o length.txt
    """
}
Now update your workflow block so that it resembles the following:
workflow {
    DOWNLOAD()
    CALCULATE_LENGTH(DOWNLOAD.out)
}
Now try re-running your script with the following command:
nextflow run main.nf -profile conda
Find where the CALCULATE_LENGTH process executed and report the length of the genome in the length.txt file.
Iteration 2
By now, we have developed a very basic Nextflow pipeline. However, we can improve upon this by adding a few features to make it more robust and easier to use.
Make a new text file called improved_main.nf and copy the contents of main.nf into it.
Add values to your nextflow.config
You have already seen the nextflow.config file in the root of your directory. Notice how, in the params section, we have defined a number of variables that we can use in our Nextflow script. I have provided you with the value for the FTP link to download the genome we’ve been working with.
Your Turn:
- Adjust the process DOWNLOAD to reflect the following changes:
  - Add an input to the process that takes in the FTP link (hint: use the input keyword and the val keyword to pass the value in params)
  - Adjust the output to capture any .fna.gz files downloaded (hint: use the * wildcard, it works the same as in bash)
  - Adjust the script block to use the value provided in the input
- Use the value in params (params.ftp_link) and pass it as an argument to the DOWNLOAD process.
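As a quick check of what a pattern like `*.fna.gz` captures, the sketch below uses Python's `fnmatch` module as a stand-in for shell-style globbing. The file names are hypothetical, invented for illustration, and `fnmatch` is only assumed to agree with bash and Nextflow for simple patterns like this one.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for illustration.
downloaded = [
    "example_genomic.fna.gz",
    "example_genomic.fna",
    "example_genomic.gff.gz",
    "md5checksums.txt",
]

# '*' matches any run of characters, so this keeps names ending in .fna.gz
pattern = "*.fna.gz"
captured = [name for name in downloaded if fnmatchcase(name, pattern)]
print(captured)
```

Only the name ending in `.fna.gz` is kept. The uncompressed `.fna` file and the `.gff.gz` file are not matched.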