Week 1: Genome Analytics

Any tasks you are asked to complete will be wrapped and styled like this section. If you are asked to answer questions or report findings, please do so in the included docs/week#_tasks.Rmd.

All other text will be directions or other explanatory text that you should read.

Project 0 Overview

This project will get you accustomed to how Nextflow works and its key functions. You will construct a small Nextflow pipeline that annotates a small bacterial genome using Prokka, calculates some basic genome statistics using Python (GC content and genome size), and extracts the sequence for a specified region of the genome by its coordinates. The general format and structure of this repository and Nextflow pipeline will be used for all other projects.

Objectives

  • Understand how to encode environment specifications for conda in YML files
  • Gain experience understanding how to run Nextflow and create channels
  • Acclimate to the structure and style of projects as workflows

Getting Started

  1. Accept the provided GitHub Classroom link for project 0 on Blackboard and clone it to your directory under the bf528 class space (/projectnb/bf528/students/username/).

  2. Please read through the provided README.md, which explains the structure of the directory we are going to be working with throughout the semester.

Setting up our environment

Before we begin working with Nextflow, we will need to install it. For this, we will be using Conda to create an environment where we just have Nextflow (and its direct dependencies) installed. For all project work in this class, you will need to remember to activate this environment as soon as you begin.

Following the practices recommended here, we will create this conda environment directly from a YML file, and this file will be version controlled and included in the analysis repository. Using VSCode, navigate to the envs/ directory, open a new file, and name it base_env.yml. Below is an example of what these YML files should look like when used to create conda environments:

name: name_of_env
channels:
- conda-forge
- bioconda

dependencies:
- desired_package=version_number

A few things to remember:

  • The order of channels is important! Developers and maintainers of Bioconda (the channel where most bioinformatics packages are hosted) have agreed that conda-forge should be listed first, and bioconda second. This is the channel priority, and conda uses this internally to ensure that it solves dependencies correctly.

  • Each YML file should specify a single package or tool. Pin the exact version you want directly in the file to ensure transparency and reproducibility.

  • Conda will take care of the minor software dependencies, and you generally should not attempt to manually install specific versions of those.

  • The value after name will be the name of your conda environment and the name you use to activate it and make the packages and software available to you for use.
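As a concrete illustration of the template and notes above, the sketch below writes a filled-in environment file from the terminal. The environment name nextflow_base is a hypothetical choice, and the pinned version is the one requested for this project. Creating the same file by hand in VSCode works equally well.

```shell
# Make sure the envs/ directory exists before writing into it.
mkdir -p envs

# Write the environment specification.
# "nextflow_base" is a hypothetical name; pick whatever is easiest to remember.
cat > envs/base_env.yml <<'EOF'
name: nextflow_base
channels:
  - conda-forge
  - bioconda
dependencies:
  - nextflow=24.04.2
EOF

# Print the file back to confirm its contents.
cat envs/base_env.yml
```

Note that conda-forge is listed before bioconda, matching the channel priority described above, and that the file pins a single tool at an exact version.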

  1. For our purposes, you will be using a slightly older version of Nextflow to ensure that these directions do not get out of sync. Create a YML file in the envs/ directory named base_env.yml and specify that you wish to install Nextflow version 24.04.2. Remember that the value after name in the file will be the name of the conda environment created; you may name this whatever is easiest for you to remember. We will be using this single environment all semester.

  2. Separately, using either conda search -c conda-forge -c bioconda <package-name> or by navigating to the tool’s website / github, report the most recent version available for Nextflow in the provided notebook.

  3. Answer the following questions in the provided notebook:

    1. What is the advantage of using the most up-to-date versions of software?

    2. What are some disadvantages?

  4. Open a terminal at the top level of your directory and use the conda env create -f envs/base_env.yml command to create the environment described in your YML file. Remember that for all work in this class, you will want to activate this environment as soon as you log into your VSCode session, before you start working. To do this, perform the following steps in your VSCode terminal:

    1. module load miniconda
    2. conda activate name_of_nextflow_env

Adding reference files to our directory

If you look in , you have been provided with a FASTA file of an unknown bacterial genome.

  1. Pick one of the three available genomes and copy a single FASTA file representing an unknown bacterial genome to your refs/ directory.

Generating a samplesheet

For all of our projects, we will encode all sample information, including sample metadata and sample file paths, in a single CSV file. We will then use Nextflow to read the information stored in this samplesheet to drive the workflow and specify which files we want it to operate on.

Encoding the samples in this way offers two key benefits:

  1. We can easily see all of the files associated with a particular analysis

  2. We can generalize our workflow and apply it to a different set of samples by simply supplying a different CSV containing the information for those samples.

  1. Using VSCode, manually make a new text file at the top level of your directory (project-0-genome-analysis) and generate a simple CSV named samplesheet.csv with two columns named name and path and one row of data. That row should specify the name of the genome you picked (look at the filename or the first line in the file) and the path to where it is located relative to the top-level directory (refs/your_genome_name.fna).

  2. Next, open the nextflow.config file, and under params add a value underneath the “//Reads and references” comment that encodes the path to your samplesheet. You may use the Nextflow shorthand $projectDir to refer to the directory containing main.nf (the top level of your repository).
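For reference, the samplesheet from step 1 is small enough to write in a single terminal command, as an alternative to creating it in VSCode. This is a minimal sketch: example_genome and refs/example_genome.fna are hypothetical placeholders and should be replaced with the name and path of the genome you copied.

```shell
# Write a two-column samplesheet: a header row plus one row of data.
# The genome name and path below are hypothetical placeholders.
cat > samplesheet.csv <<'EOF'
name,path
example_genome,refs/example_genome.fna
EOF

# Print the file back to confirm its contents.
cat samplesheet.csv
```

Keeping the header exactly as name,path matters, since the column names are what the workflow will use to look up each value.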

Creating our first nextflow channel and process

Please refer to the nextflow lecture for more details on this section.

Nextflow channels have a much more technical definition, but you may think of them as a way in which nextflow sends data through a workflow. Channels are connected to processes through the inputs and outputs of the different steps of the workflow.

For our purposes, the nextflow channels we create will often follow a similar pattern of containing a set of files associated with a single sample, and associated metadata. Our channels will contain the file(s) and accessory information needed for whatever process or tool is being used. Often, this accessory information will be sample identifiers, which will be used to name output files or specify options during runtime.

As you can see from the samplesheet you generated, our first nextflow channel will contain two values, the name of the genome and its file path. This will allow us to pass this information to our desired processes for further analysis or manipulation.

To run nextflow, simply make sure that your environment with nextflow installed is activated and then use a terminal to enter the following command in the directory where your nextflow script (main.nf) is located:

nextflow run main.nf -profile local,conda

  1. Take a look at the nextflow.config and look for the profile section. Answer the following question in the provided notebook:

    1. What do you think the option -profile local,conda is doing?

  2. Using VSCode, open the main.nf file and begin by adding our first channel in the workflow declaration. This channel should read the information from the samplesheet and create a tuple containing the values stored in name and path, in that order.

  3. This will require a combination of nextflow functions and operators including: Channel.fromPath, splitCsv, map, tuple and set. You can also use view to see the contents of a generated channel.

  4. Your channel when observed using view should look something like: [name_of_bacterial_genome, /path/to/reference/genome.fna]

  5. When you have successfully created this channel, you should use the set function to assign it to a variable. Name this variable fa_ch. set should replace the usage of view, which does not assign your channel to a variable but simply prints its contents to standard out.
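Before writing the channel, it can help to preview the [name, path] pairs you expect it to emit. The sketch below is only a shell analogy of reading a CSV with a header and mapping each row to a pair; it is not Nextflow code and does not replace the task above. The file samplesheet_preview.csv, the function preview_rows, and the genome name are all hypothetical.

```shell
# Hypothetical samplesheet used only for this preview.
printf 'name,path\nexample_genome,refs/example_genome.fna\n' > samplesheet_preview.csv

# Skip the header row, split each remaining row on the comma,
# and print the two fields as a [name, path] pair.
preview_rows() {
    tail -n +2 "$1" | while IFS=, read -r genome_name genome_path; do
        echo "[$genome_name, $genome_path]"
    done
}

preview_rows samplesheet_preview.csv
# prints: [example_genome, refs/example_genome.fna]
```

The channel you build in Nextflow should show the same two values in the same order when inspected with view, although the path it prints may be absolute rather than relative.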

Week 1 Tasks Summary

  1. Cloned the github classroom link for project 0 to your student directory in the bf528 project space.

  2. Read through the README.md, which describes the directory structure we will be utilizing throughout the semester.

  3. Generated a YML file in the envs/ directory that contains the specification for your base environment with Nextflow version 24.04.2 installed.

  4. Copied a single genome fna (FASTA) to your refs/ directory.

  5. Made the samplesheet.csv file at the top-level of your directory containing the columns name and path with one row being the name and path for your single genome file.

  6. Edited the nextflow.config file and added a param containing the path to your samplesheet.

  7. Generated your first nextflow channel, fa_ch, containing a tuple with the name and path stored in your samplesheet.