Week 4: Genome Analytics

Week 4 Overview

For the first weeks, you were provided pre-downloaded reference genomes to use in your pipeline. Oftentimes, you will want to include downloading any files inside your pipeline to enable it to run end-to-end without any manual intervention. For this week, you will make use of the NCBI datasets CLI tool to download a genome of your choice and automatically process it using the workflow you have previously developed.

Objectives

  • Explore and navigate through the NCBI genomes resource

  • Utilize the NCBI Datasets CLI tool and incorporate it into your workflow

  • Set up a jupyter notebook to work in VSCode with a specific Conda environment

  • Generate a Circos plot of your newly annotated genome

Explore the use of NCBI Datasets CLI

The NCBI Datasets CLI is the recommended means for accessing the comprehensive information provided by the NCBI. The NCBI has packaged together a web interface, API, and command line tool that enables retrieving comprehensive information about genomes, genes, and taxonomies. This tool makes it incredibly simple to download one or many genomes and any available associated annotation information desired.

  1. Go to the NCBI genomes website (https://www.ncbi.nlm.nih.gov/home/genomes/) and search for a microbe of your choice. Browse the search results and explore all of the information available to you and provided by the NCBI.

Generate a module that downloads a genome using the NCBI Datasets CLI

Navigate to the genomes section for your microbe of interest and if available, choose the assembly marked as “reference genome” for your chosen organism. The representative genome will be marked with a star as seen below:

If you look at the genome assembly page for your chosen genome, you should see an option for download using datasets. If you click on that option, you will be provided a template command for how to use the CLI tool to download various files associated with the chosen assembly.

Make sure to note the command as well as the particular assembly name for the genome (GCF). You will use this command and modify it so that you can provide any valid genome assembly (GCF) and to download only the genomic FASTA file.

  1. Copy your previous workflow main.nf to a new file called complete.nf. This will be the new workflow that you will be modifying for this week.

  2. Copy the provided CSV (/projectnb/bf528/materials/project-0-genome-analytics/ncbi_samplesheet.csv) and add the genome you chose and its associated genome accession (GCF_*) listed in the datasets command.

  3. Create a channel in your nextflow workflow complete.nf containing information found by reading from the CSV you created.

  4. Read the manual page for running the datasets command.

  5. Complete the provided NCBI datasets module by using the correct command for downloading files using the datasets download genome command. You only need to download the genome FASTAS and do not need the other files.

  6. By default, the NCBI datasets command will download a .zip file containing several nested directories containing metadata files and the requested files. Include a second shell command after your NCBI datasets command that unzips the file.

  7. Examine the provided output for this module and note how it contains a tuple with the metadata value passed in the input, and the .fna file for your selected accession. Remember that we can use the * glob pattern matching syntax in the output to flexibly capture different files and ** to recurse through directories. Note that because of the staging directory strategy, the paths for this file are all relative.

Refactor your workflow to incorporate downloading a genome as the first step

After you have successfully completed the module that will run the NCBI Datasets CLI command and download your requested genome FASTA, you will need to incorporate this process into your workflow complete.nf.

  1. Add your NCBI Datasets CLI module into your workflow complete.nf and have your pipeline use the outputs of this module directly. When run from the beginning, your workflow should seamlessly download the requested genome and perform the previously described processes (Prokka, GC content, etc.). You should accomplish this by making it so you can pass the outputs of your NCBI Datasets CLI process directly to the downstream processes in your workflow complete.nf. Remember to make sure that the outputs of your NCBI datasets CLI module match what is expected by the other modules’ inputs.

  2. Remember that we can use the NAME_OF_PROCESS.out command to pass the outputs of a specific process to another. Ensure that the output channels match what is expected by the inputs for the other processes.

Make a conda environment with pycirclize and ipykernel

We will often analyse the results of our pipelines extensively in a jupyter notebook as it allows us to seamlessly execute code, create visualizations, and provide text annotations in the same place. With larger projects and complicated analyses, we may often require many different software packages while working in our notebooks and it is incredibly easy for these notebooks to become unorganized jumbled messes of different packages. In the interest of reproducibility, we will directly create a conda environment that contains all of the software we require for our notebooks in the style we have been using in this class. This way if we ever needed to share the work done in these notebooks, we can simply provide the environment and the actual notebook separately.

With the Jupyter extension in VSCode, we can directly work with jupyter notebooks and also run these notebooks with specific conda environments active. The IPyKernel package will allow us to launch a python process as a kernel and execute code against a jupyter notebook. Kernels are simply programming language specific processes that run independently and interact with jupyter.

  1. Make a YML file in envs/ according to the conventions we’ve talked about that will create a conda environment with the latest versions of pycirclize and ipykernel installed. You may name it whatever is easy for you to remember.

  2. Use the conda env create -f envs/<name_of_yml_from_above>.yml to actually create this environment.

  3. In VSCode, use the command palette to create a new jupyter notebook. Go to the top right where it says Select Kernel and choose the Python Environments... option and find the conda environment you just created. This will enable you to utilize all of the software packages installed within that environment in this notebook.

Generate a Circos plot of one of your new annotated genome using the GFF

If you were in BF591, we discussed briefly the purpose and usage of Circos plots. If not, you may look here for a short discussion on these circular layout plots. While they can be difficult to interpret when overlaying quantitative data, we will be making a Circos plot of the genome and genome annotations generated by Prokka on your genome of choice. A Circos plot works well here because most bacterial genomes are circular and we are only using it to illustrate the structure of the genome and the organization of the genes.

  1. In your notebook with your newly created conda environment as the kernel, use pycirclize and the GFF you generated from one genome to create a Circos plot of the genome and its annotations. Hint: You can simply copy the example code from pycirclize to make this plot.

Week 4 Detailed Tasks Summary

  1. Choose a prokaryotic genome and use the NCBI genomes site to retrieve the NCBI datasets CLI command to download a genome fasta file by specifying an assembly.

  2. Create a CSV containing the scientific name of your microbe and the assembly you wish to download from above

  3. Generate a nextflow module that will run the NCBI datasets CLI to download a genome of your choice

  4. Create a channel containing the information from the CSV and run the NCBI datasets CLI module using this channel

  5. Modify your complete.nf to use the outputs from your NCBI datasets CLI command to run the full pipeline

  6. Create a conda environment containing the latest versions of ipykernel and pycirclize

  7. Generate a Circos plot in a jupyter notebook using the GFF created by Prokka to create a simple visualization of the genome and genome annotations for your chosen microbe.