Week 4: Genome Analytics

Week 4 Overview

For the first weeks, you were provided pre-downloaded reference genomes to use in your pipeline. Oftentimes, you will want to include downloading any files inside your pipeline to enable it to run end-to-end without any manual intervention. For this week, you will make use of the NCBI datasets CLI tool to download a genome of your choice and incorporate this into your workflow.

Objectives

  • Explore and navigate through the NCBI genomes resource

  • Utilize the NCBI Datasets CLI tool and incorporate it into your workflow

  • Set up a jupyter notebook to work in VSCode with a specific Conda environment

  • Generate a Circos plot of your newly annotated genome

Explore the use of NCBI Datasets CLI

The NCBI Datasets CLI is the recommended means for accessing the comprehensive information provided by the NCBI. The NCBI has packaged together a web interface, API, and command line tool that enables retrieving comprehensive information about genomes, genes, and taxonomies. This tool makes it incredibly simple to download one or many genomes and any available associated annotation information desired.

  1. Go to the NCBI genomes website (https://www.ncbi.nlm.nih.gov/home/genomes/) and search for a microbe of your choice. Browse the search results and explore all of the information available to you and provided by the NCBI.

Generate a module that downloads a genome using the NCBI Datasets CLI

Navigate to the genomes section for your microbe of interest and if available, choose the assembly marked as “reference genome” for your chosen organism. The representative genome will be marked with a star as seen below:

If you look at the genome assembly page for your chosen genome, you should see an option for download using datasets. If you click on that option, you will be provided a template command for how to use the CLI tool to download various files associated with the chosen assembly.

Make sure to note the command as well as the particular assembly name for the genome (GCF). You will use this command and modify it so that you can provide any valid genome assembly (GCF) and to download only the genomic FASTA file.

  1. Create a CSV containing two columns with the scientific name of the organism you chose and its associated genome accession (GCF_*) listed in the datasets command.

  2. Create a channel in your nextflow workflow main.nf containing information found by reading from the CSV you created.

  3. Read the manual page for running the datasets command to figure out what file it outputs and what each of the options specified in the observed command are doing.

  4. Modify the command to only download the genomic FASTA file and structure the module to be generalizable to any genomic accession.

Refactor your workflow to incorporate downloading a genome as the first step

After you have successfully generated a module that will run the NCBI Datasets CLI command and download your requested genome FASTA, you will need to incorporate this process into your workflow.

  1. Add your NCBI Datasets CLI module into your workflow main.nf and have your pipeline use the outputs of this module directly. When run from the beginning, your workflow should seamlessly download the requested genome and perform the previously described processes (Prokka, GC content, etc.)

Make a conda environment with pycirclize and ipykernel

We will often analyse the results of our piplines extensively in a jupyter notebook as it allows us to seamlessly execute code, create visualizations, and provide text annotations in the same place. With larger projects and complicated analyses, we may often require many different software packages while working in our notebooks and it is incredibly easy for these notebooks to become unorganized jumbled messes of different packages. In the interest of reproducibility, we will directly create a conda environment that contains all of the software we require for our notebooks in the style we have been using in this class. This way if we ever needed to share the work done in these notebooks, we can simply provide the environment and the actual notebook separately.

With the Jupyter extension in VSCode, we can directly work with jupyter notebooks and also run these notebooks with specific conda environments active. The IPyKernel package will allow us to launch a python process as a kernel and execute code against a jupyter notebook. Kernels are simply programming language specific processes that run independently and interact with jupyter.

  1. Make a YML file in envs/ according to the conventions we’ve talked about that will create a conda environment with the latest versions of pycirclize and ipykernel installed. You may name it whatever is easy for you to remember.

  2. Use the conda env create -f envs/<your-env>.yml to actually create this environment.

  3. In VSCode, use the command palette to create a new jupyter notebook. Go to the top right where it says Select Kernel and choose the Python Environments... option and find the conda environment you just created. This will enable you to utilize all of the software packages installed within that environment in this notebook.

Generate a Circos plot of your new annotated genome using the GFF

If you were in BF591, we discussed briefly the purpose and usage of Circos plots. If not, you may look here for a short discussion on these circular layout plots. While they can be difficult to interpret when overlaying quantitative data, we will be making a Circos plot of the genome and genome annotations generated by Prokka on your genome of choice. A Circos plot works well here because most bacterial genomes are circular and we are only using it to illustrate the structure of the genome and the organization of the genes.

  1. In your notebook with your newly created conda environment as the kernel, use pycirclize and the GFF you generated from your genome to create a Circos plot of the genome and its annotations. Hint: You can simply copy the example code from pycirclize to make this plot.

Week 4 Detailed Tasks Summary

  1. Choose a prokaryotic genome and use the NCBI genomes site to retrieve the NCBI datasets CLI command to download a genome fasta file by specifying an assembly.

  2. Create a CSV containing the scientific name of your microbe and the assembly you wish to download from above

  3. Generate a nextflow module that will run the NCBI datasets CLI to download a genome of your choice

  4. Create a channel containing the information from the CSV and run the NCBI datasets CLI module using this channel

  5. Refactor the rest of your workflow main.nf to use the outputs from your NCBI datasets CLI command to run the complete pipeline

  6. Create a conda environment containing the latest versions of ipykernel and pycirclize

  7. Generate a Circos plot in a jupyter notebook using the GFF created by Prokka to create a simple visualization of the genome and genome annotations for your chosen microbe.