Conda
Overview
Conda is a package manager and environment manager that allows you to manage your packages and environments. Conda enables you to create isolated environments with particular versions of packages installed. Conda also handles the dependencies of packages automatically, ensuring that you have the packages and most importantly, the right versions of the packages you need to run your code.
Conda Channels
Conda channels are repositories of packages that can be installed using conda. Developers and maintainers will often contribute their packages to channels so that others can easily install them.
In bioinformatics, there are two main channels we will be utilizing:
- conda-forge
- bioconda
You will often need to specify these channels when installing bioinformatics packages via conda to prevent conflicts with other packages or dependencies.
N.B. The order of these channels is important! The order determines the priority of the channels and where conda will attempt to install the packages from first. In bioinformatics, conda-forge should always be specified first, followed by bioconda.
How we will use conda
For the most part, we will be allowing Nextflow to handle the conda environments for us. We will simply define the environments we want created in a YML file and Nextflow will handle the rest. However, it is useful to know a few basic features and commands for conda.
You will always be using a single conda environment that has Nextflow installed whenever you are working on the SCC.
Basic Conda Workflow on the SCC (we will not often use this)
Although I have used the term conda, we will actually be using miniconda on the SCC. Miniconda is a stripped-down version of anaconda/conda that only includes the basic functionality we need to manage our environments. Miniconda on the SCC is available via a module.
The basic workflow for using miniconda is as follows:
- Load the miniconda module
- Create a conda environment
- Activate the environment
Step 1: Load the miniconda module
module load miniconda
This will load the miniconda module and make the conda commands available to you.
Step 2: Create a conda environment
conda create -n nextflow_env -c conda-forge -c bioconda nextflow
This will create a new conda environment named nextflow_env
with Nextflow installed.
Conda will also determine any needed dependencies and install them automatically.
Step 3: Activate the environment
conda activate nextflow_env
This will activate the environment and make the environment with Nextflow installed
available to you. To the left of your terminal prompt, you should see the name of the
environment you are using (e.g. nextflow_env
).
Basic Conda Workflow on the SCC (use this one)
- Load the miniconda module
- Activate your environment with Nextflow installed
module load miniconda
conda activate <name of nextflow environment>
Once you’ve done this, you will simply run Nextflow and as we will discuss later, it will handle the rest for you.
Conda conventions for this class
In this class, you will always define a conda environment by creating a YML file. This practice will make it transparent exactly what packages and versions were used in each analysis. The YML files essentially contain some of the same information seen in the above conda create command.
You can see an example of a YML file here:
name: nextflow_env
channels:
- conda-forge
- bioconda
packages:
- nextflow=24.04.6
You can then create the environment using the following command:
conda env create -f nextflow_env.yml
This will create an environment just as the above command would, but with the added
benefit of transparency and documentation of the exact command run. You will
also notice that we have specified a version of Nextflow to install using the =
.
We will always be specifying the exact version of the package we want to install.
You can see the same information in this YML file as we saw in the conda create command
and if you needed, you could manually activate this environment after it’s made just
as you would with the basic conda workflow.
We will store these YML files in our GitHub repositories which allows anyone to see what environments were used in our analysis and would enable them to easily recreate these environments by running the same command we used.
Conda and Nextflow
Most workflow managers allow you to define separate environments for each of your processes. Nextflow in particular allows you to define conda environments with YML files and specify a different conda environment for each of your processes. Nextflow will automatically handle building the environment and activating it appropriately for each process.
Whenever you sign on to work on the SCC, you will need to load the miniconda module and activate your single conda environment that only has Nextflow installed.
For your pipelines, you will define YML files that describe your conda environments
for each task and refer to them in the nextflow process. When Nextflow is invoked
with the -profile conda,<profile_name>
flag, it will automatically take care
of the rest if you’ve specified conda environments in your processes.
Conda Commands
Create an environment from a YML file
conda env create -f <env_name>.yml
Create a new environment
conda create -n <env_name> -c conda-forge -c bioconda <package_name>
List all environments
conda env list
Remove an environment
conda env remove -n <env_name> --all
List packages in an environment (environment must be activated)
conda list
Activate an environment
conda activate <env_name>
Deactivate an environment
conda deactivate
Search for available packages
conda search -c conda-forge -c bioconda <package_name>
You will often use this command to see what versions of a package are available, this is also the command that will show you which version you want to specify, which will typically be the latest.
You may also find information on software package versions by looking at the tool’s documentation (often on GitHub or the tool’s website).