Final Project - single cell RNAseq
By now, we will have gone through the basic workflow for a single cell dataset in-class. For the final project, you will be processing single cell RNAseq data from a published experiment on your own and focusing on some of the more common biological analyses performed on scRNAseq.
Your workflow will need to align the reads to an appropriate reference genome. We have provided you with a docker image containing the latest version of cellranger, but you may also explore other valid alignment strategies (STARsolo, etc.)
Once you have a successfully working pipeline, you will need to perform a basic analysis of the data in either Seurat or Scanpy. These directions will only give you a basic scaffold of what you need to do and the rest you will need to implement on your own. You will be asked to produce certain figures / plots and write a short discussion on what is shown, your interpretation of the findings and how it affects how you will proceed with your analysis. Please create a single jupyter notebook or Rmarkdown that contains both the required plots and answers to the discussion questions.
Preprocessing
The dataset consists of multiple samples from the same experiment, ensure that you perform these QC metrics for each sample. You may create separate plots for each or create integrated plots with all of the samples.
Figures
A plot containing the number of unique genes detected per barcode, the total number of molecules per barcode, and the percentage of reads that map to the mitochondrial genome.
An integrated plot that visualizes the above metrics jointly.
A table that displays how many cells / genes are present in your dataset before and after your filtering thresholds are implemented
Discussion
In your discussion, make sure to address the following points / questions:
- What filtering thresholds did you choose and how did you decide on them?
- How many cells / genes are present before and after implementing your filtering thresholds?
- What are some potential strategies to set thresholds that don’t rely on manual inspection and “eyeing” of plots?
Count Normalization
Discussion
Choose a method to normalize the scRNAseq data and ensure that you explain the exact normalization procedure.
Feature Selection
Choose an appropriate method for feature selection and the number of highly variable features you will use in downstream analyses.
Figures
- Create a plot of the top ten most variable features by your chosen metric.
Discussion
Explain the method used to determine highly variable features and report how many variable features you chose to use for downstream dimensional reduction.
Include a brief statement on how many genes meet your chosen threshold to be considered highly variable and how many are not considered
PCA
Perform dimensional reduction using PCA on your highly variable features.
Figures
- Create a plot that justifies your choice of how many PCs to utilize in downstream analyses
Discussion
- Justify your choice in writing and briefly remark on your plot.
Clustering and visualization
Decide on an appropriate method to cluster your cells and create a visualization of the 2D embedding
Figures
Create a plot visualizing the clustering results of your cells. Ensure that it has labels that identify the different clusters.
Create a plot visualizing the clustering results of your cells. Ensure that it has labels that identify which samples each cell originated from.
Discussion
Write a brief caption describing the plot visualizing the clustering of your cells
Use the second plot you created and briefly remark on whether you will perform integration.
Integration (optional)
If you determined from your clustering that the dataset would benefit from integration, please choose an appropriate method and apply it to your dataset.
Figures
- Create a plot visualizing the clustering results of your cells before and after integration. Ensure that it has labels that identify which samples each cell originated from.
Discussion
- Briefly remark on the plots and the effects you observe due to the integration
Marker Gene Analysis
Perform marker gene analysis using a method of your choice.
Figures
- Create a table listing the top ten marker genes for each of your clusters
Discussion
- Briefly describe the method performed to identify marker genes. Discuss a few advantages and disadvantages of the method used to perform marker gene analysis.
Automatic Annotation of Cell labels
Use an annotation algorithm to assign preliminary labels for the cell identities of your clusters. You may use your choice of tool as long as it has a documented code base hosted publicly and has an associated publication describing its use.
Figures
- Create a visualization of your cell clustering with the labels assigned by the results of running the labeling tool of your choice.
Discussion
Briefly remark on how the algorithm works and ensure you cite the original publication
Comment on the cell identities of your clusters and any speculation as to the original source tissue of the samples based on what you see
Manual Cluster Labeling
Perform literature searches using the marker gene analysis you performed before. In combination with the results from your cell annotation algorithm, provide a final determination of the cell identities for each of your clusters.
Figures
Create a single plot / figure that displays the top marker genes for every cluster in your dataset
For the top three clusters with the most cells, create individual plots of at least 5 of the defining marker genes for that cluster and their expression compared across all clusters
Create a visualization / plot of your clusters with your manual annotations and labels of cell identities.
Discussion
- Write a paragraph describing your cell annotations and explaining what cell label you have assigned to each cluster. Ensure that you have at least one citation that supports your labeling and choice of marker genes for each cluster.
Pseudobulk Analysis
As we’ve discussed, cells from the same sample are often more similar to each other than the same cell type from other samples. It has been empirically determined that commonly used statistical methodology from bulk RNAseq will often lead to inflated type I error rates when attempting to perform differential expression across populations of cells or samples without modification in scRNAseq.
Perform a pseudobulk analysis differential expression analysis between the WT and KO samples in the top three clusters with the most cells.
Figures
- Create a figure that displays the top results of your pseudobulk analysis for each of the three clusters.
Discussion
- Briefly remark on the changes in gene expression you observe in these cell populations between the WT and KO samples
Proportions of cell populations
This dataset only has X samples, but we can visualize the proportions of each cell cluster in all of our samples and look for potential trends.
Figures
- Create a plot that displays the proportion of each of your cell clusters across all of your samples
Discussion
- Briefly discuss what your plot shows, and what you might do further in the future if you wanted to be more confident in drawing conclusions about potential changes in cell proportions across your conditions.
Analysis of your choice
For the last part of the project, you may choose one additional analysis to perform on the data. You may choose one of the following analyses:
- Pseudotime
- RNA velocity
- Cell signaling
- Cell composition
You may choose a tool or algorithm with the same constraints as before (it must have a documented code base and an associated publication).
Figures
- Create a figure or figure(s) that display the most pertinent results from the analysis
Discussion
- Briefly remark on the results and why they may be interesting for this dataset.
Final Project Grading
The requirements for the final project are:
- A working nextflow pipeline that uses all of the principles and strategies discussed in class
- A single jupyter notebook or Rmarkdown that performs the above tasks, creates the required plots / visualizations and addresses the listed questions
The responses to the questions should be written in the style of a scientific publication, but you will be graded on the content. Each question can be answered in as little as a paragraph so we encourage you to be brief.