Project 2: Transcriptional Profile of Mammalian Cardiac Regeneration with mRNA-Seq
High-throughput sequencing technologies have helped researchers expand their understanding of genome-wide gene expression through their ability to detect potentially any mRNA molecule in a sample, rather than measuring only molecules with specific sequences as with microarrays. This agnostic approach to molecular profiling allows us to ask different and somewhat more complex questions of a sequencing dataset than a microarray dataset, (e.g. novel splice patterns, lincRNAs, coding mutations, etc), but the most basic information, mRNA abundance, is analogous to microarray expression measurements and extremely useful. In this project, you will download, QC, process, and analyze sequencing data that was generated to better understand how neonatal mice are able to regenerate their heart tissue but lose this ability later in development.
Upon completion of assignment 2, students will be able to do the following:
- Download and extract sequencing reads from publicly available mRNA-Seq datasets
- Understand the most common short read sequencing file formats and which available tools are useful in analyzing sequencing data
- Align and normalize sequencing reads to an appropriate genome using the tuxedo suite of short read mapping tools
- Compute and evaluate quality control metrics for sequencing datasets
- Examine differential expression identified with cuffdiff and interpret the results using DAVID
Contents:
- 1. Read the paper and create a workflow diagram for your role
- 2. Data acquisition & transfer to remote server
- 3. Quality control FASTQ files
- 4. Aligning and QA using tophat and RSeQC
- 5. Quantifying gene expression with cufflinks
- 6. Identifying Differentially Expressed Genes Associated with Myocyte Differentiation
- 7. Biological Interpretation
- 8. Discuss Your Findings
- Assignment Writeup
1. Read the paper and create a workflow diagram for your role
Everyone
In project 1, we gave you a simple, conceptual workflow diagram that outlined the major steps of the analysis and their dependencies (order). For this project, we ask that you generate one of your own that will be included in the written report. Individually, read through your assigned role and create a diagram for your section. For your report, you will work as a group to connect all of your sections’ diagrams into one overarching project workflow. These diagrams should include all major steps of the analysis and be linked in a manner that demonstrates their dependencies on other steps. Keep these diagrams simple with boxes representing what analysis / step is occurring and arrows or lines showing the dependencies between steps. Do not worry about making these diagrams aesthetically perfect, you may quickly make such diagrams in PowerPoint or any other appropriate software of your choosing. You only need to include the full workflow diagram detailing the entire project in your report.
It might be helpful when you are first starting to individually make diagrams that include all of the steps for your role in order to understand the process you will be following (even the non-necessary steps). Then, when you are working in a group to combine them, you can come to a consensus as to which steps are superfluous for the sake of reproducing the results and which are essential.
Please note that there is an inherent level of subjectivity in the generation of these diagrams. We are not looking for any one specific diagram / diagram structure. We only care that you represent the major steps of the analysis and that you connect them in a way that illustrates the order in which they must be performed. Our intent with the generation of these diagrams is to help you visualize and understand all of the key steps that go into a particular analysis and how your role connects and depends on the others.
2. Data acquisition & transfer to remote server
Lead role: Data Curator
As in the last study, all but one of the samples for this study have been downloaded and prepared for you. You will download the one remaining sample and process it.
- Download the sample
GSM1570702 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1570702>
_ (vP0_1) from GEOSeries GSE64403 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64403>
_ to one of your directories on SCC. The filename you should download is SRR1727914.sra and it is 1.1Gb. The file is in SRA (short read archive) format. - Use the command
module load sratoolkit
to make the SRA tools available. Read theSRAtoolkit documentation <https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc>
_ to find the correct command to use to extract the SRA format to FASTQ. This is a paired-end sequencing run, so you will have two FASTQ files. You may find it convenient to rename the SRA file toP0_1.sra
before converting to FASTQ. Create a filerun_extract.qsub
, and put all code for extracting the FASTQ files, including the module command from above, into it and submit it as a batch job using theqsub
command. You should use the file/project/bf528/project_2/scripts/qsub_skel.qsub
as a template. - Once you have extracted the short reads to FASTQ format, inspect the files
with
head
and ensure the format is what you expect. Also ensure that the text in the header fields of the records are exactly the same between the two files, i.e. the first read in the first FASTQ file should be mated with the first read in the second file.
3. Quality control FASTQ files
Lead role: Data Curator
Now that we have FASTQ files, we will begin processing them and extracting quality measures. We will use the FastQC package available on SCC to do this.
- Use the command
module load fastqc
on the cluster to make the FastQC tool available on the command line. - By default, FastQC is run with a graphical user interface (GUI), but we are
going to run it in command line mode. Do this by simply running the
fastqc
command with the FASTQ files you extracted in the previous step as arguments. You may runfastqc -h
to output how to use the fastqc command. You may find the-o|--outdir
command line argument helpful. - The tool will create a number of html and image files of the statistics.
You can look at the html files if you choose using
head
,cat
, or your favorite command line text editor (nano
,emacs
,vim
, etc). You will not be able to view the images directly on the cluster, so you must download them. Download both the images and the html files to your desktop/laptop computer using a compatible SFTP application (e.g. WinSCP on Windows, RBrowser or Cyberduck on Mac). - Examine the output of FastQC by opening the html files in a web browser. Observe if any of the quality metrics failed.
- Report the results of the metrics read quality, GC content bias, sequence duplication levels, and any over-represented sequences you think might indicate a problem with the data in your report.
Deliverables:
- A summary of the quality of the dataset as assessed by fastqc
- A description of any anomalies, or questions about data quality
- Any plots that support the above
4. Aligning and QA using tophat and RSeQC
Lead Role: Programmer
Next we will align the reads to the mouse genome reference called mm9. The
reference files are available on the cluster at
/project/bf528/project_2/reference
. Included in this directory are the mm9
FASTA file (mm9.fa) as well its Bowtie2 indexes. Algorithms that align millions
of reads onto a genome always require indexes of the reference sequence. The
developers of TopHat host several of the most commonly used indexes which can
be found here (http://ccb.jhu.edu/software/tophat/igenomes.shtml). For your
convenience, we’ve downloaded the mm9 indexes you’ll need to run TopHat. They
are available in the same folder listed above.
To help you get started writing your code, we have downloaded the fastq files
from the other timecourse samples in this study under
/project/bf528/project_2/data/samples/
. Pick one of the samples and use the
FASTQ files when writing your alignment and RSeQC scripts. The samples from P0
can then be swapped in later as they become ready.
Use the command
module load samtools bowtie2 boost tophat
to load the alignment utilities we will need.samtools
andboost
are dependencies of TopHat. We will use samtools in subsequent steps. Create a filerun_tophat.qsub
and put all code for running tophat, including the module command from above, into it. You should use the file/project/bf528/project_2/scripts/qsub_skel.qsub
as a template.Construct a command that will align the two FASTQ files you extracted in part 2 against the mm9 reference using TopHat. Running
tophat
without any arguments will print out how to use it. The arguments in the paper are not formatted well, copy the arguments from/project/bf528/project_2/scripts/tophat_args.txt
into your tophat command.TopHat is a program that requires a substantial amount of memory to run. As a result, it is usually necessary to run TopHat as a batch job on the cluster rather than running it interactively. Once you have written your tophat command into the qsub file, submit the TopHat job using
qsub run_tophat.qsub
and wait for the alignment to complete. To check the status of your job, issue the commandqstat -u your_user_name
. For more information about submitting jobs to the queue, please see the SCC documentation page. NB: tophat may take more than an hour to finish.A file named
P0_1_tophat/accepted_hits.bam
will be created when TopHat has successfully run to completion. The file is in BAM format, which is a binary version of the SAM (Sequence Alignment/Map) format, and contains all of the original reads plus any alignments discovered by TopHat. Run the commandsamtools flagstat P0_1_tophat/accepted_hits.bam
in the directory where you ran TopHat and examine the results. Report these results in your writeup.-
Use the command
module load python samtools rseqc
to make the RseQC utilities available. You will first need to index the BAM file usingsamtools index accepted_hits.bam
before you can run these commands. After the BAM file has been indexed, and the fileaccepted_hits.bam.bai
exists, run the following first without arguments to see their usage and then withaccepted_hits.bam
as input:geneBody_coverage.py inner_distance.py bam_stat.py
Each of these utilities outputs a different quality control metric, and two create plot images which must be downloaded to view. Examine the metrics output by these three tools and explain and interpret them in your report. When a tool requires a bed file as input, use
/project/bf528/project_2/reference/annot/mm9.bed
.
Deliverables:
- A report of the total number of reads, number of mapped, unique, multimapped, and unaligned reads with percentages of total reads for each
- Plots and interpretation of the RSeQC output
5. Quantifying gene expression with cufflinks
Lead Role: Programmer
Cufflinks is a tool that counts how reads map to genomic regions defined by an
annotation, often a gene annotation like the one we have provided at
/project/bf528/project_2/reference/annot/mm9.gtf
.
Feel free to use the accepted_hits.bam
file you created above for the
alternate sample to develop your scripts if P0 is not ready yet.
- Run the command
module load cufflinks
to make the program available to run, and runcufflinks
without any arguments to see usage. Create a filerun_cufflinks.qsub
and put all code for running cufflinks, including the module command from above, into it. You should use the file/project/bf528/project_2/scripts/qsub_skel.qsub
as a template. - Construct a command that will run
cufflinks
on the fileP0_1_tophat/accepted_hits.bam
. The arguments in the paper are not formatted well, copy the arguments from/project/bf528/project_2/scripts/cufflinks_args.txt
into your cufflinks command. When you have created the appropriate command, submit your script as a batch job usingqsub run_cufflinks.qsub
. - When
cufflinks
has completed, the fileP0_1_cufflinks/genes.fpkm_tracking
contains the quantified alignments in FPKM for all genes. Load this file into R usingread.table
, and create a histogram of the FPKM values. You may consider removing the genes that have an FPKM value of zero. - You have downloaded and processed one of the four samples involved (each of
P0
andAd
have two replicates). We have prepared the remaining samples (P0_2, Ad_1, Ad_2) for analysis under/project/bf528/project_2/data/samples
. The authors usedcuffdiff
, a tool in the cufflinks suite, to identify differentially expressed genes. We have created a script that runs cuffdiff for you, though you will have to copy and modify it to use the alignments you produced in the previous steps. The qsub script you should copy is/project/bf528/project_2/scripts/run_cuffdiff.qsub
.
Deliverables:
- Histogram of the FPKM values for all genes
- A report on the number of genes you identified in the analysis, including any filtering (e.g. FPKM value thresholds, etc) you used
6. Identifying Differentially Expressed Genes Associated with Myocyte Differentiation
Lead Role: Analyst
There were many experiments in this study. We are going to reproduce the
comparison of postnatal day 0 (P0) versus Adult from Figure 1B. The output
from cuffdiff in 5.4 should have produced the file cuffdiff_out/gene_exp.diff
.
This file contains the differential expression statistics comparing the two
conditions. We will examine this file to interpret the results.
To help you get started writing your code before the results from earlier steps
are ready, we have provided a differential expression analysis from P4 vs P7 in
the file /project/bf528/project_2/data/P4_vs_P7_cuffdiff_out/gene_exp.diff
.
You may use this file to develop your code and then swap in the analysis from
above when it is ready.
Load the file
cuffdiff_out/gene_exp.diff
into R. Sort the data frame so that the smallestq_values
are at the top (hint: look at theorder
function). Produce a table of the top ten differentially expressed genes, with their names, FPKM values, log fold change, p-value, and q-value in your report.Produce a histogram of the
log2.foldchange
column for all genes with thehist
function. Try specifying different values for the breaks argument to control the number of bars in the plot and pick one you like best.Create a new data frame that contains only the genes where the last column, named
significant
, is equal toyes
. Thesubset
function is useful for this kind of task, for example ifdf
is a data frame with a column namedA
, then we can writedf.sub <- subset(df,A==”x”)
. Note the nominal \(p\)-value and \(q\)-value that these genes have in your report.Create a second histogram of the log2 fold change values only for significant genes. What do you notice?
Further subset the significant gene data frame you just created into two separate data frames with only the up- and down-regulated genes using the
log2.foldchange
column. Include the number of up and down regulated genes in your report.Using the
write
function, write out the up- and down- regulated gene names to separate files. The files should have one gene name on each line (leave genes that are separated by commas on the same line as is). Copy these files to your laptop or desktop, as we will be uploading them to a web application to perform gene set analysis.-
DAVID Functional Annotation Clustering groups gene sets based on the genes they share. The output of this tool attempts to organize the enriched gene sets into functionally related clusters. The Score attribute of the clusters is equal to the
-log10(average p-value)
from enriched gene sets. In a web browser, go to https://david.ncifcrf.gov/summary.jsp .- On the left of the page, look for the Upload tab, and upload the up regulated gene list file, or copy and paste the gene list into the first text box.
- Choose OFFICIAL_GENE_SYMBOL from the dropdown
- Mark the Gene List radio button
- Press Submit
- On the next page (Annotation Summary Results), pick Mus Musculus from the box on the left and press Select Species.
- On the page to the right, click Clear All, then expand the Gene_Ontology group and select GOTERM_BP_FAT, GOTERM_MF_FAT, and GOTERM_CC_FAT boxes
- Click the Functional Annotation Clustering button
- A window should appear containing a list of enriched GO terms organized into clusters based on functional relatedness
- Examine these results, and save them to file if you wish
- Follow this same process for the down regulated gene list
Deliverables:
- A table of the top 10 differentially expressed genes and statistics from 6.1
- A histogram of the log2 fold changes
- A report of the number of differentially expressed genes detected as significant from 6.3, and the numbers of up- and down- regulated genes at this significance level
- Two csv files containing the up- and down- regulated significant genes from 6.6
- A table summarizing the top cluster results from the DAVID analysis in 6.7
7. Biological Interpretation
Lead role: Biologist - for 4 person groups only
The authors used the FPKM tables to make biological interpretation of their experiment. We will follow their example and analyze the content of the FPKM expression matrices for biological patterns.
To help you get started writing your code, we have provided a FPKM table of
just the other 7 samples, not including P0, and a list of DE genes from P4
vs P7 in /project/bf528/project_2/data/fpkm_matrix.csv
and
/project/bf528/project_2/data/P4_vs_P7_cuffdiff_out/gene_exp.diff
. After
you have written this code, it should be easy to substitute in the full
results with P0 FPKMs and the P0 vs Ad DE genes when they are ready.
- Figure 1D contains plots of genes specific to the most prominent GO terms
discovered in the analysis. Using your
genes.fpkm_tracking
FPKM table and those you find in the different sample directories under/project/bf528/project_2/data/samples/
, compare the trends you see in your data with those in the paper. I.e. do you observe the same direction and magnitude of effect for the P0 and Ad samples as reported in the plots? - Compare the results you obtained from the DAVID analysis in 6.7 with those reported in the paper. Further develop the table you created for 6.7 to include annotations for those biological pathways that showed overlap with the results reported in the paper. E.g. include a second column for the manuscript scores, or an asterisk indicating which processes are in common.
- Create a FPKM matrix of all 8 samples by concatenating the
FPKM
columns from each of the tracking tables from 7.1 together into a single dataframe. Subset this matrix by at most the top 1k genes found to be differentially expressed between P0 and Ad from 5.4 and create a clustered heatmap like the one you made in project 1. Genes should be along rows and samples along columns, with dendrograms and labels shown. Compare the heatmap you get to the heatmap in Figure 2A.
Deliverables:
- A table or plot of the fold changes for the genes highlighted in Figure 1D, showing whether your results agree with those in the paper.
- An augmented table from that in 6 of enriched biological pathways that includes a comparison with the results reported in the paper.
- A clustered heatmap of FPKM values using at most the top 1000 DE genes found in the P0 vs Ad analysis.
8. Discuss Your Findings
Discuss your findings with your team members and other teams. Some interesting questions to consider:
- How well do our numbers of differentially expressed genes agree with theirs? What are some of the potential differences between your analysis and theirs that might have caused different numbers? If you see a difference, is that a concern?
- Why did the authors do so many different experiments (in vitro, in vivo, explant, etc)? Since we only compared two of the samples they produced, do we expect to have good agreement between our results? Is the degree to which we see agreement surprising?
- Was it easy to verify that you replicated the results from this study, based on the text and figures of the paper? What other information could the authors have included in their publication to aid in assessing this?
- A short list of the enriched GO terms found in the study is in Figure 1C. How does the functional enrichment of the differentially expressed genes you found compare to what was reported in the paper?
Assignment Writeup
Refer to the Project Writeup Instructions.