Project 2: Transcriptional Profile of Mammalian Cardiac Regeneration with mRNA-Seq

High-throughput sequencing technologies have helped researchers expand their understanding of genome-wide gene expression through their ability to detect potentially any mRNA molecule in a sample, rather than measuring only molecules with specific sequences as with microarrays. This agnostic approach to molecular profiling allows us to ask different and somewhat more complex questions of a sequencing dataset than a microarray dataset, (e.g. novel splice patterns, lincRNAs, coding mutations, etc), but the most basic information, mRNA abundance, is analogous to microarray expression measurements and extremely useful. In this project, you will download, QC, process, and analyze sequencing data that was generated to better understand how neonatal mice are able to regenerate their heart tissue but lose this ability later in development.

Upon completion of assignment 2, students will be able to do the following:

  • Download and extract sequencing reads from publicly available mRNA-Seq datasets
  • Understand the most common short read sequencing file formats and which available tools are useful in analyzing sequencing data
  • Align and normalize sequencing reads to an appropriate genome using the tuxedo suite of short read mapping tools
  • Compute and evaluate quality control metrics for sequencing datasets
  • Examine differential expression identified with cuffdiff and interpret the results using DAVID

Contents:

1. Read the paper and create a workflow diagram for your role

Everyone

O’Meara et al. Transcriptional Reversion of Cardiac Myocyte Fate During Mammalian Cardiac Regeneration. Circ Res. Feb 2015. PMID: 25477501

In project 1, we gave you a simple, conceptual workflow diagram that outlined the major steps of the analysis and their dependencies (order). For this project, we ask that you generate one of your own that will be included in the written report. Individually, read through your assigned role and create a diagram for your section. For your report, you will work as a group to connect all of your sections’ diagrams into one overarching project workflow. These diagrams should include all major steps of the analysis and be linked in a manner that demonstrates their dependencies on other steps. Keep these diagrams simple with boxes representing what analysis / step is occurring and arrows or lines showing the dependencies between steps. Do not worry about making these diagrams aesthetically perfect, you may quickly make such diagrams in PowerPoint or any other appropriate software of your choosing. You only need to include the full workflow diagram detailing the entire project in your report.

It might be helpful when you are first starting to individually make diagrams that include all of the steps for your role in order to understand the process you will be following (even the non-necessary steps). Then, when you are working in a group to combine them, you can come to a consensus as to which steps are superfluous for the sake of reproducing the results and which are essential.

Please note that there is an inherent level of subjectivity in the generation of these diagrams. We are not looking for any one specific diagram / diagram structure. We only care that you represent the major steps of the analysis and that you connect them in a way that illustrates the order in which they must be performed. Our intent with the generation of these diagrams is to help you visualize and understand all of the key steps that go into a particular analysis and how your role connects and depends on the others.

2. Data acquisition & transfer to remote server

Lead role: Data Curator

As in the last study, all but one of the samples for this study have been downloaded and prepared for you. You will download the one remaining sample and process it.

  1. Download the sample GSM1570702 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1570702>_ (vP0_1) from GEO Series GSE64403 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64403>_ to one of your directories on SCC. The filename you should download is SRR1727914.sra and it is 1.1Gb. The file is in SRA (short read archive) format.
  2. Use the command module load sratoolkit to make the SRA tools available. Read the SRAtoolkit documentation <https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc>_ to find the correct command to use to extract the SRA format to FASTQ. This is a paired-end sequencing run, so you will have two FASTQ files. You may find it convenient to rename the SRA file to P0_1.sra before converting to FASTQ. Create a file run_extract.qsub, and put all code for extracting the FASTQ files, including the module command from above, into it and submit it as a batch job using the qsub command. You should use the file /project/bf528/project_2/scripts/qsub_skel.qsub as a template.
  3. Once you have extracted the short reads to FASTQ format, inspect the files with head and ensure the format is what you expect. Also ensure that the text in the header fields of the records are exactly the same between the two files, i.e. the first read in the first FASTQ file should be mated with the first read in the second file.

3. Quality control FASTQ files

Lead role: Data Curator

Now that we have FASTQ files, we will begin processing them and extracting quality measures. We will use the FastQC package available on SCC to do this.

  1. Use the command module load fastqc on the cluster to make the FastQC tool available on the command line.
  2. By default, FastQC is run with a graphical user interface (GUI), but we are going to run it in command line mode. Do this by simply running the fastqc command with the FASTQ files you extracted in the previous step as arguments. You may run fastqc -h to output how to use the fastqc command. You may find the -o|--outdir command line argument helpful.
  3. The tool will create a number of html and image files of the statistics. You can look at the html files if you choose using head, cat, or your favorite command line text editor (nano, emacs, vim, etc). You will not be able to view the images directly on the cluster, so you must download them. Download both the images and the html files to your desktop/laptop computer using a compatible SFTP application (e.g. WinSCP on Windows, RBrowser or Cyberduck on Mac).
  4. Examine the output of FastQC by opening the html files in a web browser. Observe if any of the quality metrics failed.
  5. Report the results of the metrics read quality, GC content bias, sequence duplication levels, and any over-represented sequences you think might indicate a problem with the data in your report.

Deliverables:

  • A summary of the quality of the dataset as assessed by fastqc
  • A description of any anomalies, or questions about data quality
  • Any plots that support the above

4. Aligning and QA using tophat and RSeQC

Lead Role: Programmer

Next we will align the reads to the mouse genome reference called mm9. The reference files are available on the cluster at /project/bf528/project_2/reference. Included in this directory are the mm9 FASTA file (mm9.fa) as well its Bowtie2 indexes. Algorithms that align millions of reads onto a genome always require indexes of the reference sequence. The developers of TopHat host several of the most commonly used indexes which can be found here (http://ccb.jhu.edu/software/tophat/igenomes.shtml). For your convenience, we’ve downloaded the mm9 indexes you’ll need to run TopHat. They are available in the same folder listed above.

To help you get started writing your code, we have downloaded the fastq files from the other timecourse samples in this study under /project/bf528/project_2/data/samples/. Pick one of the samples and use the FASTQ files when writing your alignment and RSeQC scripts. The samples from P0 can then be swapped in later as they become ready.

  1. Use the command module load samtools bowtie2 boost tophat to load the alignment utilities we will need. samtools and boost are dependencies of TopHat. We will use samtools in subsequent steps. Create a file run_tophat.qsub and put all code for running tophat, including the module command from above, into it. You should use the file /project/bf528/project_2/scripts/qsub_skel.qsub as a template.

  2. Construct a command that will align the two FASTQ files you extracted in part 2 against the mm9 reference using TopHat. Running tophat without any arguments will print out how to use it. The arguments in the paper are not formatted well, copy the arguments from /project/bf528/project_2/scripts/tophat_args.txt into your tophat command.

  3. TopHat is a program that requires a substantial amount of memory to run. As a result, it is usually necessary to run TopHat as a batch job on the cluster rather than running it interactively. Once you have written your tophat command into the qsub file, submit the TopHat job using qsub run_tophat.qsub and wait for the alignment to complete. To check the status of your job, issue the command qstat -u your_user_name. For more information about submitting jobs to the queue, please see the SCC documentation page. NB: tophat may take more than an hour to finish.

  4. A file named P0_1_tophat/accepted_hits.bam will be created when TopHat has successfully run to completion. The file is in BAM format, which is a binary version of the SAM (Sequence Alignment/Map) format, and contains all of the original reads plus any alignments discovered by TopHat. Run the command samtools flagstat P0_1_tophat/accepted_hits.bam in the directory where you ran TopHat and examine the results. Report these results in your writeup.

  5. Use the command module load python samtools rseqc to make the RseQC utilities available. You will first need to index the BAM file using samtools index accepted_hits.bam before you can run these commands. After the BAM file has been indexed, and the file accepted_hits.bam.bai exists, run the following first without arguments to see their usage and then with accepted_hits.bam as input:

    geneBody_coverage.py
    inner_distance.py
    bam_stat.py

    Each of these utilities outputs a different quality control metric, and two create plot images which must be downloaded to view. Examine the metrics output by these three tools and explain and interpret them in your report. When a tool requires a bed file as input, use /project/bf528/project_2/reference/annot/mm9.bed.

Deliverables:

  • A report of the total number of reads, number of mapped, unique, multimapped, and unaligned reads with percentages of total reads for each
  • Plots and interpretation of the RSeQC output

6. Identifying Differentially Expressed Genes Associated with Myocyte Differentiation

Lead Role: Analyst

There were many experiments in this study. We are going to reproduce the comparison of postnatal day 0 (P0) versus Adult from Figure 1B. The output from cuffdiff in 5.4 should have produced the file cuffdiff_out/gene_exp.diff. This file contains the differential expression statistics comparing the two conditions. We will examine this file to interpret the results.

To help you get started writing your code before the results from earlier steps are ready, we have provided a differential expression analysis from P4 vs P7 in the file /project/bf528/project_2/data/P4_vs_P7_cuffdiff_out/gene_exp.diff. You may use this file to develop your code and then swap in the analysis from above when it is ready.

  1. Load the file cuffdiff_out/gene_exp.diff into R. Sort the data frame so that the smallest q_values are at the top (hint: look at the order function). Produce a table of the top ten differentially expressed genes, with their names, FPKM values, log fold change, p-value, and q-value in your report.

  2. Produce a histogram of the log2.foldchange column for all genes with the hist function. Try specifying different values for the breaks argument to control the number of bars in the plot and pick one you like best.

  3. Create a new data frame that contains only the genes where the last column, named significant, is equal to yes. The subset function is useful for this kind of task, for example if df is a data frame with a column named A, then we can write df.sub <- subset(df,A==”x”). Note the nominal \(p\)-value and \(q\)-value that these genes have in your report.

  4. Create a second histogram of the log2 fold change values only for significant genes. What do you notice?

  5. Further subset the significant gene data frame you just created into two separate data frames with only the up- and down-regulated genes using the log2.foldchange column. Include the number of up and down regulated genes in your report.

  6. Using the write function, write out the up- and down- regulated gene names to separate files. The files should have one gene name on each line (leave genes that are separated by commas on the same line as is). Copy these files to your laptop or desktop, as we will be uploading them to a web application to perform gene set analysis.

  7. DAVID Functional Annotation Clustering groups gene sets based on the genes they share. The output of this tool attempts to organize the enriched gene sets into functionally related clusters. The Score attribute of the clusters is equal to the -log10(average p-value) from enriched gene sets. In a web browser, go to https://david.ncifcrf.gov/summary.jsp .

    1. On the left of the page, look for the Upload tab, and upload the up regulated gene list file, or copy and paste the gene list into the first text box.
    2. Choose OFFICIAL_GENE_SYMBOL from the dropdown
    3. Mark the Gene List radio button
    4. Press Submit
    5. On the next page (Annotation Summary Results), pick Mus Musculus from the box on the left and press Select Species.
    6. On the page to the right, click Clear All, then expand the Gene_Ontology group and select GOTERM_BP_FAT, GOTERM_MF_FAT, and GOTERM_CC_FAT boxes
    7. Click the Functional Annotation Clustering button
    8. A window should appear containing a list of enriched GO terms organized into clusters based on functional relatedness
    9. Examine these results, and save them to file if you wish
    10. Follow this same process for the down regulated gene list

Deliverables:

  • A table of the top 10 differentially expressed genes and statistics from 6.1
  • A histogram of the log2 fold changes
  • A report of the number of differentially expressed genes detected as significant from 6.3, and the numbers of up- and down- regulated genes at this significance level
  • Two csv files containing the up- and down- regulated significant genes from 6.6
  • A table summarizing the top cluster results from the DAVID analysis in 6.7

7. Biological Interpretation

Lead role: Biologist - for 4 person groups only

The authors used the FPKM tables to make biological interpretation of their experiment. We will follow their example and analyze the content of the FPKM expression matrices for biological patterns.

To help you get started writing your code, we have provided a FPKM table of just the other 7 samples, not including P0, and a list of DE genes from P4 vs P7 in /project/bf528/project_2/data/fpkm_matrix.csv and /project/bf528/project_2/data/P4_vs_P7_cuffdiff_out/gene_exp.diff. After you have written this code, it should be easy to substitute in the full results with P0 FPKMs and the P0 vs Ad DE genes when they are ready.

  1. Figure 1D contains plots of genes specific to the most prominent GO terms discovered in the analysis. Using your genes.fpkm_tracking FPKM table and those you find in the different sample directories under /project/bf528/project_2/data/samples/, compare the trends you see in your data with those in the paper. I.e. do you observe the same direction and magnitude of effect for the P0 and Ad samples as reported in the plots?
  2. Compare the results you obtained from the DAVID analysis in 6.7 with those reported in the paper. Further develop the table you created for 6.7 to include annotations for those biological pathways that showed overlap with the results reported in the paper. E.g. include a second column for the manuscript scores, or an asterisk indicating which processes are in common.
  3. Create a FPKM matrix of all 8 samples by concatenating the FPKM columns from each of the tracking tables from 7.1 together into a single dataframe. Subset this matrix by at most the top 1k genes found to be differentially expressed between P0 and Ad from 5.4 and create a clustered heatmap like the one you made in project 1. Genes should be along rows and samples along columns, with dendrograms and labels shown. Compare the heatmap you get to the heatmap in Figure 2A.

Deliverables:

  • A table or plot of the fold changes for the genes highlighted in Figure 1D, showing whether your results agree with those in the paper.
  • An augmented table from that in 6 of enriched biological pathways that includes a comparison with the results reported in the paper.
  • A clustered heatmap of FPKM values using at most the top 1000 DE genes found in the P0 vs Ad analysis.

8. Discuss Your Findings

Discuss your findings with your team members and other teams. Some interesting questions to consider:

  1. How well do our numbers of differentially expressed genes agree with theirs? What are some of the potential differences between your analysis and theirs that might have caused different numbers? If you see a difference, is that a concern?
  2. Why did the authors do so many different experiments (in vitro, in vivo, explant, etc)? Since we only compared two of the samples they produced, do we expect to have good agreement between our results? Is the degree to which we see agreement surprising?
  3. Was it easy to verify that you replicated the results from this study, based on the text and figures of the paper? What other information could the authors have included in their publication to aid in assessing this?
  4. A short list of the enriched GO terms found in the study is in Figure 1C. How does the functional enrichment of the differentially expressed genes you found compare to what was reported in the paper?

Assignment Writeup

Refer to the Project Writeup Instructions.