Final project

Due by 5:00 PM EDT on Monday, May 3, 2021

General description: The purpose of the final project is for you to gain familiarity with the main methods learned in class, applied to experimental sequencing data. Additionally, the project should help solidify your statistical and practical understanding of such methods.

Processing and analysis of RNA-seq data

Dataset selection (due April 12, 2021, 5:00pm EDT)

You must select an RNA-seq dataset to analyze. Prerequisites for a dataset:

  • at least two experimental conditions.
  • at least three samples per condition.
  • human data is preferred, but model organism data are acceptable.
  • cancer-specific data is preferred, but other diseases are acceptable.
  • mRNA (gene expression) is preferred, but different data types (e.g., methylation) may be acceptable after discussion with the instructor.
  • Single-end or paired-end sequencing data are allowed.
  • Download the data (FASTQ files) to your computer.
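The two design prerequisites above (at least two conditions, at least three samples per condition) can be checked mechanically before committing to a dataset. The following is a minimal sketch in Python, not part of the required R workflow; the function name `check_design` and the sample sheet are hypothetical and only illustrate the check.

```python
from collections import Counter


def check_design(sample_conditions, min_conditions=2, min_replicates=3):
    """Return a list of design problems (an empty list means the design passes).

    sample_conditions maps a sample ID to its experimental condition.
    """
    counts = Counter(sample_conditions.values())
    problems = []
    if len(counts) < min_conditions:
        problems.append(
            f"only {len(counts)} condition(s); need at least {min_conditions}"
        )
    for condition, n in sorted(counts.items()):
        if n < min_replicates:
            problems.append(
                f"condition '{condition}' has {n} sample(s); "
                f"need at least {min_replicates}"
            )
    return problems


# Hypothetical sample sheet: three tumor samples but only two normal samples.
samples = {
    "S1": "tumor", "S2": "tumor", "S3": "tumor",
    "S4": "normal", "S5": "normal",
}
for problem in check_design(samples):
    print(problem)
```

An empty result means the dataset meets both design criteria; the remaining prerequisites (organism, disease, data type) still need to be judged by reading the GEO record.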

Search the NCBI Gene Expression Omnibus (GEO) database, using keywords of interest, e.g., “prostate cancer RNA-seq.”

Find and read the paper describing the dataset; it should report differentially expressed genes and functional enrichment analysis results. Aim to re-create the published results. Submit the description of the data (“Introduction/Background” section, see “Reporting requirements” section). Include brief methods for obtaining the data. If working in a team, each team member submits the same data description.

Analysis (due May 3, 2021, 5:00pm EDT)

Perform the following analyses and explain your observations for each.

  • Download raw FASTQ files and describe the process.
  • Perform data quality control using FastQC followed by MultiQC.
  • Perform adapter trimming using Trim Galore and Trimmomatic.
  • Align the data using bowtie2 and STAR aligners.
  • Obtain gene counts using featureCounts and HTSeq-count.
  • Exploratory data analysis: Number of samples per condition, gene expression distribution per sample (boxplots), correlogram (ComplexHeatmap), PCA.
  • Differential expression analysis using edgeR and DESeq2. Make a heatmap (ComplexHeatmap) and a volcano plot (EnhancedVolcano) of differentially expressed genes.
  • Perform functional enrichment analysis (hypergeometric test and GSEA) of differentially expressed genes using Gene Ontology, KEGG, and MSigDB (clusterProfiler). Visualize the two top KEGG pathways (pathview) and overlay differentially expressed genes.
  • Compare the lists of differentially expressed genes and the functional enrichment analysis results obtained from the two pipelines. Use a Venn diagram (VennDetail) and consider the direction of changes (up- or downregulated). Describe the discrepancies.
  • Compare the lists of differentially expressed genes and the functional enrichment analysis results obtained from the two pipelines with those reported in the original paper. Describe the discrepancies.
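To make the hypergeometric enrichment test in the list above concrete, here is a minimal sketch of the underlying arithmetic in Python. It is an illustration only: the project itself should use clusterProfiler in R, and the function name and the example counts below are hypothetical. The p-value is the probability of seeing at least the observed number of differentially expressed genes in a gene set when drawing genes at random from the background.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, de_size, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    universe_size: number of background genes
    set_size:      genes in the pathway or gene set (within the background)
    de_size:       number of differentially expressed (DE) genes
    overlap:       DE genes that belong to the gene set
    """
    upper = min(set_size, de_size)
    total = comb(universe_size, de_size)
    # math.comb returns 0 when k > n, so impossible terms drop out on their own.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, de_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 background genes, a gene set of 4,
# 3 DE genes, 2 of which fall in the gene set.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(p_value)
```

In a real analysis this test is repeated for every gene set, so the resulting p-values need multiple-testing correction, which enrichment packages typically report as adjusted p-values.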

Teamwork: Members of a team will use the same dataset. Each team member will use one sequence of tools (e.g., Trim Galore, bowtie2, featureCounts) that does not overlap with the tools used by the other member. Both team members should explore and understand the functionality of all tools. Each team member will submit their own report. If working individually, use one sequence of tools only and perform a comparison with published results.
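When the two team members compare their differentially expressed gene lists, the direction of change matters as much as membership. A minimal Python sketch of that bookkeeping follows; the function name, the gene symbols, and the 'up'/'down' labels are hypothetical placeholders, and the Venn diagram itself would still be drawn with VennDetail in R.

```python
def compare_de_lists(pipeline_a, pipeline_b):
    """Compare two DE gene lists, each a dict of gene ID -> 'up' or 'down'.

    Returns sets of genes found by only one pipeline, genes found by both
    with the same direction (concordant), and genes found by both with
    opposite directions (discordant).
    """
    genes_a, genes_b = set(pipeline_a), set(pipeline_b)
    shared = genes_a & genes_b
    concordant = {g for g in shared if pipeline_a[g] == pipeline_b[g]}
    return {
        "only_a": genes_a - genes_b,
        "only_b": genes_b - genes_a,
        "concordant": concordant,
        "discordant": shared - concordant,
    }


# Hypothetical results from two tool sequences.
pipeline_a = {"GENE1": "up", "GENE2": "down", "GENE3": "up"}
pipeline_b = {"GENE2": "down", "GENE3": "down", "GENE4": "up"}

comparison = compare_de_lists(pipeline_a, pipeline_b)
for category, genes in comparison.items():
    print(category, sorted(genes))
```

Discordant genes (significant in both pipelines but with opposite signs) are usually the most informative discrepancies to describe in the report.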

Reporting requirements

Your project reports should be written in R Markdown format and compiled as a PDF document. Follow the IMRaD format when describing your project and results. Be concise when describing the results. Embed figures, small tables, and references in the report. Make supplementary figures/tables for large result outputs, if needed. For each part, address the following points:

  • A simple and clear description of the datasets and the research question you are addressing. This should be written in the form of an Introduction/Background section(s).
  • A Methods section.
  • A Results section providing a description of your results. Tables and figures should be numbered and captioned.
    • Data component: Include description/code for obtaining and processing raw data; do not include raw data files. Include processed data/results as text or CSV files that your code chunks will process/generate. Gzip text files.
    • Computational component: code chunks analyzing the data in the Rmd file. Make sure your code is readable (use formatR::tidy_app() or the styler R package) and commented.
  • A Discussion/Conclusion section.
  • References.
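The "Gzip text files" requirement in the data component can be handled in one pass over the project folder. The sketch below is written in Python as an illustration; the function name and the choice of `.txt`, `.csv`, and `.tsv` as the extensions to compress are assumptions, not requirements stated in this handout. It keeps the original files so that nothing is lost if the step is rerun.

```python
import gzip
import shutil
from pathlib import Path


def gzip_text_files(folder, suffixes=(".txt", ".csv", ".tsv")):
    """Write a .gz copy next to every text-like file under folder.

    Returns the list of compressed file paths. Original files are kept.
    """
    compressed = []
    # sorted() materializes the listing before any new .gz files are created.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            target = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            compressed.append(target)
    return compressed
```

Calling `gzip_text_files("results")` on a hypothetical `results` folder would produce, for example, `results/counts.csv.gz` alongside `results/counts.csv`.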

Submitting guidelines

  • Submit the Rmd and PDF files of your report to Blackboard, https://blackboard.vcu.edu
  • Include a compressed project folder with your code and processed data/results in text/CSV format. Do not include raw data; instead, provide description/code for data downloading. The total size of the submission should not exceed 10 MB.
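Before uploading, it can help to check the project folder against the size limit above. The following Python sketch sums file sizes under a folder; the function names are hypothetical, and reading the limit as 10 × 1024 × 1024 bytes is an assumption. Because the limit applies to the submission as a whole, treat the folder total as an upper bound and confirm the size of the final compressed archive as well.

```python
from pathlib import Path

# Assumption: the stated limit is read as 10 * 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024


def folder_size_bytes(folder):
    """Total size in bytes of all regular files under folder."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def within_limit(folder, limit=MAX_BYTES):
    """True if the folder's total file size does not exceed limit."""
    return folder_size_bytes(folder) <= limit
```

For a hypothetical folder named `final_project`, `within_limit("final_project")` returns `False` when large intermediate files (for example, BAM files) were left in by mistake.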

Due date

May 3, 2021, at 5:00pm EDT.