Genomic resources

Mikhail Dozmorov

Virginia Commonwealth University

02-15-2021

High-throughput data repositories

  • GEO: Gene Expression Omnibus

    • Hosts processed array- and sequencing-based data
  • SRA: Sequence Read Archive

    • Designed for hosting large-scale high-throughput sequencing data, with features such as high-speed file transfer
    • Data are required to be deposited in one of these databases when a paper is accepted
  • ArrayExpress: European version of GEO

    • Better curated than GEO but has less data

Sequence Read Archive (SRA)

  • The NCBI database that stores sequence data obtained from next-generation sequencing (NGS) technologies

    • Archives raw NGS data for various organisms from several platforms (FASTQ files)
    • Serves as a starting point for “secondary analyses”
    • Provides access to data from human clinical samples to authorized users who agree to the datasets’ privacy and usage mandates
  • Search metadata to locate the sequence reads for download and further downstream analyses
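
One way to search SRA metadata programmatically is through the NCBI E-utilities `esearch` endpoint. The slides do not mention E-utilities, so treat the sketch below as an assumption about one possible route rather than the recommended workflow; the function name is hypothetical, and the code only builds the query URL without contacting NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint (an assumption; not named on the slide)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(term, retmax=20):
    """Return an esearch URL that searches the SRA database for `term`."""
    params = {
        "db": "sra",        # search the Sequence Read Archive
        "term": term,       # free-text or accession query
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # SRP101962 is the study accession that appears in the download example
    print(build_sra_search_url("SRP101962"))
```

Fetching the printed URL returns record IDs matching the query, which can then be used to look up run accessions for download.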

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi

https://www.ncbi.nlm.nih.gov/sra/

Getting data from SRA

The NCBI SRA Toolkit (sratoolkit) provides command-line tools for downloading and converting SRA data, in addition to tools for running local BLAST searches directly against specific .sra files

  • fastq-dump: Convert SRA data into fastq format

  • prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data

  • sam-dump: Convert SRA data to sam format

  • sra-pileup: Generate pileup statistics on aligned SRA data

  • vdb-config: Display and modify VDB configuration information

  • vdb-decrypt: Decrypt non-SRA dbGaP data ("phenotype data")

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Getting data from SRA

.sra files are NOT FASTQ files; they must be converted with sratoolkit

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP101/SRP101962/SRR5346141/SRR5346141.sra
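# --split-files writes each read of a paired-end spot to its own file;
# -I (--readids) appends the read ID to the spot ID on each FASTQ header line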
sratoolkit.2.8.1-win64/bin/fastq-dump -I --split-files SRR5346141
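
Since `prefetch` and `fastq-dump` both accept a run accession, the download-and-convert steps can be scripted. The sketch below is a minimal illustration, assuming both tools are on the `PATH`; the function name is hypothetical, and the code only assembles the command lines so they can be inspected before anything is run.

```python
import shlex
from typing import List


def sra_to_fastq_commands(accession: str, paired: bool = True) -> List[List[str]]:
    """Return the prefetch and fastq-dump command lines for one SRA run."""
    download = ["prefetch", accession]
    convert = ["fastq-dump"]
    if paired:
        # --split-files writes each read of a spot to its own FASTQ file
        convert.append("--split-files")
    convert.append(accession)
    return [download, convert]


if __name__ == "__main__":
    for cmd in sra_to_fastq_commands("SRR5346141"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems if the commands are later passed to `subprocess.run`.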

https://www.ncbi.nlm.nih.gov/books/NBK47528/

Long reads

Bacterial and eukaryotic genomes available from PacBio DevNet

https://github.com/PacificBiosciences/DevNet/wiki/Datasets

Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin C-S, Rapicavoli NA, Rank DR, Li J, et al. 2014. Long-read, whole-genome shotgun sequence data for five model organisms. Sci Data 1: 140045.

UCSC Genome Browser

  • The UCSC genome browser is a graphical viewer for visualizing genome annotations

  • Initially developed by Jim Kent in 2000, when he was a Ph.D. student in biology

  • Hosts genomic annotation data for many species

  • Provides other tools for genomic data analysis and interfaces for querying the database

http://genome.ucsc.edu/

https://genome.ucsc.edu/FAQ/FAQgenes.html

UCSC Genome Browser Track Hubs

  • Track hubs are web-accessible (HTTP or FTP) directories of genomic data that can be viewed on the UCSC Genome Browser

  • Tracks can be aggregated using a text document in the UCSC Genome Browser track hub format

    • Advantage: Can be easily distributed to collaborators / users of your resources
    • Disadvantage: Need to generate this text document

http://genome.ucsc.edu/goldenpath/help/hgTrackHubHelp.html

Small track hub example

Minimum set of track description fields:

  • track - Symbolic name of the track

  • type - One of the supported formats

    • bigWig, bigBed, bigGenePred, bam, vcfTabix ...
  • bigDataUrl - Web location (URL) of the data file

  • shortLabel - Short track description (Max 17 characters)

  • longLabel - Longer track description (displayed over tracks in the browser)

Small track hub example

track McGill_MS000101_monocyte_RNASeq_signal_forward
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_forward

track McGill_MS000101_monocyte_RNASeq_signal_reverse
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_reverse.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_reverse
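
Writing track descriptions by hand is the main cost of a track hub, so generating them can help. The sketch below is one possible approach, not part of the UCSC tooling: the function names, the example track, and the `example.org` URL are hypothetical. It renders the five minimum fields listed earlier and checks the 17-character `shortLabel` limit.

```python
REQUIRED_FIELDS = ("track", "type", "bigDataUrl", "shortLabel", "longLabel")
MAX_SHORT_LABEL = 17  # limit quoted in the field list on the earlier slide


def render_stanza(fields):
    """Render one track description from a dict of the minimum fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if len(fields["shortLabel"]) > MAX_SHORT_LABEL:
        raise ValueError(f"shortLabel exceeds {MAX_SHORT_LABEL} characters")
    return "\n".join(f"{name} {fields[name]}" for name in REQUIRED_FIELDS)


def render_trackdb(tracks):
    """Join stanzas with a blank line, which separates tracks in the file."""
    return "\n\n".join(render_stanza(t) for t in tracks) + "\n"


if __name__ == "__main__":
    # Hypothetical track; replace the URL with the location of a real bigWig
    demo = {
        "track": "demo_signal_forward",
        "type": "bigWig",
        "bigDataUrl": "http://example.org/demo.signal_forward.bigWig",
        "shortLabel": "demo.fwd",
        "longLabel": "Demo sample | RNA-Seq | signal_forward",
    }
    print(render_trackdb([demo]))
```

Failing early on a missing field or an overlong `shortLabel` is a design choice here; it surfaces problems before the hub is loaded in a browser.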

WashU Epigenome Browser

  • Visualizing (Epi)Genomics Data

  • Includes Roadmap Epigenome data

  • Supports many track types included in the UCSC Browser

  • Can also load UCSC track hub documents

https://epigenomegateway.wustl.edu/

Other genome browsers/databases

General

Species-specific genome browser

High-throughput data repositories

  • TCGA (The Cancer Genome Atlas) data portal, https://cancergenome.nih.gov/

    • Hosts data generated by TCGA, a large consortium studying cancer genomics
    • Huge collection of cancer-related data: genomic, genetic, and clinical data for many types of cancer
  • ENCODE (the ENCyclopedia Of DNA Elements) data coordination center, http://genome.ucsc.edu/ENCODE/

    • Hosts data generated by ENCODE, a large consortium studying functional elements of the human genome
    • Rich collection of genomic and epigenomic data

Connectivity Map

  • Connectivity Map - a collection of gene expression data from human cells treated with bioactive small molecules. More than 7,000 expression profiles representing 1,309 compounds

  • CLUE Connectivity Map - >3M gene expression profiles and >1M replicate-collapsed signatures

API access, https://clue.io/api

Many analytical tools, http://lincsproject.org/

Query your up/downregulated genes, https://clue.io/l1000-query

Subramanian, Aravind, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, et al. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.” Cell (November 2017)

RECOUNT2 - A multi-experiment resource of RNA-seq gene and exon count datasets

  • Uniformly processed (Rail-RNA) gene and exon counts
  • Signal coverage in bigWig format
  • Phenotype data
  • RangedSummarizedExperiment R objects

https://jhubiostatistics.shinyapps.io/recount/, https://bioconductor.org/packages/recount/

Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. “Reproducible RNA-Seq Analysis Using Recount2.” Nature Biotechnology (April 11, 2017)

ARCHS4

  • A web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene count level
  • All available FASTQ files from RNA-seq experiments were retrieved from the Gene Expression Omnibus (GEO) and aligned using a cloud-based infrastructure.
  • 72,363 mouse and 65,429 human samples. Processed data in HDF5 format
  • Gene-centric exploratory analysis of average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression

https://maayanlab.cloud/archs4/

Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hyojin J. Lee, Moshe C. Silverstein, Lily Wang, and Avi Ma’ayan. “Massive Mining of Publicly Available RNA-Seq Data from Human and Mouse.” BioRxiv, January 1, 2017

ExperimentHub

  • ExperimentHub provides a central location where curated data from experiments, publications or training courses can be accessed

  • Each resource has associated metadata, tags and date of modification

  • The R package client creates and manages a local cache of retrieved files, enabling quick and reproducible access

  • Usage similar to AnnotationHub

https://bioconductor.org/packages/ExperimentHub/

Visualization: Integrative Genomics Viewer (IGV)

http://software.broadinstitute.org/software/igv/

Visualization: Integrative Genomics Viewer (IGV)

Features

  • Explore large genomic datasets with an intuitive, easy-to-use interface

  • Integrate multiple data types with clinical and other sample information

  • View data from multiple sources:

    • local, remote, and "cloud-based"
    • Intelligent remote file handling - no need to download the whole dataset
  • Automation of specific tasks using a command-line interface
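
Automation in IGV is typically done with batch scripts, which are plain-text files with one command per line. The sketch below generates such a script for taking snapshots at a list of loci; the function name and the example file names are hypothetical, and the genome identifier must match one that your IGV installation knows about.

```python
def igv_batch_script(genome, files, loci, snapshot_dir):
    """Return the text of an IGV batch script that snapshots each locus."""
    lines = ["new", f"genome {genome}"]
    # Load every data file (BAM, bigWig, VCF, ...) into the session
    lines += [f"load {path}" for path in files]
    lines.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        lines.append(f"goto {locus}")
        # Name each image after its locus, avoiding ':' and '-' in file names
        name = locus.replace(":", "_").replace("-", "_")
        lines.append(f"snapshot {name}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inputs for illustration
    print(igv_batch_script("hg38", ["sample.bam"], ["chr1:1000-2000"], "snapshots"))
```

The generated text can be saved to a file and run from within IGV via Tools > Run Batch Script.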

Tutorial: https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial

Gviz R package

  • Plotting data and annotation information along genomic coordinates
  • Track-oriented

https://bioconductor.org/packages/Gviz/

epivizR R package

  • D3-based interactive visualization tool for functional genomics data.

  • Multiple visualizations using scatterplots, heatmaps and other user-supplied visualizations.

  • Includes data from the Gene Expression Barcode project for transcriptome visualization.

http://epiviz.cbcb.umd.edu/

https://epiviz.github.io/

karyoploteR

  • karyoploteR is an R package to create karyoplots, that is, representations of whole genomes with arbitrary data plotted on them

Gel, Bernat, and Eduard Serra. “KaryoploteR: An R/Bioconductor Package to Plot Customizable Genomes Displaying Arbitrary Data.” Bioinformatics 33, no. 19 (October 1, 2017): 3088–90. https://doi.org/10.1093/bioinformatics/btx346.

https://bioconductor.org/packages/karyoploteR/

https://bernatgel.github.io/karyoploter_tutorial/

Other visualization tools

Review of omics data visualization tools, summary table: Schroeder, Michael P., Abel Gonzalez-Perez, and Nuria Lopez-Bigas. “Visualizing Multidimensional Cancer Genomics Data.” Genome Medicine (2013)

GIVE (Genomic Interaction Visualization Engine) - an open source programming library that allows anyone with HTML programming experience to build custom genome browser websites or apps

Cao, Xiaoyi, Zhangming Yan, Qiuyang Wu, Alvin Zheng, and Sheng Zhong. “Building a Genome Browser with GIVE.” BioRxiv, January 1, 2018. https://zhong-lab-ucsd.github.io/GIVE_homepage/

Galaxy

  • Web-based framework offering a user-friendly interface to most popular bioinformatics tools

    • "Data intensive biology for everyone"
  • Allows for reproducible results

    • Steps / parameters kept in history
  • Ability to design custom pipelines and import others’

    • All through a user-friendly GUI
  • Best suited to small- and medium-scale projects with a modest number of samples

https://usegalaxy.org/

Other resources

Side-by-side comparison of many resources https://docs.google.com/spreadsheets/d/1o8iYwYUy0V7IECmu21Und3XALwQihioj23WGv-w0itk/pubhtml

Summarized data sets, services and resources

Langmead, Ben, and Abhinav Nellore. “Cloud Computing for Genomic Data Analysis and Collaboration.” Nature Reviews Genetics, January 30, 2018.
