GEO: Gene Expression Omnibus
SRA: Sequence Read Archive
ArrayExpress: European version of GEO
The NCBI database that stores sequence data obtained from next-generation sequencing (NGS) technologies
Search the metadata to locate sequence reads for download and downstream analyses
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi
https://www.ncbi.nlm.nih.gov/sra/
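Metadata searches can also be scripted. The sketch below is one possible approach, not something taken from the slides: it assumes the NCBI E-utilities `esearch` endpoint and only builds the query URL, without sending it. The function name `sra_search_url` and the default `retmax` value are hypothetical choices made for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities esearch endpoint is used for the search.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sra_search_url(term, retmax=20):
    """Build an esearch URL that queries SRA metadata for `term`.

    `sra_search_url` is a hypothetical helper; it only constructs the URL
    and performs no network access.
    """
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example: look up records for the study accession used later in these slides.
url = sra_search_url("SRP101962")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the returned identifiers are internal record IDs that still need to be mapped to run accessions.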
The NCBI SRA Toolkit (sratoolkit) provides command-line tools for downloading and converting SRA data, including two tools for local BLAST searches directly against specific .sra files
fastq-dump: Convert SRA data into fastq format
prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data
sam-dump: Convert SRA data to sam format
sra-pileup: Generate pileup statistics on aligned SRA data
vdb-config: Display and modify VDB configuration information
vdb-decrypt: Decrypt non-SRA dbGaP data ("phenotype data")
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
.sra files are NOT FASTQ files; they must be converted with the sratoolkit
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP101/SRP101962/SRR5346141/SRR5346141.sra
# To split paired-end reads into separate files, use --split-files; -I appends the read number to each read ID
sratoolkit.2.8.1-win64/bin/fastq-dump -I --split-files SRR5346141
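The download-and-convert step can be wrapped in a small script. The following is a minimal sketch, assuming the SRA Toolkit's `prefetch` and `fastq-dump` are on the PATH; it uses `prefetch` in place of the `wget` call as an alternative download route. The helper name `sra_to_fastq_commands` is hypothetical, and the code only builds the command lines rather than running them.

```python
import shlex


def sra_to_fastq_commands(accession, paired=True):
    """Return argument lists for downloading and converting one SRA run.

    Hypothetical helper: it builds the commands but does not execute them.
    """
    prefetch_cmd = ["prefetch", accession]
    dump_cmd = ["fastq-dump"]
    if paired:
        # -I appends the read number to each read ID;
        # --split-files writes each mate to its own FASTQ file.
        dump_cmd += ["-I", "--split-files"]
    dump_cmd.append(accession)
    return [prefetch_cmd, dump_cmd]


commands = sra_to_fastq_commands("SRR5346141")
for cmd in commands:
    print(shlex.join(cmd))
    # To run for real (requires the SRA Toolkit):
    # import subprocess; subprocess.run(cmd, check=True)
```

Keeping the commands as argument lists avoids shell-quoting problems if they are later passed to `subprocess.run`.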
https://www.ncbi.nlm.nih.gov/books/NBK47528/
Bacterial and eukaryotic genomes available from PacBio DevNet
https://github.com/PacificBiosciences/DevNet/wiki/Datasets
Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin C-S, Rapicavoli NA, Rank DR, Li J, et al. 2014. Long-read, whole-genome shotgun sequence data for five model organisms. Sci Data 1: 140045.
The UCSC genome browser is a graphical viewer for visualizing genome annotations
Initially developed by Jim Kent in 2000, when he was a Ph.D. student in Biology
Hosts genomic annotation data for many species
Provides other tools for genomic data analysis and interfaces for querying the database
https://genome.ucsc.edu/FAQ/FAQgenes.html
Track hubs are web-accessible (HTTP or FTP) directories of genomic data that can be viewed on the UCSC Genome Browser
Tracks can be aggregated using a text document in the UCSC Genome Browser track hub format
http://genome.ucsc.edu/goldenpath/help/hgTrackHubHelp.html
Minimum set of track description fields:
track - Symbolic name of the track
type - One of the supported formats
bigDataUrl - Web location (URL) of the data file
shortLabel - Short track description (Max 17 characters)
longLabel - Longer track description (displayed over tracks in the browser)
track McGill_MS000101_monocyte_RNASeq_signal_forward
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_forward

track McGill_MS000101_monocyte_RNASeq_signal_reverse
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_reverse.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_reverse
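A track description document can be checked before it is loaded into the browser. The sketch below is an illustration under simplifying assumptions: stanzas are separated by blank lines, each line is a `key value` pair, and only the five minimum fields and the 17-character `shortLabel` limit listed above are verified. The function names `parse_track_stanzas` and `check_stanza` are hypothetical, and this is not the browser's own validator.

```python
REQUIRED_FIELDS = ("track", "type", "bigDataUrl", "shortLabel", "longLabel")
MAX_SHORT_LABEL = 17  # limit quoted in the field list above


def parse_track_stanzas(text):
    """Split a track description document into one dict per stanza."""
    stanzas, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:  # a blank line closes the current stanza
            if current:
                stanzas.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def check_stanza(stanza):
    """Return a list of problems found in one stanza (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in stanza]
    if len(stanza.get("shortLabel", "")) > MAX_SHORT_LABEL:
        problems.append("shortLabel longer than 17 characters")
    return problems


example = """\
track McGill_MS000101_monocyte_RNASeq_signal_forward
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_forward

track McGill_MS000101_monocyte_RNASeq_signal_reverse
type bigWig
bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_reverse.bigWig
shortLabel 000101mono.rna
longLabel MS000101 | human | monocyte | RNA-Seq | signal_reverse
"""

tracks = parse_track_stanzas(example)
for t in tracks:
    print(t["track"], check_stanza(t) or "OK")
```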
Visualizing (Epi)Genomics Data
Includes Roadmap Epigenome data
Supports many track types included in the UCSC Browser
Can also load UCSC track hub documents
https://epigenomegateway.wustl.edu/
General
Species-specific genome browser
TCGA (The Cancer Genome Atlas) data portal, https://cancergenome.nih.gov/
ENCODE (the ENCyclopedia Of DNA Elements) data coordination center, http://genome.ucsc.edu/ENCODE/
Connectivity Map - a collection of gene expression data from human cells treated with bioactive small molecules. More than 7,000 expression profiles representing 1,309 compounds
CLUE Connectivity Map - >3M gene expression profiles and >1M replicate-collapsed signatures
API access, https://clue.io/api
Many analytical tools, http://lincsproject.org/
Query your up/downregulated genes, https://clue.io/l1000-query
Subramanian, Aravind, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, et al. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.” Cell, (November 2017)
https://jhubiostatistics.shinyapps.io/recount/, https://bioconductor.org/packages/recount/
Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. “Reproducible RNA-Seq Analysis Using Recount2.” Nature Biotechnology, (April 11, 2017)
https://maayanlab.cloud/archs4/
Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hyojin J. Lee, Moshe C. Silverstein, Lily Wang, and Avi Ma’ayan. “Massive Mining of Publicly Available RNA-Seq Data from Human and Mouse.” BioRxiv, January 1, 2017
ExperimentHub provides a central location where curated data from experiments, publications or training courses can be accessed
Each resource has associated metadata, tags and date of modification
The R package client creates and manages a local cache of files retrieved enabling quick and reproducible access
Usage similar to AnnotationHub
https://bioconductor.org/packages/ExperimentHub/
http://software.broadinstitute.org/software/igv/
Features
Explore large genomic datasets with an intuitive, easy-to-use interface
Integrate multiple data types with clinical and other sample information
View data from multiple sources:
Automation of specific tasks using command-line interface
Tutorial: https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial
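Task automation in IGV is commonly done with batch scripts, which are plain-text files with one command per line. The sketch below generates such a script using the batch commands `new`, `genome`, `load`, `snapshotDirectory`, `goto`, `snapshot`, and `exit`. The helper name `igv_batch_script`, the genome ID, the locus, and the output directory are illustrative assumptions; the track URL is reused from the track hub example earlier in these slides.

```python
def igv_batch_script(genome, tracks, loci, snapshot_dir):
    """Build the text of an IGV batch script that snapshots each locus.

    Hypothetical helper: it returns the script text and does not start IGV.
    """
    lines = ["new", f"genome {genome}"]
    lines += [f"load {track}" for track in tracks]
    lines.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        lines.append(f"goto {locus}")
        # Derive a file name from the locus, e.g. chr1:1000-2000 -> chr1_1000_2000.png
        name = locus.replace(":", "_").replace("-", "_")
        lines.append(f"snapshot {name}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


script = igv_batch_script(
    genome="hg19",  # assumed genome ID
    tracks=["http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig"],
    loci=["chr1:1000-2000"],  # placeholder region
    snapshot_dir="snapshots",  # placeholder output directory
)
print(script)
```

Writing the returned text to a file gives a script that can be run from within IGV or supplied at startup.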
https://bioconductor.org/packages/Gviz/
D3-based interactive visualization tool for functional genomics data.
Multiple visualizations using scatterplots, heatmaps and other user-supplied visualizations.
Includes data from the Gene Expression Barcode project for transcriptome visualization.
https://bioconductor.org/packages/ggbio/
http://www.sthda.com/english/wiki/ggbio-visualize-genomic-data
Gel, Bernat, and Eduard Serra. “KaryoploteR: An R/Bioconductor Package to Plot Customizable Genomes Displaying Arbitrary Data.” Bioinformatics 33, no. 19 (October 1, 2017): 3088–90. https://doi.org/10.1093/bioinformatics/btx346.
Review of omics data visualization tools, summary table: Schroeder, Michael P., Abel Gonzalez-Perez, and Nuria Lopez-Bigas. “Visualizing Multidimensional Cancer Genomics Data.” Genome Medicine, (2013)
GIVE (Genomic Interaction Visualization Engine) - an open source programming library that allows anyone with HTML programming experience to build custom genome browser websites or apps
Cao, Xiaoyi, Zhangming Yan, Qiuyang Wu, Alvin Zheng, and Sheng Zhong. “Building a Genome Browser with GIVE.” BioRxiv, January 1, 2018. https://zhong-lab-ucsd.github.io/GIVE_homepage/
Web-based framework offering a user-friendly interface mapping to most popular bioinformatics tools
Allows for reproducible results
Ability to design custom pipelines and import others’
Tailored to small- and medium-scale projects with a modest number of samples
BaseSpace - Illumina-oriented cloud computing environment, https://basespace.illumina.com/home/index
GenePattern - web-based computational biology suite of tools for genomic analysis. http://software.broadinstitute.org/cancer/software/genepattern/
GenomeSpace - integrated environment of the aforementioned genomic platforms allowing the data to be stored in one place and analyzed by a multitude of tools. http://www.genomespace.org/
Side-by-side comparison of many resources https://docs.google.com/spreadsheets/d/1o8iYwYUy0V7IECmu21Und3XALwQihioj23WGv-w0itk/pubhtml
Langmead, Ben, and Abhinav Nellore. “Cloud Computing for Genomic Data Analysis and Collaboration.” Nature Reviews Genetics, January 30, 2018.