class: center, middle, inverse, title-slide # Genomic resources ### Mikhail Dozmorov ### Virginia Commonwealth University ### 02-15-2021 --- ## High-throughput data repositories - **GEO**: Gene Expression Omnibus - Host array- and sequencing-based processed data - **SRA**: Sequence Read Archive - Designed for hosting large scale high-throughput sequencing data, e.g., high speed file transfer - Data are required to be deposited in one of the databases when paper is accepted - **ArrayExpress**: European version of GEO - Better curated than GEO but has less data --- ## Sequence Read Archive (SRA) - The NCBI database which stores sequence data obtained from next generation sequence (NGS) technology - Archives raw NGS data for various organisms from several platforms (FASTQ files) - Serves as a starting point for “secondary analyses” - Provides access to data from human clinical samples to authorized users who agree to the datasets’ privacy and usage mandates - Search metadata to locate the sequence reads for download and further downstream analyses https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi https://www.ncbi.nlm.nih.gov/sra/ --- ## Getting data from SRA The NCBI `sratoolkit` provides two command line tools to allow local BLAST searches against specific sra files directly - `fastq-dump`: Convert SRA data into fastq format - `prefetch`: Allows command-line downloading of SRA, dbGaP, and ADSP data - `sam-dump`: Convert SRA data to sam format - `sra-pileup`: Generate pileup statistics on aligned SRA data - `vdb-config`: Display and modify VDB configuration information - `vdb-decrypt`: Decrypt non-SRA dbGaP data ("phenotype data") https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software --- ## Getting data from SRA `.sra` files are NOT FASTQ files - need to further convert them using `sratoolkit` ``` wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP101/SRP101962/SRR5346141/SRR5346141.sra # To split paired-end reads, use -I option sratoolkit.2.8.1-win64/bin/fastq-dump -I --split-files SRR5346141 ``` https://www.ncbi.nlm.nih.gov/books/NBK47528/ --- ## Long reads Bacterial and eukaryotic genomes available from PacBio DevNet https://github.com/PacificBiosciences/DevNet/wiki/Datasets Kim KE, Peluso P, Babayan P, Yeadon PJ, Yu C, Fisher WW, Chin C-S, Rapicavoli NA, Rank DR, Li J, et al. 2014. Long-read, whole-genome shotgun sequence data for five model organisms. Sci Data 1: 140045. --- ## UCSC Genome Browser - The UCSC genome browser is a graphical viewer for visualizing genome annotations - Initially developed by Jim Kent on 2000 when he was a Ph.D. student in Biology - Host genomic annotation data for many species - Provide other tools for genomic data analysis and interfaces for querying the database http://genome.ucsc.edu/ https://genome.ucsc.edu/FAQ/FAQgenes.html --- ## UCSC Genome Browser Track Hubs - Track hubs are web-accessible (HTTP or FTP) directories of genomic data that can be viewed on the UCSC Genome Browser - Tracks can be aggregated using a text document in the UCSC Genome Browser track hub format - Advantage: Can be easily distributed to collaborators / users of your resources - Disadvantage: Need to generate this text document http://genome.ucsc.edu/goldenpath/help/hgTrackHubHelp.html --- ## Small track hub example Minimum set of track description fields: - _track_ - Symbolic name of the track - _type_ - One of the supported formats - bigWig, bigBed, bigGenePred, bam, vcfTabix ... - _bigDataUrl_ - Web location (URL) of the data file - _shortLabel_ - Short track description (Max 17 characters) - _longLabel_ - Longer track description (displayed over tracks in the browser) --- ## Small track hub example ``` track McGill_MS000101_monocyte_RNASeq_signal_forward type bigWig bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_forward.bigWig shortLabel 000101mono.rna longLabel MS000101 | human | monocyte | RNA-Seq | signal_forward track McGill_MS000101_monocyte_RNASeq_signal_reverse type bigWig bigDataUrl http://epigenomesportal.ca/public_data/MS000101.monocyte.RNASeq.signal_reverse.bigWig shortLabel 000101mono.rna longLabel MS000101 | human | monocyte | RNA-Seq | signal_reverse ``` --- ## WashU Epigenome Browser - Visualizing (Epi)Genomics Data - Includes Roadmap Epigenome data - Supports many track types included in the UCSC Browser - Can also load UCSC track hub documents https://epigenomegateway.wustl.edu/ <!-- ## Large genomics projects and resources misc/Table_big_data.xlsx - **1000 Genomes Project (1KGP)**, https://www.internationalgenome.org/, This project includes whole-genome and exome sequencing data from 2,504 individuals across 26 populations - **Cancer Cell Line Encyclopedia (CCLE)**, https://portals.broadinstitute.org/ccle, This resource includes data spanning 1,457 cancer cell lines - **Encyclopedia of DNA Elements (ENCODE)**, https://www.encodeproject.org/, The goal of this project is to identify functional elements of the human genome using a gamut of sequencing assays across cell lines and tissues - **Genome Aggregation Database (gnomAD)**, https://gnomad.broadinstitute.org/, This resource entails coverage and allele frequency information from over 120,000 exomes and 15,000 whole genomes ## Large genomics projects and resources - **Genotype–Tissue Expression (GTEx) Portal**, https://gtexportal.org/, This effort has to date performed RNA sequencing or genotyping of 714 individuals across 53 tissues - **Global Alliance for Genomics and Health (GA4GH)**, https://www.ga4gh.org, This consortium of over 400 institutions aims to standardize secure sharing of genomic and clinical data - **International Cancer Genome Consortium (ICGC)**, https://daco.icgc.org/, This consortium spans 76 projects, including TCGA - **Million Veterans Program (MVP)**, https://www.research.va.gov/mvp/, This US programme aims to collect blood samples and health information from 1 million military veterans ## Large genomics projects and resources - **Model Organism Encyclopedia of DNA Elements (modENCODE)**, http://www.modencode.org/, The goal of this effort is to identify functional elements of the Drosophila melanogaster and Caenorhabditis elegans genomes using a gamut of sequencing assays - **Precision Medicine Initiative (PMI)**, https://allofus.nih.gov/, This US programme aims to collect genetic data from over 1 million individuals - **The Cancer Genome Atlas (TCGA)**, https://cancergenome.nih.gov, This resource includes data from 11,350 individuals spanning 33 cancer types - **Trans-Omics for Precision Medicine (TOPMed)**, https://www.nhlbiwgs.org, The goal of this programme is to build a commons with omics data and associated clinical outcomes data across populations for research on heart, lung, blood and sleep disorders .small[Langmead, Ben, and Abhinav Nellore. “[Cloud Computing for Genomic Data Analysis and Collaboration](https://doi.org/10.1038/nrg.2017.113).” Nature Reviews Genetics, January 30, 2018] --> --- ## Other genome browsers/databases **General** - NCBI Genome Data Viewer, https://www.ncbi.nlm.nih.gov/genome/gdv/ - Ensembl genome browser, https://www.ensembl.org/ **Species-specific genome browser** - **MGI**: Mouse genome informatics, http://www.informatics.jax.org/ - **wormbase** http://www.wormbase.org/ - **Flybase** http://flybase.org/ - **SGD** (yeast) https://www.yeastgenome.org/ - **TAIR DB** (arabidopsis) https://www.arabidopsis.org/ - **MBGD microbial genome database** http://mbgd.genome.ad.jp/ --- ## High-throughput data repositories - **TCGA** (The Cancer Genome Atlas) data portal, https://cancergenome.nih.gov/ - Host data generated by TCGA, a big consortium to study cancer genomics - Huge collection of cancer-related data: different types of genomic, genetic and clinical data for many different types of cancers - **ENCODE** (the ENCyclopedia Of DNA Elements) data coordination center (http://genome.ucsc.edu/ENCODE/): - Host data generated by ENCODE, a big consortium to study functional elements of human genome - Rich collection of genomic and epigenomic data --- ## Connectivity Map - [Connectivity Map](https://portals.broadinstitute.org/cmap/) - a collection of gene expression data from human cells treated with bioactive small molecules. More than 7,000 expression profiles representing 1,309 compounds - [CLUE Connectivity Map](https://clue.io/) - >3M gene expression profiles and >1M replicate-collapsed signatures API access, https://clue.io/api Many analytical tools, http://lincsproject.org/ Query your up/downregulated genes, https://clue.io/l1000-query Subramanian, Aravind, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, et al. “[A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles](https://doi.org/10.1016/j.cell.2017.10.049).” Cell, (November 2017) --- ## RECOUNT2 - A multi-experiment resource of RNA-seq gene and exon count datasets .center[ <img src="img/recount2.png" height = 150> ] - Uniformly processed (Rail-RNA) gene- and exon counts - Signal coverage in bigWig format - Phenotype data - RangedSummarizedExperiment R objects https://jhubiostatistics.shinyapps.io/recount/, https://bioconductor.org/packages/recount/ .small[ Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. “[Reproducible RNA-Seq Analysis Using Recount2](https://doi.org/10.1038/nbt.3838).” Nature Biotechnology, (April 11, 2017) ] --- ## ARCHS4 - all RNA-seq and ChIP-seq sample and signature search - A web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene count level - All available FASTQ files from RNA-seq experiments were retrieved from the Gene Expression Omnibus (GEO) and aligned using a cloud-based infrastructure. - 72,363 mouse and 65,429 human samples. Processed data in HDF5 format - Gene-centric exploratory analysis of average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression https://maayanlab.cloud/archs4/ .small[ Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hyojin J. Lee, Moshe C. Silverstein, Lily Wang, and Avi Ma’ayan. “[Massive Mining of Publicly Available RNA-Seq Data from Human and Mouse](https://doi.org/10.1101/189092).” BioRxiv, January 1, 2017] --- ## ExperimentHub - ExperimentHub provides a central location where curated data from experiments, publications or training courses can be accessed - Each resource has associated metadata, tags and date of modification - The R package client creates and manages a local cache of files retrieved enabling quick and reproducible access - Usage similar to `AnnotationHub` https://bioconductor.org/packages/ExperimentHub/ --- ## Visualization: Integrative Genomics Viewer (IGV) .center[ <img src="img/igv.png" height = 450> ] http://software.broadinstitute.org/software/igv/ --- ## Visualization: Integrative Genomics Viewer (IGV) Features - Explore large genomic datasets with an intuitive, easy-to-use interface - Integrate multiple data types with clinical and other sample information - View data from multiple sources: - local, remote, and "cloud-based" - Intelligent remote file handling - no need to download the whole dataset - Automation of specific tasks using command-line interface Tutorial: https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial --- ## Gviz R package - Plotting data and annotation information along genomic coordinates - Track-oriented .center[ <img src="img/gviz.png" height = 350> ] https://bioconductor.org/packages/Gviz/ --- ## epivizR R package - D3-based interactive visualization tool for functional genomics data. - Multiple visualizations using scatterplots, heatmaps and other user-supplied visualizations. - Includes data from the Gene Expression Barcode project for transcriptome visualization. http://epiviz.cbcb.umd.edu/ https://epiviz.github.io/ --- ## ggbio R package - ggplot2 for genomic data .center[ <img src="img/ggbio-show-mutation.png" height = 350> ] https://bioconductor.org/packages/ggbio/ http://www.sthda.com/english/wiki/ggbio-visualize-genomic-data --- ## karyotypeR - karyoploteR is an R package to create karyoplots, that is, representations of whole genomes with arbitrary data plotted on them .center[ <img src="img/karyoploter.png" height = 300> ] .small[ Gel, Bernat, and Eduard Serra. “KaryoploteR: An R/Bioconductor Package to Plot Customizable Genomes Displaying Arbitrary Data.” Bioinformatics 33, no. 19 (October 1, 2017): 3088–90. https://doi.org/10.1093/bioinformatics/btx346. https://bioconductor.org/packages/karyoploteR/ https://bernatgel.github.io/karyoploter_tutorial/ ] --- ## Other visualization tools Review of omics data visualization tools, summary table: Schroeder, Michael P., Abel Gonzalez-Perez, and Nuria Lopez-Bigas. “[Visualizing Multidimensional Cancer Genomics Data](https://doi.org/10.1186/gm413).” Genome Medicine, (2013) [GIVE (Genomic Interaction Visualization Engine)](https://genomemedicine.biomedcentral.com/articles/10.1186/gm413) - an open source programming library that allows anyone with HTML programming experience to build custom genome browser websites or apps Cao, Xiaoyi, Zhangming Yan, Qiuyang Wu, Alvin Zheng, and Sheng Zhong. “[Building a Genome Browser with GIVE](https://doi.org/10.1101/177832).” BioRxiv, January 1, 2018. https://zhong-lab-ucsd.github.io/GIVE_homepage/ --- ## Galaxy - Web-based framework offering a user-friendly interface mapping to most popular bioinformatics tools - "Data intensive biology for everyone" - Allows for reproducible results - Steps / parameters kept in history - Ability to design custom pipelines and import others’ - All through a user-friendly GUI - Tailored for small/medium scale projects with not too many samples https://usegalaxy.org/ --- ## Other resources - **BaseSpace** - Illumina-oriented cloud computing environment, https://basespace.illumina.com/home/index - **GenePattern** - web-based computational biology suite of tools for genomic analysis. http://software.broadinstitute.org/cancer/software/genepattern/ - **GenomeSpace** - integrated environment of the aforementioned genomic platforms allowing the data to be stored in one place and analyzed by a multitude of tools. http://www.genomespace.org/ Side-by-side comparison of many resources https://docs.google.com/spreadsheets/d/1o8iYwYUy0V7IECmu21Und3XALwQihioj23WGv-w0itk/pubhtml --- ## Summarized data sets, services and resources <!-- misc/Table_resources.xlsx --> .center[ <img src="img/big_data_sets.png" height = 450> ] .small[ Langmead, Ben, and Abhinav Nellore. “[Cloud Computing for Genomic Data Analysis and Collaboration](https://doi.org/10.1038/nrg.2017.113).” Nature Reviews Genetics, January 30, 2018.]