The Cancer Genome Atlas (TCGA)

Mikhail Dozmorov

Virginia Commonwealth University

02-15-2021

1 / 25

The Cancer Genome Atlas (TCGA)

  • Started December 13, 2005, phase II in 2009, ended in 2014

  • Mission - to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

  • Data generation

    • Clinical information about participants
    • Metadata about the samples (e.g. the weight of a sample portion)
    • Histopathology slide images from sample portions
    • Molecular information derived from the samples (e.g. mRNA/miRNA expression, protein expression, copy number)

https://cancergenome.nih.gov/

2 / 25

TCGA by the numbers

https://cancergenome.nih.gov/abouttcga

3 / 25

Major TCGA Research Components

  • Biospecimen Core Resource (BCR) - Collect and process tissue samples

  • Genome Sequencing Centers (GSCs) - Use high-throughput genome sequencing to identify changes in DNA sequence in cancer

  • Genome Characterization Centers (GCCs) - Analyze genomic and epigenomic changes involved in cancer

  • Data Coordinating Center (DCC) - Centrally manages all TCGA data

  • Genome Data Analysis Centers (GDACs) - Provide informatics tools to facilitate broader use of TCGA data

4 / 25

TCGA Data Access Policy

  • An access control policy is in place for TCGA data to ensure that personally identifiable information is kept from unauthorized users

  • Open access - Houses data that cannot be aggregated to generate a data set unique to an individual. This tier does not require user certification for data access

  • Controlled access - Houses individually-unique information that could potentially be used to identify an individual. This tier requires user certification for data access

5 / 25

TCGA Controlled Access Data

Access to controlled data is available to researchers who:

  • Agree to restrict their use of the information to biomedical research purposes only

  • Agree with the statements within the TCGA Data Use Certification (DUC)

  • Have their institution certify its agreement with the statements within the TCGA DUC

  • Complete the Data Access Request (DAR) form and submit it to the Data Access Committee to be a TCGA Approved User. This form is available electronically through dbGaP

https://wiki.nci.nih.gov/display/TCGA/TCGA+Home

6 / 25

TCGA sample identifiers

  • Each sample has a unique ID (barcode), like TCGA-AO-A128
  • Each barcode can and should be parsed into its hyphen-separated fields

  • The two-digit sample-type code in the barcode distinguishes tumor from normal samples: tumor types range from 01 to 09, normal types from 10 to 19, and control samples from 20 to 29
  • Not to be confused with case UUIDs, like 7eea2b6e-771f-44c0-9350-38f45c8dbe87, which are bound to filenames
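
Parsing a barcode can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: the field names (`tss`, `participant`, `sample_code`, and so on) are labels chosen here, and the layout assumed is the hyphen-separated one described on the TCGA barcode wiki page, where the first two digits of the fourth field are the sample-type code.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA barcode into named fields; trailing fields may be absent."""
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    # Field names are illustrative labels, not an official schema.
    fields = {"project": parts[0], "tss": parts[1], "participant": parts[2]}
    if len(parts) > 3:
        fields["sample_code"] = parts[3][:2]  # e.g. "01"
        fields["vial"] = parts[3][2:]         # e.g. "A"
    if len(parts) > 4:
        fields["portion"] = parts[4][:2]
        fields["analyte"] = parts[4][2:]
    if len(parts) > 5:
        fields["plate"] = parts[5]
    if len(parts) > 6:
        fields["center"] = parts[6]
    return fields


def sample_class(barcode):
    """Map the sample-type code to tumor / normal / control."""
    code = parse_tcga_barcode(barcode).get("sample_code")
    if code is None or not code.isdigit():
        return "unknown"  # patient-level barcode or malformed code
    number = int(code)
    if 1 <= number <= 9:
        return "tumor"
    if 10 <= number <= 19:
        return "normal"
    if 20 <= number <= 29:
        return "control"
    return "unknown"


print(sample_class("TCGA-02-0001-01C-01D-0182-01"))
print(sample_class("TCGA-AO-A128"))
```

A short barcode such as TCGA-AO-A128 identifies a participant only, so it carries no sample-type code and is reported as "unknown" here.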

https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

10 / 25

PAM50

  • Breast cancer can be classified into 4 major intrinsic subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like
  • Subtypes are clinically relevant for drug sensitivity and long-term survival
  • Tumor subtype is determined from the expression of 50 genes
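
The idea of assigning a subtype from a gene expression signature can be sketched as a nearest-centroid rule: compare a tumor's expression profile with one reference profile (centroid) per subtype and pick the best match by Spearman rank correlation. The sketch below is a toy under loud assumptions: it uses 5 made-up gene values per centroid instead of the real 50-gene centroids, and all numbers are hypothetical. For real analyses, use a maintained implementation such as the genefu package.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def nearest_centroid(profile, centroids):
    """Return the subtype whose centroid correlates best with the profile."""
    return max(centroids, key=lambda name: spearman(profile, centroids[name]))


# Hypothetical centroids over 5 made-up genes (NOT the real PAM50 values).
TOY_CENTROIDS = {
    "Luminal A": [2.0, 1.5, -1.0, -2.0, -0.5],
    "Luminal B": [1.0, 2.0, 1.5, -1.5, -1.0],
    "HER2-enriched": [-1.0, -0.5, 1.0, 2.0, 0.0],
    "Basal-like": [-2.0, -1.5, 0.5, -0.5, 2.0],
}

toy_profile = [-1.8, -1.2, 0.3, -0.2, 1.9]
print(nearest_centroid(toy_profile, TOY_CENTROIDS))
```

Rank correlation is used so that the call depends on the ordering of gene expression within a sample rather than on its absolute scale.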

https://xenabrowser.net/datapages/?dataset=TCGA.BRCA.sampleMap/BRCA_clinicalMatrix&host=https://tcga.xenahubs.net

genefu R package for PAM50 classification and survival analysis. https://www.bioconductor.org/packages/release/bioc/html/genefu.html

Parker, Joel S., Michael Mullins, Maggie C. U. Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, et al. “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes.” Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology (March 10, 2009)

11 / 25

The Broad Institute Genome Data Analysis Center (GDAC) Firehose

  • Standardized, analysis-ready TCGA datasets

    • Aggregated, version-stamped
    • Analysis-ready format / semantics
  • Standardized analyses upon them

    • For vetted algorithms: GISTIC, MutSig, CNMF, ...
    • Companioned with biologist-friendly reports

http://gdac.broadinstitute.org/

12 / 25

Firehose data access

  • fbget - Python application programming interface (API) with >27 functions for Sample-level data, Firehose analyses, Standard data archives, Metadata access

    • Unix command-line access, firehose_get
  • FirebrowseR - An R client for the Broad Institute's Firehose pipeline, providing TCGA data sets

  • web-TCGA - a Shiny app to access TCGA data from FireBrowse

http://firebrowse.org/

13 / 25

NCI's Genomic Data Commons (GDC)

Launched on June 6, 2016. Provides standardized genomic and clinical data

14 / 25

Accessing GDC

  • The GDC Application Programming Interface (API)

  • GenomicDataCommons - GDC access in R
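
As a rough sketch of what a GDC API request looks like, the Python below builds a query against the `files` endpoint using the JSON filter syntax from the GDC API user guide. The helper names `build_files_params` and `build_files_url` are made up for this example, and the chosen project, data category, and returned fields are only illustrative. The code builds the request but does not send it.

```python
import json
from urllib.parse import urlencode

GDC_API = "https://api.gdc.cancer.gov"


def build_files_params(project_id, data_category, size=10):
    """Query parameters for the GDC 'files' endpoint (illustrative helper)."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "files.data_category",
                         "value": [data_category]}},
        ],
    }
    return {
        "filters": json.dumps(filters),          # filters travel as a JSON string
        "fields": "file_id,file_name,data_type",  # columns to return
        "format": "JSON",
        "size": str(size),                        # maximum number of hits
    }


def build_files_url(project_id, data_category, size=10):
    """Full GET URL for the query above."""
    params = build_files_params(project_id, data_category, size)
    return f"{GDC_API}/files?{urlencode(params)}"


print(build_files_url("TCGA-BRCA", "Transcriptome Profiling", size=5))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Open-access metadata queries like this one need no token, while downloading controlled-access files does.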

https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#api-endpoints

https://bioconductor.org/packages/GenomicDataCommons/

15 / 25

cBioPortal

  • Rich set of tools for visualization, analysis and download of large-scale cancer genomics data sets.

    • Mutations (OncoPrint display)
    • Mutual exclusivity of genetic events (log-odds ratio)
    • Correlations among genetic events (boxplots)
    • Survival (Kaplan-Meier plots)
  • The Onco Query Language (OQL) to fine-tune queries
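
The log-odds ratio used for mutual exclusivity can be worked through by hand. The sketch below assumes the usual 2x2 table for two genes A and B (samples altered in both, in A only, in B only, in neither) and a base-2 logarithm. The function name and the handling of empty cells are choices made for this illustration, not cBioPortal's code.

```python
import math


def log2_odds_ratio(both, a_only, b_only, neither):
    """log2 of (both * neither) / (a_only * b_only) for a 2x2 alteration table.

    Negative values suggest mutual exclusivity, positive values co-occurrence.
    Empty cells give +/- infinity (or NaN when the ratio is 0/0).
    """
    numerator = both * neither
    denominator = a_only * b_only
    if numerator == 0 and denominator == 0:
        return math.nan
    if denominator == 0:
        return math.inf
    if numerator == 0:
        return -math.inf
    return math.log2(numerator / denominator)


# Alterations that tend to occur together: odds ratio (4 * 4) / (1 * 1) = 16
print(log2_odds_ratio(both=4, a_only=1, b_only=1, neither=4))

# Alterations that rarely overlap: odds ratio (1 * 3) / (3 * 3) = 1/3
print(log2_odds_ratio(both=1, a_only=3, b_only=3, neither=3))
```

The sign alone is not evidence. A significance test on the same table (for example, Fisher's exact test) is still needed before calling a pair mutually exclusive.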

http://www.cbioportal.org/index.do

http://www.cbioportal.org/tutorial.jsp - short tutorials

Gao, Jianjiong, Bülent Arman Aksoy, Ugur Dogrusoz, Gideon Dresdner, Benjamin Gross, S. Onur Sumer, Yichao Sun, et al. “Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal.” Science Signaling (April 2, 2013)

16 / 25

cBioPortal data

  • REST-based web API

  • CGDS-R package provides a basic set of functions for querying the Cancer Genomics Data Server (CGDS)

  • MATLAB CGDS Cancer Genomics Toolbox - data access functionality in the MATLAB environment

http://www.cbioportal.org/web_api.jsp

http://www.cbioportal.org/cgds_r.jsp

https://cran.r-project.org/web/packages/cgdsr/vignettes/cgdsr.pdf

17 / 25

R resources to access TCGA data

  • curatedTCGAData - Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects

    • MultiAssayExperiment objects integrate multiple assays (e.g. RNA-seq, copy number, mutation, microRNA, protein, and others) with clinical / pathological data.
    • Patient IDs are matched (same number and order) across multiple assays, enabling harmonized subsetting of rows (features) and columns (patients / samples) across the entire experiment.
  • HarmonizedTCGAData - Processed Harmonized TCGA Data of Five Selected Cancer Types
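
The patient-matching idea behind MultiAssayExperiment can be illustrated outside R. The Python sketch below is a conceptual toy, not the package's algorithm: it reduces each sample barcode to its patient-level prefix, keeps only patients present in every assay, and returns each assay's values in the same patient order. All data values and the TCGA-XX-* barcodes are hypothetical.

```python
def patient_id(barcode):
    """First three barcode fields, e.g. TCGA-AO-A128-01A -> TCGA-AO-A128."""
    return "-".join(barcode.split("-")[:3])


def harmonize(assays):
    """Align assays on shared patients.

    assays: {assay_name: {sample_barcode: value}}
    Returns (patients, aligned) where aligned[assay][i] belongs to patients[i].
    Simplification: if a patient has several samples in one assay,
    the last one seen is kept.
    """
    by_patient = {
        name: {patient_id(barcode): value for barcode, value in samples.items()}
        for name, samples in assays.items()
    }
    shared = sorted(set.intersection(*(set(d) for d in by_patient.values())))
    aligned = {name: [d[p] for p in shared] for name, d in by_patient.items()}
    return shared, aligned


# Hypothetical toy data: one value per sample in each assay.
toy_assays = {
    "rnaseq": {"TCGA-AO-A128-01A": 5.2,
               "TCGA-XX-0001-01A": 3.1,
               "TCGA-XX-0002-01A": 7.7},
    "copy_number": {"TCGA-XX-0002-01A": -1,
                    "TCGA-AO-A128-01A": 2},
}

patients, aligned = harmonize(toy_assays)
print(patients)
print(aligned)
```

Once every assay shares the same patient order, subsetting by patient is a single index operation applied to all assays at once.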

https://bioconductor.org/packages/curatedTCGAData/

https://bioconductor.org/packages/HarmonizedTCGAData/

18 / 25

R resources to access TCGA data

  • curatedOvarianData

    • 30 datasets, > 3K unique samples
    • survival, surgical debulking, histology...
  • curatedCRCData (colorectal)

    • 34 datasets, ~4K unique samples
    • many annotated for MSS, gender, stage, age, N, M
  • curatedBladderData

    • 12 datasets, ~1,200 unique samples
    • many annotated for stage, grade, OS

19 / 25

TCGA packages

  • TCGAbiolinks - an R package for integrative analysis of TCGA data

https://bioconductor.org/packages/TCGAbiolinks/

Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research (May 5, 2016)

20 / 25

TCGA2STAT

  • Well-structured TCGA data access in R

https://CRAN.R-project.org/package=TCGA2STAT

21 / 25

Xena Functional Genomics Explorer

  • Formerly the UCSC Cancer Genomics Browser, now UCSC Xena

  • Includes TCGA, Cancer Cell Line Encyclopedia, the Stand Up To Cancer (SU2C) Breast Cancer data, custom datasets

  • A tool to visually explore and analyze cancer genomics data and its associated clinical information.

  • Gene- and genome-centric view

  • Survival analysis on user-defined subgroups
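
Survival curves for a subgroup are typically Kaplan-Meier estimates. The sketch below is a minimal, dependency-free version of that estimator, included only to show the calculation. The function name and toy inputs are invented here, and a real analysis would use a dedicated survival library.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    Returns [(time, survival_probability)] at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at this time point.
        while i < len(data) and data[i][0] == current_time:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving  # events and censored patients leave the risk set
    return curve


# Toy subgroup: 4 patients, the second one is censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Running the function once per subgroup gives the curves to compare. Whether the difference is significant is a separate question, usually answered with a log-rank test.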

https://xenabrowser.net/, https://xenabrowser.net/datapages/, http://xena.ucsc.edu/getting-started/

Cline, Melissa S., Brian Craft, Teresa Swatloski, Mary Goldman, Singer Ma, David Haussler, and Jingchun Zhu. “Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser.” Scientific Reports (October 2, 2013)

22 / 25

TCGA analysis on the cloud

  • Goal - simplify centralized access to TCGA data and provide easy analysis

  • Three centers were selected to develop cloud access

    • Institute for Systems Biology Cancer Genomics Cloud (ISB-CGC)
    • Broad Institute FireCloud
    • Seven Bridges Cancer Genomics Cloud

http://cgc.systemsbiology.net/

https://software.broadinstitute.org/firecloud/

http://www.cancergenomicscloud.org/

23 / 25

Other resources for cancer genomics

Gonzalez-Perez, Abel, Christian Perez-Llamas, Jordi Deu-Pons, David Tamborero, Michael P Schroeder, Alba Jene-Sanz, Alberto Santos, and Nuria Lopez-Bigas. “IntOGen-Mutations Identifies Cancer Drivers across Tumor Types.” Nature Methods (September 15, 2013)

24 / 25

International Cancer Genome Consortium

  • The International Cancer Genome Consortium (ICGC)’s Pan-Cancer Analysis of Whole Genomes (PCAWG) project aimed to categorize somatic and germline variations in both coding and non-coding regions in over 2,800 cancer patients

  • 5,789 whole genomes of tumors and matched normal tissue spanning 39 tumor types; RNA-Seq profiles were obtained from a subset of 1,284 of the donors

  • Similar to other large-scale genome projects, the ICGC has a Data Coordination Center (DCC)

http://icgc.org/, http://dcc.icgc.org/

25 / 25
