+ - 0:00:00
Notes for current slide
Notes for next slide

Annotations

Mikhail Dozmorov

Virginia Commonwealth University

03-01-2021

1 / 23

Gene annotations

2 / 23

Gene identifiers

Gene

  • Ensembl: ENSG00000139618
  • Entrez Gene: 675
  • Unigene: Hs.34012

RNA transcript

  • GenBank: BC026160.1
  • RefSeq: NM_000059
  • Ensembl: ENST00000380152
3 / 23

ID cross-mapping

  • There are many IDs

  • Software tools recognize only a handful

  • Humans better recognize gene names

4 / 23

ID challenges

  • Avoid errors: map IDs correctly

    • Beware of 1-to-many mappings
  • Gene name ambiguity – not a good ID

    • e.g. BMFS5, LFS1, TRP53, p53
    • Better to use the standard gene symbol, not aliases: TP53
  • Excel error-introduction

    • OCT4 is changed to October-4 (open file/paste as text)
  • Problems reaching 100% cross-mapping

    • E.g. due to version issues
    • Use multiple sources to increase coverage
5 / 23

Reference Sequences (RefSeq)

  • Reference standard

    • Eukaryotes: genomic, transcript, protein sequences, derived from:
      • Computation
      • Manual curation of submitted data
      • Collaboration with other experts
  • 69,000 organisms

    • 7000 viruses, >40,000 prokaryotes, >10,000 eukaryotes

O'Leary, Nuala A., Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput et al. "Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation." Nucleic acids research, (2016)

6 / 23

RefSeq Accession Prefixes

NM_ = mRNA (experimentally supported)

XM_ = mRNA (predicted model)

NP_ = protein (experimentally supported)

XP_ = protein (predicted model)

NC_ = genomic/chromosome

NG_ = incomplete genomic assembly

https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

7 / 23

HUGO Gene Nomenclature Committee (HGNC) Gene Names

This resource lists gene name synonyms, which is useful if you are conducting a comprehensive literature search and need to find articles about a gene that may have been called other names in the past.

https://www.genenames.org/

8 / 23

GeneCards

http://www.genecards.org/

10 / 23

ID conversion helpers

  • clusterProfiler::bitr() function - Biological Id TranslatoR

  • HGNChelper R package to correct invalid Human/Mouse Gene Symbols

  • annotables R package by Stephen Turner, annotating/converting Gene IDs

  • AnnotationDbi R package for manipulation of SQLite-based annotations

  • biomaRt R package - Interface to BioMart databases (i.e. Ensembl)

11 / 23

BiomaRt basics

Biomart R package, biomaRt, workflow:

  • Discover and select an organism-specific mart and dataset
  • Select filters, which IDs to convert from
  • Select attributes, which IDs to convert to
  • Run the query

  • For genomic coordinates, use database that corresponds to genome assembly version you are interested in

  • Biomart has a web interface, operating on the same principles

12 / 23

BiomaRt

The getBM() function has three arguments that need to be introduced: filters, attributes and values.

  • Filters define a restriction on the query. Tell BiomaRt what kind of IDs do you have, so it will look for it. The listFilters() function shows you all available filters in the selected dataset

  • Attributes define the values we are interested in to retrieve. Which IDs associated with your IDs you want to get. The listAttributes() function displays all available attributes in the selected dataset

  • Values is a vector of IDs you want to convert

13 / 23

Annotation packages

14 / 23

Annotation packages

  • Bioconductor provides extensive access to 'annotation' resources, see the "AnnotationData" biocViews hierarchy.

  • AnnotationDBI - is a cornerstone of "AnnotationData" packages, provides user interface and database connection code for annotation data packages using SQLite data storage.

15 / 23

Annotation packages

  • org packages (e.g., org.Hs.eg.db) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page ?select

  • TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg38 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the ?exonsBy page to retrieve all exons grouped by gene or transcript.

16 / 23

Annotation packages

  • EnsDb packages and databases (e.g. EnsDb.Hsapiens.v86) provide, similar to TxDb packages, gene models, but also protein annotations (protein sequences and protein domains within these) and additional annotation columns such as "gene_biotype" or "tx_biotype" defining the biotype of the features (e.g. lincRNA, protein_coding, miRNA etc). EnsDb databases are designed for Ensembl annotations and contain annotations for all genes (protein coding and non-coding) for a specific Ensembl release.

  • BSgenome packages (e.g., BSgenome.Hsapiens.UCSC.hg19) contain whole genomes of model organisms. See available.genomes() for pre-packaged genomes.

17 / 23

Annotation methods

  • Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb objects with more specific classes named OrgDb, ChipDb or TranscriptDb objects.

  • Methods that can be applied to these objects include cols(), keys(), keytypes() and select().

18 / 23

Annotation methods

Category Function Description
Discover columns() List the kinds of columns that can be returned
keytypes() List columns that can be used as keys
keys() List values that can be expected for a given keytype
select() Retrieve annotations matching keys, keytype and columns
19 / 23

Annotation methods

Category Function Description
Manipulate setdiff(), union(), intersect() Operations on sets
duplicated(), unique() Mark or remove duplicates
%in%, match() Find matches
any(), all() Are any TRUE? Are all?
merge() Combine two different data.frames based on shared keys
20 / 23

Annotation methods

Category Function Description
GRanges* transcripts(), exons(), cds() Features (transcripts, exons, coding sequence) as GRanges.
transcriptsBy() , exonsBy() Features group by gene, transcript, etc., as GRangesList.
cdsBy()
21 / 23

KEGG

  • KEGG: Kyoto Encyclopedia of Genes and Genomes

  • KEGG API R package, KEGGREST

    • Essential operations outlined in the vignette
22 / 23

AnnotationHub

  • AnnotationHub package - curated database of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms. Examples of use include:

    • Easily access and import Roadmap Epigenomics files.
    • liftOver genomic range-based annotations from one coordinate system (e.g, hg19) to another (e.g., GRCh38).
    • Create TranscriptDb and BSgenome-style annotation resources 'on the fly' for a diverse set of organisms.
    • Programmatically access the genomic coordiantes of clinically relevant variants cataloged in dbSNP.
  • Related packages: ExperimentHub - curated data sets

https://bioconductor.org/packages/AnnotationHub

23 / 23

Gene annotations

2 / 23
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow