Annotations

class: center, middle, inverse, title-slide

# Annotations
### Mikhail Dozmorov
### Virginia Commonwealth University
### 03-01-2021

---

class: center, middle

# Gene annotations

---
## Gene identifiers

**Gene**

- **Ensembl:** ENSG00000139618
- **Entrez Gene:** 675
- **Unigene:** Hs.34012

**RNA transcript**

- **GenBank:** BC026160.1
- **RefSeq:** NM_000059
- **Ensembl:** ENST00000380152

---
## ID cross-mapping

- There are many IDs

- Software tools recognize only a handful

- Humans better recognize gene names

---
## ID challenges

- Avoid errors: map IDs correctly
    - Beware of 1-to-many mappings

- Gene name ambiguity – not a good ID
    - e.g. BMFS5, LFS1, TRP53, p53
    - Better to use the standard gene symbol, not aliases: TP53

- Excel error-introduction
    - OCT4 is changed to October-4  (open file/paste as text)

- Problems reaching 100% cross-mapping
    - E.g. due to version issues
    - Use multiple sources to increase coverage

---
## Reference Sequences (RefSeq)

- Reference standard
    - Eukaryotes: genomic, transcript, protein sequences, derived from: 
        - Computation
        - Manual curation of submitted data
        - Collaboration with other experts

- 69,000 organisms
    - >7000 viruses, >40,000 prokaryotes, >10,000 eukaryotes

.small[ O'Leary, Nuala A., Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput et al. "[Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation](https://dx.doi.org/10.1093%2Fnar%2Fgkv1189)." Nucleic acids research, (2016) ]

---
## RefSeq Accession Prefixes

NM\_ = mRNA (experimentally supported)

XM\_ = mRNA (predicted model)

NP\_ = protein (experimentally supported)

XP\_ = protein (predicted model)

NC\_ = genomic/chromosome

NG\_ = incomplete genomic assembly

.small[ https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly ]

---
## HUGO Gene Nomenclature Committee (HGNC) Gene Names

This resource lists gene name synonyms, which is useful if you are conducting a comprehensive literature search and need to find articles about a gene that may have been called other names in the past.

.small[ https://www.genenames.org/ ]

---
## NCBI Gene

.small[ https://www.ncbi.nlm.nih.gov/gene ]

---
## GeneCards

.small[ http://www.genecards.org/ ]

---
## ID conversion helpers

- `clusterProfiler::bitr()` function - Biological Id TranslatoR

- `HGNChelper` R package to correct invalid Human/Mouse Gene Symbols

- `annotables` R package by Stephen Turner, annotating/converting Gene IDs

- `AnnotationDbi` R package for manipulation of SQLite-based annotations

- `biomaRt` R package - Interface to BioMart databases (i.e. Ensembl)

.small[ http://yulab-smu.top/clusterProfiler-book/chapter14.html#bitr

https://CRAN.R-project.org/package=HGNChelper

https://github.com/stephenturner/annotables

https://bioconductor.org/packages/AnnotationDbi/

https://bioconductor.org/packages/biomaRt/ ]

---
## BiomaRt basics

Biomart R package, `biomaRt`, workflow:

- Discover and select an organism-specific mart and dataset
- Select filters, which IDs to convert from
- Select attributes, which IDs to convert to
- Run the query

- For genomic coordinates, use database that corresponds to genome assembly version you are interested in

- Biomart has a web interface, operating on the same principles

---
## BiomaRt

The `getBM()` function has three arguments that need to be introduced: **filters**, **attributes** and **values**.

- **Filters** define a restriction on the query. Tell BiomaRt what kind of IDs do you have, so it will look for it. The `listFilters()` function shows you all available filters in the selected dataset

- **Attributes** define the values we are interested in to retrieve. Which IDs associated with your IDs you want to get. The `listAttributes()` function displays all available attributes in the selected dataset

- **Values** is a vector of IDs you want to convert

---
class: middle, center

# Annotation packages

---
## Annotation packages

- _Bioconductor_ provides extensive access to 'annotation' resources, see the "AnnotationData" biocViews hierarchy.

- `AnnotationDBI` - is a cornerstone of "AnnotationData" packages, provides user interface and database connection code for annotation data packages using SQLite data storage.

.small[ http://bioconductor.org/packages/AnnotationDbi

https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData ]

---
## Annotation packages

- **org** packages (e.g., `org.Hs.eg.db`) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page `?select`

- **TxDb** packages (e.g., `TxDb.Hsapiens.UCSC.hg38.knownGene`) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg38 `knownGene` track of the UCSC genome browser. These packages can be queried, e.g., as described on the `?exonsBy` page to retrieve all exons grouped by gene or transcript.

.small[ https://bioconductor.org/packages/org.Hs.eg.db

https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene ]

---
## Annotation packages

- **EnsDb** packages and databases (e.g. `EnsDb.Hsapiens.v86`) provide, similar to TxDb packages, gene models, but also protein annotations (protein sequences and protein domains within these) and additional annotation columns such as `"gene_biotype"` or `"tx_biotype"` defining the *biotype* of the features (e.g. lincRNA, protein_coding, miRNA etc). `EnsDb` databases are designed for Ensembl annotations and contain annotations for all genes (protein coding and non-coding) for a specific Ensembl release.

- **BSgenome** packages (e.g., `BSgenome.Hsapiens.UCSC.hg19`) contain whole genomes of model organisms. See `available.genomes()` for pre-packaged genomes.
    
.small[ https://bioconductor.org/packages/EnsDb.Hsapiens.v86/

https://bioconductor.org/packages/BSgenome/ ]

---
## Annotation methods

- Annotation packages usually contain an object named after the package itself.  These objects are collectively called `AnnotationDb` objects with more specific classes named `OrgDb`, `ChipDb` or `TranscriptDb` objects.

- Methods that can be applied to these objects include `cols()`, `keys()`, `keytypes()` and `select()`.

---
## Annotation methods

| Category   | Function                              | Description                                                      |
|------------|---------------------------------------|------------------------------------------------------------------|
| Discover   | `columns()`                           | List the kinds of columns that can be returned                   |
|            | `keytypes()`                          | List columns that can be used as keys                            |
|            | `keys()`                              | List values that can be expected for a given keytype             |
|            | `select()`                            | Retrieve annotations matching `keys`, `keytype` and `columns`    |

---
## Annotation methods

| Category   | Function                              | Description                                                      |
|------------|---------------------------------------|------------------------------------------------------------------|
| Manipulate | `setdiff()`, `union()`, `intersect()` | Operations on sets                                               |
|            | `duplicated()`, `unique()`            | Mark or remove duplicates                                        |
|            | `%in%`,  `match()`                    | Find matches                                                     |
|            | `any()`, `all()`                      | Are any `TRUE`?  Are all?                                        |
|            | `merge()`                             | Combine two different data.frames based on shared keys           |

---
## Annotation methods

| Category   | Function                              | Description                                                      |
|------------|---------------------------------------|------------------------------------------------------------------|
|  GRanges*  | `transcripts()`, `exons()`, `cds()`   | Features (transcripts, exons, coding sequence) as `GRanges`.     |
|            | `transcriptsBy()` , `exonsBy()`       | Features group by  gene, transcript, etc., as `GRangesList`.     |
|            | `cdsBy()`                             |                                                                  |

---
## KEGG

- KEGG: Kyoto Encyclopedia of Genes and Genomes

- KEGG API R package, `KEGGREST`
    - Essential operations outlined in the vignette

.small[ http://www.genome.jp/kegg/pathway.html

https://bioconductor.org/packages/KEGGREST

http://bioconductor.org/packages/release/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html ]

---
## AnnotationHub

- `AnnotationHub` package - curated database of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms. Examples of use include:
    - Easily access and import Roadmap Epigenomics files.
    - `liftOver` genomic range-based annotations from one coordinate system (e.g, `hg19`) to another (e.g., `GRCh38`).
    - Create `TranscriptDb` and `BSgenome`-style annotation resources 'on the fly' for a diverse set of organisms.
    - Programmatically access the genomic coordiantes of clinically relevant variants cataloged in dbSNP.

- Related packages: `ExperimentHub` - curated data sets

.small[ https://bioconductor.org/packages/AnnotationHub ]