Gene
RNA transcript
There are many IDs
Software tools recognize only a handful
Humans better recognize gene names
Avoid errors: map IDs correctly
Gene name ambiguity – not a good ID
Excel introduces errors: it silently converts some gene symbols (e.g., SEPT2, MARCH1) to dates
Problems reaching 100% cross-mapping
Reference standard
69,000 organisms
7,000 viruses, >40,000 prokaryotes, >10,000 eukaryotes
O'Leary, Nuala A., Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput et al. "Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation." Nucleic Acids Research (2016)
NM_ = mRNA (experimentally supported)
XM_ = mRNA (predicted model)
NP_ = protein (experimentally supported)
XP_ = protein (predicted model)
NC_ = complete genomic molecule (e.g., chromosome)
NG_ = incomplete genomic region
https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly
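The prefix convention above can be applied programmatically. The following is a minimal base R sketch; `classify_refseq()` and `refseq_prefixes` are hypothetical names introduced here for illustration, and the lookup covers only the six prefixes listed on this slide, not the full NCBI table.

```r
# Lookup table: RefSeq accession prefix -> molecule type.
# Only the prefixes listed on the slide are included (assumption: the
# full NCBI table has more, which would return NA here).
refseq_prefixes <- c(
  NM = "mRNA",
  XM = "mRNA (predicted)",
  NP = "protein",
  XP = "protein (predicted)",
  NC = "genomic (chromosome)",
  NG = "genomic (incomplete)"
)

# Hypothetical helper: classify accessions by their prefix.
classify_refseq <- function(accessions) {
  # The prefix is everything before the first underscore
  prefix <- sub("_.*$", "", accessions)
  # Unknown prefixes (e.g., Ensembl IDs) yield NA
  unname(refseq_prefixes[prefix])
}

ids <- c("NM_000546.6", "NP_000537.3", "NC_000017.11", "ENSG00000141510")
data.frame(id = ids, type = classify_refseq(ids))
```

IDs that do not follow the RefSeq pattern come back as `NA`, which makes it easy to spot mixed ID types in one column before attempting a conversion.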
This resource lists gene name synonyms, which is useful if you are conducting a comprehensive literature search and need to find articles about a gene that may have been called other names in the past.
clusterProfiler::bitr() function - Biological Id TranslatoR
HGNChelper - R package to correct invalid Human/Mouse Gene Symbols
annotables - R package by Stephen Turner, annotating/converting Gene IDs
AnnotationDbi - R package for manipulation of SQLite-based annotations
biomaRt - R package, interface to BioMart databases (e.g., Ensembl)
http://yulab-smu.top/clusterProfiler-book/chapter14.html#bitr
https://CRAN.R-project.org/package=HGNChelper
https://github.com/stephenturner/annotables
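As an illustration of two of the tools listed above, here is a short sketch using `clusterProfiler::bitr()` and `HGNChelper::checkGeneSymbols()`. It assumes the clusterProfiler, org.Hs.eg.db and HGNChelper packages are installed; the gene symbols are arbitrary examples, not taken from the slides.

```r
library(clusterProfiler)
library(org.Hs.eg.db)
library(HGNChelper)

symbols <- c("TP53", "BRCA1", "EGFR")

# bitr() translates between any keytypes supported by the OrgDb object
# (keytypes(org.Hs.eg.db) lists the valid names).
converted <- bitr(symbols,
                  fromType = "SYMBOL",
                  toType   = c("ENTREZID", "ENSEMBL"),
                  OrgDb    = org.Hs.eg.db)
converted

# checkGeneSymbols() reports whether each symbol is an approved HGNC symbol
# and, where it can, suggests a replacement for outdated or mangled ones.
checked <- checkGeneSymbols(c("TP53", "SEPT2", "2-Sep"))
checked[, c("x", "Approved", "Suggested.Symbol")]
```

Note that `bitr()` can return more rows than input IDs when a symbol maps to several identifiers, so it is worth checking for duplicates before joining the result to other data.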
BioMart R package, biomaRt, workflow:
Run the query
For genomic coordinates, use the database that corresponds to the genome assembly version you are interested in
Biomart has a web interface, operating on the same principles
The getBM() function has three arguments that need to be introduced: filters, attributes and values.
Filters define a restriction on the query: they tell biomaRt what kind of IDs you have, so it knows what to look up. The listFilters() function shows all available filters in the selected dataset.
Attributes define the values we want to retrieve, i.e., which IDs associated with your input IDs you want returned. The listAttributes() function displays all available attributes in the selected dataset.
Values is the vector of IDs you want to convert.
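The three arguments fit together as in the sketch below. It assumes the biomaRt package is installed and that the Ensembl BioMart server is reachable; the dataset, attribute names and the two Ensembl gene IDs are illustrative choices, not part of the original slides.

```r
library(biomaRt)

# Connect to the Ensembl gene mart and select the human dataset.
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Browse what can be used as input (filters) and output (attributes).
head(listFilters(mart))
head(listAttributes(mart))

# The IDs we have: Ensembl gene IDs (these are the "values").
ensembl_ids <- c("ENSG00000141510", "ENSG00000012048")

mapping <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"), # what we want back
  filters    = "ensembl_gene_id",                                    # what kind of IDs we have
  values     = ensembl_ids,                                          # the IDs themselves
  mart       = mart
)
mapping
```

Including the filter column among the attributes, as done here, keeps the input ID next to each result, because `getBM()` does not guarantee that rows come back in the input order.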
Bioconductor provides extensive access to 'annotation' resources, see the "AnnotationData" biocViews hierarchy.
AnnotationDbi is a cornerstone of "AnnotationData" packages. It provides the user interface and database connection code for annotation data packages that use SQLite data storage.
http://bioconductor.org/packages/AnnotationDbi
https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData
org packages (e.g., org.Hs.eg.db) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page ?select
TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) contain gene models (exon coordinates, exon / transcript relationships, etc.) derived from common sources such as the hg38 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the ?exonsBy page, to retrieve all exons grouped by gene or transcript.
https://bioconductor.org/packages/org.Hs.eg.db
https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene
EnsDb packages and databases (e.g., EnsDb.Hsapiens.v86) provide, similar to TxDb packages, gene models, but also protein annotations (protein sequences and protein domains within these) and additional annotation columns such as "gene_biotype" or "tx_biotype" defining the biotype of the features (e.g., lincRNA, protein_coding, miRNA, etc.). EnsDb databases are designed for Ensembl annotations and contain annotations for all genes (protein coding and non-coding) for a specific Ensembl release.
BSgenome packages (e.g., BSgenome.Hsapiens.UCSC.hg19) contain whole genomes of model organisms. See available.genomes() for pre-packaged genomes.
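A brief sketch of querying an EnsDb package by biotype and by gene name follows. It assumes EnsDb.Hsapiens.v86 is installed (loading it also attaches ensembldb and its filter classes); the choice of TP53 and of the returned columns is illustrative only.

```r
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

# All genes of one biotype, returned as a GRanges object.
lincrna <- genes(edb, filter = GeneBiotypeFilter("lincRNA"))
length(lincrna)

# Transcripts of a single gene, with the transcript biotype as metadata.
tp53_tx <- transcripts(edb,
                       filter  = GeneNameFilter("TP53"),
                       columns = c("tx_id", "tx_biotype", "gene_name"))
table(tp53_tx$tx_biotype)
```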
Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb objects, with more specific classes named OrgDb, ChipDb or TxDb.
Methods that can be applied to these objects include columns(), keys(), keytypes() and select().
Category | Function | Description |
---|---|---|
Discover | columns() | List the kinds of columns that can be returned |
 | keytypes() | List columns that can be used as keys |
 | keys() | List values that can be expected for a given keytype |
 | select() | Retrieve annotations matching keys, keytype and columns |
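The four functions in the table are typically used in sequence: discover what is available, then retrieve. A minimal sketch, assuming AnnotationDbi and org.Hs.eg.db are installed (the two gene symbols are arbitrary examples):

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Discover: what can be returned, and what can be used as a key?
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))

# Retrieve: map gene symbols to Entrez and Ensembl IDs plus the full gene name.
# The explicit AnnotationDbi:: prefix avoids clashes with dplyr::select().
annot <- AnnotationDbi::select(org.Hs.eg.db,
                               keys    = c("TP53", "BRCA1"),
                               keytype = "SYMBOL",
                               columns = c("ENTREZID", "ENSEMBL", "GENENAME"))
annot
```

`select()` returns one row per key/column combination, so one-to-many mappings show up as repeated keys in the result.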
Category | Function | Description |
---|---|---|
Manipulate | setdiff(), union(), intersect() | Operations on sets |
 | duplicated(), unique() | Mark or remove duplicates |
 | %in%, match() | Find matches |
 | any(), all() | Are any TRUE? Are all? |
 | merge() | Combine two different data.frames based on shared keys |
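These base R functions are enough to check how well two ID lists overlap and to join annotation onto results. The sketch below uses toy data: the gene lists and logFC values are made up for illustration.

```r
ids_a <- c("TP53", "BRCA1", "EGFR", "TP53")  # contains a duplicate
ids_b <- c("BRCA1", "MYC", "EGFR")

duplicated(ids_a)         # marks the second occurrence of a repeated ID
unique(ids_a)             # drops repeated IDs

intersect(ids_a, ids_b)   # IDs present in both lists
setdiff(ids_a, ids_b)     # IDs only in the first list
union(ids_a, ids_b)       # all distinct IDs

ids_a %in% ids_b          # logical vector: is each element of ids_a in ids_b?
match(ids_b, ids_a)       # position of each ids_b element in ids_a (NA if absent)
all(ids_b %in% ids_a)     # did every ID map?
any(ids_b %in% ids_a)     # did at least one?

# merge(): attach annotation to a results table via a shared key column.
expr  <- data.frame(symbol = c("TP53", "BRCA1", "MYC"),
                    logFC  = c(1.2, -0.8, 2.1))        # toy values
annot <- data.frame(symbol = c("TP53", "BRCA1", "EGFR"),
                    entrez = c("7157", "672", "1956"))
merged <- merge(expr, annot, by = "symbol", all.x = TRUE)  # keep unmapped rows as NA
merged
```

Using `all.x = TRUE` keeps genes that failed to map, so incomplete cross-mapping stays visible as `NA` instead of silently dropping rows.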
Category | Function | Description |
---|---|---|
GRanges* | transcripts(), exons(), cds() | Features (transcripts, exons, coding sequence) as GRanges. |
 | transcriptsBy(), exonsBy(), cdsBy() | Features grouped by gene, transcript, etc., as GRangesList. |
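A short sketch of the flat and grouped accessors on a TxDb object, assuming TxDb.Hsapiens.UCSC.hg38.knownGene is installed. For this package the gene-level names are Entrez gene IDs; the ID "7157" (TP53) is used here only as an example.

```r
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Flat accessors return a single GRanges object.
tx <- transcripts(txdb)
ex <- exons(txdb)

# Grouped accessors return a GRangesList, one element per group.
ex_by_gene <- exonsBy(txdb, by = "gene")      # names are Entrez gene IDs
tx_by_gene <- transcriptsBy(txdb, by = "gene")

# Exons of a single gene, looked up by its Entrez ID.
ex_by_gene[["7157"]]
```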
KEGG: Kyoto Encyclopedia of Genes and Genomes
KEGG API R package, KEGGREST
http://www.genome.jp/kegg/pathway.html
https://bioconductor.org/packages/KEGGREST
http://bioconductor.org/packages/release/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html
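KEGGREST can also be used for ID conversion. The sketch below assumes the KEGGREST package is installed and that the KEGG REST server is reachable; the gene "hsa:7157" (TP53) is an illustrative choice.

```r
library(KEGGREST)

# List human pathways: a named character vector (pathway ID -> description).
hsa_pathways <- keggList("pathway", "hsa")
head(hsa_pathways)

# Convert a KEGG gene ID to the corresponding NCBI Gene ID.
tp53_ncbi <- keggConv("ncbi-geneid", "hsa:7157")
tp53_ncbi

# Find the pathways that contain this gene.
tp53_pathways <- keggLink("pathway", "hsa:7157")
head(tp53_pathways)
```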
AnnotationHub package - curated database of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms. Examples of use include:
liftOver genomic range-based annotations from one coordinate system (e.g., hg19) to another (e.g., GRCh38).
Create TxDb and BSgenome-style annotation resources 'on the fly' for a diverse set of organisms.
Related packages: ExperimentHub - curated data sets
https://bioconductor.org/packages/AnnotationHub
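Searching the hub is the usual first step. The sketch below assumes the AnnotationHub package is installed and that the hub server is reachable; the search terms are illustrative.

```r
library(AnnotationHub)

# Connect to the hub (downloads and caches the resource index on first use).
ah <- AnnotationHub()

# Search the metadata: all terms must match.
ensdb_hits <- query(ah, c("EnsDb", "Homo sapiens"))
ensdb_hits

# Inspect the metadata of the hits to pick a resource.
head(mcols(ensdb_hits)[, c("title", "rdataclass")])
```

A single resource can then be downloaded with `ah[["<ID>"]]`, using one of the "AH..." identifiers shown in the query result.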