https://bioconductor.org/

Mikhail Dozmorov

Virginia Commonwealth University

02-22-2021

High-throughput sequence workflow

Bioconductor

https://bioconductor.org/

Analysis and comprehension of high-throughput genomic data

  • Statistical analysis designed for large genomic data

  • Interpretation: biological context, visualization, reproducibility

  • Support for all high-throughput technologies

    • Sequencing: RNASeq, ChIPSeq, variants, copy number, ...
    • Microarrays: expression, SNP, ...
    • Flow cytometry, proteomics, images, ...

Bioconductor cheat sheet https://github.com/mikelove/bioc-refcard

Bioconductor by the numbers

  • Project started in 2002

  • Built on and in R, the open source software platform for data science

  • An estimated 2,000,000 users worldwide

  • More than 50,000 unique downloads per month

  • More than 22,000 PubMed Central citations

  • Bioconductor Release: 1,974 biomedical and omics data science software packages (02-20-2021)

  • Receiving submissions of 3-6 new packages per week

Bioconductor: Software for orchestrating high-throughput biological data analysis by Sean Davis

Reference manuals, vignettes

  • All user-visible functions have help pages, most with runnable examples

  • 'Vignettes' are an important feature of Bioconductor: narrative documents that illustrate how to use a package, with integrated code

  • Example: AnnotationHub landing page, AnnotationHub HOW TO's vignette illustrating some fun use cases

https://bioconductor.org/packages/AnnotationHub/

Bioconductor classes

  • Bioconductor makes extensive use of classes to represent complicated data types

    • The core components: classes, generic functions and methods
    • The S4 class system is a set of facilities for object-oriented programming
  • Classes foster interoperability - many different packages can work on the same data - but can be a bit intimidating
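
The three core components can be sketched in base R as follows. This is a minimal illustration, not code from any Bioconductor package: the class SimpleInterval and the generic intervalWidth() are hypothetical names invented for the example.

```r
library(methods)  # S4 facilities (attached by default in interactive R)

# Class: a hypothetical container for genomic intervals (illustration only)
setClass("SimpleInterval",
         slots = c(chrom = "character", start = "integer", end = "integer"),
         validity = function(object) {
             if (any(object@end < object@start)) "end must be >= start" else TRUE
         })

# Generic function: declares the operation, not how it is carried out
setGeneric("intervalWidth", function(x) standardGeneric("intervalWidth"))

# Method: the implementation dispatched when x is a SimpleInterval
setMethod("intervalWidth", "SimpleInterval",
          function(x) x@end - x@start + 1L)

iv <- new("SimpleInterval", chrom = "chr2L",
          start = 16764737L, end = 16766736L)
intervalWidth(iv)   # 2000
```

Because the validity function runs inside new(), an interval whose end precedes its start is rejected at construction time. That kind of guarantee is what lets different packages share the same objects.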

Formal S4 object system

  • Often a class is described on its own help page, e.g., ?GRanges, and in vignettes, e.g., vignette(package="GenomicRanges") to list them, then vignette("GenomicRangesIntroduction") to open one

  • Many methods and classes can be discovered interactively, e.g., methods(class="GRanges") to find out what one can do with a GRanges instance, and methods(findOverlaps) for the classes that the findOverlaps() function operates on

  • In more advanced cases, one can look at the actual definition of a class or method using getClass(), getMethod()

  • Getting help: type ?"findOverlaps, and press <tab> to select help on a specific method; use ?"GRanges-class" for help on a class.
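
The discovery steps above can be tried on a small object. This is a sketch that assumes the GenomicRanges package is installed; the coordinates are made up for illustration.

```r
library(GenomicRanges)   # also attaches IRanges

# A small GRanges instance to experiment with (made-up coordinates)
gr <- GRanges("chr1",
              IRanges(start = c(1, 20), width = 10),
              strand = c("+", "-"))

is(gr)                        # the class of gr and the classes it extends
methods(class = "GRanges")    # what one can do with a GRanges instance
showMethods("findOverlaps")   # signatures findOverlaps() is defined for

# Advanced: the class definition and the method chosen for two ranges objects
getClass("GRanges")
selectMethod("findOverlaps", c("GenomicRanges", "GenomicRanges"))
```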

High-throughput sequence data

DNA/amino acid sequences: FASTA files

  • The Biostrings package is used to represent DNA sequences, with many convenient sequence-related functions, e.g., ?consensusMatrix.

Input & manipulation, FASTA file example:

>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
...
atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt
>NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg
cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg
...

http://bioconductor.org/packages/Biostrings
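
A minimal sketch of reading such a file with Biostrings, assuming the package is installed. To stay self-contained, the example writes a temporary FASTA file holding only the first sequence line of each record shown above, not the full 2000 bp entries.

```r
library(Biostrings)

# Temporary FASTA file with two truncated records (first line of each only)
fa <- tempfile(fileext = ".fa")
writeLines(c(
    ">NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736",
    "gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt",
    ">NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454",
    "ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg"
), fa)

dna <- readDNAStringSet(fa)      # one DNAStringSet element per FASTA record
names(dna)                       # header lines, without the leading ">"
width(dna)                       # sequence lengths: 50 50
consensusMatrix(dna)[1:4, 1:5]   # A/C/G/T counts at the first five positions
```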

Reads: FASTQ files

  • The ShortRead package can be used for lower-level access to FASTQ files via readFastq(), FastqStreamer(), and FastqSampler()

Input & manipulation, FASTQ file example:

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?---#####################

http://bioconductor.org/packages/ShortRead

Quality scores: 'phred-like', encoded. See http://en.wikipedia.org/wiki/FASTQ_format#Encoding
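
As a sketch of how the encoded qualities are read, the first record above can be written to a temporary file and loaded with ShortRead. This assumes the package is installed and that the file uses the Phred+33 offset, which is readFastq()'s default interpretation.

```r
library(ShortRead)

# Temporary FASTQ file holding the first record shown above
fq_path <- tempfile(fileext = ".fastq")
writeLines(c(
    "@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1",
    "CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT",
    "+",
    "HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:"
), fq_path)

fq <- readFastq(fq_path)          # a ShortReadQ object
sread(fq)                         # the reads, as a DNAStringSet
q <- as(quality(fq), "matrix")    # decoded integer scores, one row per read
q[1, 1:5]

# The same decoding by hand, assuming the Phred+33 offset
utf8ToInt("H") - 33L              # 39

# Streaming: process a large file in chunks of at most n records
strm <- FastqStreamer(fq_path, n = 1)
while (length(chunk <- yield(strm)) > 0) {
    print(width(chunk))           # read lengths in this chunk
}
close(strm)
```

Streaming with FastqStreamer() keeps memory use bounded, whereas readFastq() loads the whole file at once.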

Biostrings, DNA or amino acid sequences

Classes

  • XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads)

Methods

  • Manipulation, e.g., reverseComplement()
  • Summary, e.g., letterFrequency()
  • Matching, e.g., matchPDict(), matchPWM()

Related packages: BSgenome for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
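
A short sketch of the classes and a few of the methods listed above, assuming Biostrings is installed. The sequences are toy examples chosen so the results are easy to check by eye.

```r
library(Biostrings)

# Manipulation
x <- DNAString("AACGT")
reverseComplement(x)                      # ACGTT

# Summary: GC fraction per read in a DNAStringSet
reads <- DNAStringSet(c(r1 = "AACGTAACGT", r2 = "GGGCCCATAT"))
gc <- letterFrequency(reads, letters = "GC", as.prob = TRUE)
gc                                        # 0.4 for r1, 0.6 for r2

# Matching: one pattern, then a dictionary of fixed-width patterns
matchPattern("ACGT", reads[["r1"]])       # matches starting at 2 and 7
pd <- PDict(c("ACGT", "CGTA"))
countPDict(pd, reads[["r1"]])             # 2 1
```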

Aligned reads: SAM/BAM files

Input & manipulation: Rsamtools - scanBam(), BamFile()

SAM Header example

@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
...
@SQ SN:chrY LN:59373566
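
A sketch of inspecting a header and a few fields from R, assuming Rsamtools is installed. It uses ex1.bam, a small example file shipped with that package, so the reference names differ from the human chromosomes shown above.

```r
library(Rsamtools)

# Small example BAM file shipped with Rsamtools
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

hdr <- scanBamHeader(bam)[[1]]   # one list element per input file
hdr$targets                      # reference lengths from the @SQ lines (SN -> LN)
names(hdr$text)                  # header record types present in the file

bf <- BamFile(bam)               # a reference to the file; nothing is read yet
seqinfo(bf)                      # the same @SQ information as a Seqinfo object

# Read only selected fields of each alignment record
param <- ScanBamParam(what = c("qname", "flag", "pos", "cigar"))
recs <- scanBam(bf, param = param)[[1]]
str(recs)
```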

GenomicAlignments, Aligned reads

The GenomicAlignments package is used to input reads aligned to a reference genome. See for instance the ?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps")

Classes - GenomicRanges-like behavior

  • GAlignments, GAlignmentPairs, GAlignmentsList

Methods

  • readGAlignments(), readGAlignmentsList()
    • Easy to restrict input, iterate in chunks
  • summarizeOverlaps()
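
The methods above can be sketched as follows, assuming GenomicAlignments and Rsamtools are installed. The region of interest and the single feature are made up, and ex1.bam is the example file shipped with Rsamtools.

```r
library(GenomicAlignments)
library(Rsamtools)

bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Restrict input to a region of interest (requires the .bai index)
roi <- GRanges("seq1", IRanges(1, 500))
aln <- readGAlignments(bam, param = ScanBamParam(which = roi))
aln

# Iterate in chunks of at most yieldSize records
bf <- BamFile(bam, yieldSize = 1000)
open(bf)
n <- 0L
while (length(chunk <- readGAlignments(bf)) > 0) {
    n <- n + length(chunk)       # replace with per-chunk processing
}
close(bf)
n                                # total alignments seen

# Count alignments overlapping each feature
features <- GRanges("seq1", IRanges(1, 500))
se <- summarizeOverlaps(features, aln, mode = "Union", ignore.strand = TRUE)
assay(se)                        # one row per feature
```

Here summarizeOverlaps() only sees the alignments already restricted to the region of interest.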

Genomic variants: VCF files

  • VariantAnnotation - Input and annotation of genomic variants

Classes - GenomicRanges-like behavior

  • VCF -- 'wide'
  • VRanges -- 'tall'

Methods

  • I/O and filtering: readVcf(), readGeno(), readInfo(), readGT(), writeVcf(), filterVcf()
  • Annotation: locateVariants() (variants overlapping ranges), predictCoding(), summarizeVariants()
  • SNPs: genotypeToSnpMatrix(), snpSummary()

http://bioconductor.org/packages/VariantAnnotation
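
A sketch of the 'wide' and 'tall' representations, assuming VariantAnnotation is installed. It reads ex2.vcf, a small example file shipped with the package. The VRanges object is built by hand with illustrative values rather than derived from the file.

```r
library(VariantAnnotation)

# Small example VCF file shipped with VariantAnnotation
fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

# 'Wide': one row per variant, one column per sample
dim(vcf)
rowRanges(vcf)       # variant positions as a GRanges, so range operations apply
geno(vcf)$GT         # genotype matrix, samples in columns
info(vcf)            # INFO fields as a DataFrame

# 'Tall': one element per variant-by-sample combination
# (illustrative values, constructed by hand)
vr <- VRanges(seqnames = "20", ranges = IRanges(14370, width = 1),
              ref = "G", alt = "A", sampleNames = "NA00001")
vr
```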

  • ensemblVEP - Query the Ensembl Variant Effect Predictor.

  • VariantTools - Explore, diagnose, and compare variant calls.

  • VariantFiltering - Filtering of coding and non-coding genetic variants.

  • h5vc - has variant calling functionality.

  • snpStats - Classes and statistical methods for large SNP association studies.

Genome annotations: BED, WIG, GTF, etc. files

  • The rtracklayer package's import() and export() functions can read and write many common file types, e.g., BED, WIG, GTF, ..., in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.

Input: rtracklayer::import()

http://bioconductor.org/packages/rtracklayer
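
A minimal sketch of a BED round trip, assuming rtracklayer is installed. The two BED records are made up for illustration and correspond to the regions named in the FASTA headers earlier in the slides.

```r
library(rtracklayer)

# Temporary BED file; BED uses 0-based, half-open coordinates
bed <- tempfile(fileext = ".bed")
writeLines(c(
    "chr2L\t16764736\t16766736\tNM_078863_up_2000\t0\t+",
    "chr2L\t8382454\t8384454\tNM_001201794_up_2000\t0\t+"
), bed)

gr <- import(bed)    # format inferred from the .bed extension; returns a GRanges
gr
start(gr)            # 1-based starts: 16764737 8382455
width(gr)            # 2000 2000

# Write the ranges back out; export() converts to 0-based starts again
out <- tempfile(fileext = ".bed")
export(gr, out)
readLines(out)
```

The coordinate shift on import is easy to overlook: a BED start of 16764736 becomes a GRanges start of 16764737, while the end is unchanged.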
