
Genomic technologies

Mikhail Dozmorov

Virginia Commonwealth University

02-08-2021

1 / 47

Age of OMICS

http://journal.frontiersin.org/article/10.3389/fpls.2014.00244/full
2 / 47

Genome arithmetics

  • ~3,235 million (~3.2 billion) base pairs (haploid)
  • ~20,000 protein coding genes
  • ~200,000 coding transcripts (isoforms of a gene that each encode a distinct protein product)
http://uswest.ensembl.org/Homo_sapiens/Location/Genome
3 / 47

Genes are unevenly distributed on chromosomes

  • Highly expressed genes positively correlated with:

    • Very short indels
    • High gene density
    • High GC content
    • High density of Short interspersed nuclear elements (SINE) repeats
    • Low density of Long interspersed nuclear elements (LINE) repeats
    • Both housekeeping and tissue-specific expression
  • The opposite is true for lowly expressed genes

Versteeg, Rogier, Barbera D. C. van Schaik, Marinus F. van Batenburg, Marco Roos, Ramin Monajemi, Huib Caron, Harmen J. Bussemaker, and Antoine H. C. van Kampen. “The Human Transcriptome Map Reveals Extremes in Gene Density, Intron Length, GC Content, and Repeat Pattern for Domains of Highly and Weakly Expressed Genes.” Genome Research 13, no. 9 (September 2003): 1998–2004. https://doi.org/10.1101/gr.1649303.
4 / 47

Genes are unevenly distributed on chromosomes

Chromosome 19 is the most gene-dense chromosome in the human genome

5 / 47

Half of the human genome is low complexity

Retrotransposons - fossil records of evolution

  • McClintock's "jumping genes" in maize
  • Retrotransposons use a "copy/paste" mechanism - they are transcribed to RNA, reverse transcribed into DNA, and inserted at a new location
  • DNA transposons use a "cut/paste" mechanism - they excise themselves and insert elsewhere in the genome
https://www.ncbi.nlm.nih.gov/pubmed/19763152
6 / 47

Genome variability

A typical human genome differs from the reference genome at 4.1 to 5.0 million sites

  • Over 99.9% are SNPs or short indels
  • Only 1-4% are rare (frequency <0.5% in the population)
  • Contains 2,100 – 2,500 structural variants, which affect more bases (~20 million bases):
    • ~1,000 large deletions
    • ~1,094 Alu, L1, and SVA (SINE-VNTR-Alu) insertions
    • ~160 CNVs (copy number variants)
    • ~10 inversions
    • ~4 NUMTs (nuclear mitochondrial DNA segments)
Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, Zilversmit M, Cartwright R, Rouleau GA, Daly M, Stone EA, Hurles ME, Awadalla P; 1000 Genomes Project. Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011 Jun 12;43(7):712-4. doi: 10.1038/ng.862. https://www.nature.com/articles/ng.862
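
To put these counts in proportion, the short sketch below works through the arithmetic. It assumes a haploid genome of roughly 3.2 billion bases; the variable names are illustrative only.

```python
# Rough proportions implied by the numbers on this slide.
# Assumption: haploid genome size of about 3.2 billion bases.
GENOME_SIZE = 3.2e9

variant_sites_low, variant_sites_high = 4.1e6, 5.0e6
structural_variant_bases = 20e6

# Average spacing between variant sites, in bases.
spacing_sparse = GENOME_SIZE / variant_sites_low    # fewer sites, wider spacing
spacing_dense = GENOME_SIZE / variant_sites_high    # more sites, tighter spacing

# Fraction of the genome covered by each class of variation.
site_fraction_low = variant_sites_low / GENOME_SIZE
site_fraction_high = variant_sites_high / GENOME_SIZE
sv_fraction = structural_variant_bases / GENOME_SIZE

print(f"One variant site every {spacing_dense:.0f}-{spacing_sparse:.0f} bases")
print(f"Variant sites: {site_fraction_low:.2%}-{site_fraction_high:.2%} of positions")
print(f"Structural variants: {sv_fraction:.2%} of bases")
```

Under this assumption, the few thousand structural variants touch several times more bases than the millions of SNPs and short indels combined.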
7 / 47

Why sequence a reference genome?

  • Determine the "complete" sequence of a human haploid genome

  • Identify the sequence and location of every protein coding gene

  • Use as a "map" with which to track the location and frequency of genetic variation in the human genome

  • Unravel the genetic architecture of inherited and somatic human diseases

  • Understand genome and species evolution

8 / 47

DNA sequencing: Maxam-Gilbert, Sanger

  • Sequencing by synthesis (not degradation)

  • Radioactive primers hybridize to DNA

  • Polymerase + normal dNTPs + ddNTPs (dideoxynucleotide terminators) at low concentration

  • 1 lane per base, visually interpret ladder
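
The chain-termination idea can be illustrated with a toy simulation. This is a simplified sketch, not a model of real chemistry: it works directly on the sequence of the newly synthesized strand, and the termination probability, molecule count, and function names are assumptions chosen for illustration.

```python
import random

def sanger_lanes(synthesized, ddntp_fraction=0.1, n_molecules=2000, seed=0):
    """Run four toy reactions, one per ddNTP; return fragment lengths per lane."""
    rng = random.Random(seed)
    lanes = {}
    for terminator in "ACGT":
        lengths = set()
        for _ in range(n_molecules):
            for position, base in enumerate(synthesized, start=1):
                # A ddNTP is occasionally incorporated instead of the normal
                # dNTP, which stops extension at this position.
                if base == terminator and rng.random() < ddntp_fraction:
                    lengths.add(position)
                    break
        lanes[terminator] = lengths
    return lanes

def read_ladder(lanes):
    """Read the gel from the shortest to the longest fragment across lanes."""
    by_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            by_length[length] = base
    return "".join(by_length[length] for length in sorted(by_length))

sequence = "GATTACAGATTACA"
lanes = sanger_lanes(sequence)
called = read_ladder(lanes)
print(called)
```

Because each fragment length appears in exactly one lane, sorting the bands by length recovers the sequence one base at a time.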

9 / 47

How to sequence a human genome: Lee Hood automation

The Human Genome project: Early days

Green, Eric D., James D. Watson, and Francis S. Collins. "Human Genome Project: Twenty-Five Years of Big Biology." Nature 526, no. 7571 (October 1, 2015): 29–31. doi:10.1038/526029a.
10 / 47

The competing human genome projects

11 / 47

A first map of the human genome

http://www.nature.com/nature/journal/v409/n6822/full/409860a0.html
12 / 47

Human genome is sequenced!

13 / 47

The Human Genome roadmap

 
https://www.davidstreams.com/mis-apuntes/human-genome-project/
14 / 47

Evolution of sequencing technologies

  • "Massively parallel" sequencing

  • "High-throughput" sequencing

  • "Ultra high-throughput" sequencing

  • "Next generation" sequencing (NGS)

  • "Second generation" sequencing

15 / 47

Evolution of sequencing technologies

  • 2005: 454 (Roche)

  • 2006: Solexa (Illumina)

  • 2007: ABI/SOLiD (Life Technologies)

  • 2010: Complete Genomics

  • 2010: Ion Torrent (Life Technologies)

  • 2011: Pacific Biosciences

  • 2015: Oxford Nanopore Technologies

16 / 47

Sequencing in a nutshell

  • Cut the long DNA into smaller segments (several hundreds to several thousand bases)

  • Sequence each segment: start from one end and sequence along the chain, base by base

  • Sequencing stops after a limited number of bases because the signal becomes too noisy

  • Results from sequencing are many sequence pieces. The lengths vary, usually up to about a thousand bases from Sanger, and from tens to several hundred bases from NGS

  • The sequence pieces are called "reads" for NGS data
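
The steps above can be sketched as a small simulation. It is a minimal illustration under stated assumptions: a random toy genome, uniformly placed fragments, error-free reads of fixed length, and hypothetical function and parameter names.

```python
import random

def shotgun_reads(genome, n_fragments=200, frag_range=(200, 600),
                  read_length=100, seed=1):
    """Cut random fragments from the genome and read one end of each."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_fragments):
        frag_len = rng.randint(*frag_range)
        start = rng.randint(0, len(genome) - frag_len)
        fragment = genome[start:start + frag_len]
        # Sequencing proceeds base by base from one end and stops early.
        reads.append(fragment[:read_length])
    return reads

genome_rng = random.Random(0)
genome = "".join(genome_rng.choice("ACGT") for _ in range(5000))
reads = shotgun_reads(genome)

# Average depth: total sequenced bases divided by genome length.
coverage = sum(len(read) for read in reads) / len(genome)
print(len(reads), "reads, mean coverage", coverage)
```

With these settings, 200 reads of 100 bases over a 5,000-base genome give a mean coverage of 4x, although individual positions are covered unevenly.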

17 / 47

Solexa (Illumina) sequencing (2006)

  • PCR amplify DNA fragments

  • Immobilize fragments on a solid surface, amplify

  • Reversible terminator sequencing with 4 color dye-labelled nucleotides

Video of Illumina sequencing, http://www.youtube.com/watch?v=77r5p8IBwJk (1.5m), https://www.youtube.com/watch?v=fCd6B5HRaZ8 (5m)

18 / 47

Cluster amplification by "bridge" PCR

https://binf.snipcademy.com/lessons/ngs-techniques/bridge-pcr
19 / 47

Clonal amplification

20 / 47

Base calling

  • 6 cycles with base-calling
https://www.youtube.com/watch?v=IzXQVwWYFv4 https://www.youtube.com/watch?time_continue=65&v=tuD-ST5B3QA
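
As a rough illustration of base calling, each cycle can be reduced to picking the brightest of the four dye channels. The intensity values below are made up, and real base callers also correct for channel cross-talk and phasing, which this sketch ignores.

```python
def call_bases(intensities, channels="ACGT"):
    """Call one base per cycle by taking the channel with the highest intensity."""
    calls = []
    for cycle in intensities:
        brightest = max(range(len(channels)), key=lambda i: cycle[i])
        calls.append(channels[brightest])
    return "".join(calls)

# Hypothetical intensities for one cluster over 6 cycles, in A, C, G, T order.
cluster = [
    (0.92, 0.05, 0.02, 0.08),
    (0.04, 0.88, 0.10, 0.03),
    (0.06, 0.07, 0.81, 0.12),
    (0.03, 0.09, 0.05, 0.90),
    (0.10, 0.04, 0.06, 0.85),
    (0.08, 0.11, 0.79, 0.07),
]
read = call_bases(cluster)
print(read)
```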
21 / 47

Illumina sequencers

  • Massive improvement of the cluster density - higher output
  • Less expensive than previous sequencers, with faster runs
https://www.illumina.com/systems/sequencing-platforms.html https://blog.genohub.com/2017/01/10/illumina-unveils-novaseq-5000-and-6000/ http://www.mrdnalab.com/illumina-novaseq.html
22 / 47

Solexa (Illumina) sequencing: summary

Advantages:

  • Best throughput, accuracy and read length for any 2nd gen. sequencer
  • Fast & robust library preparation

Disadvantages:

  • Inherent limits to read length (practically, 150bp)
  • Some runs are error prone
Video of Illumina sequencing https://www.youtube.com/watch?v=womKfikWlxM (5m)
23 / 47

Single-end vs. paired-end sequencing

  • Single-end sequencing: sequence one end of the DNA segment.

  • Paired-end sequencing: sequence both ends of a DNA segment.

    • The resulting reads are "paired", separated by a certain distance (the length of the DNA segment, usually a few hundred bp).
    • Paired-end data can be used as single-end data, but contain extra information that is useful in some cases, e.g., detecting structural variations in the genome.
    • Modeling is more complicated.
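
One way the extra information helps: if the two reads of a pair map much farther apart (or closer together) than the expected fragment length, the sample may differ structurally from the reference in between. The sketch below assumes pairs are already aligned; the expected span, tolerance, and tuple layout are hypothetical.

```python
def flag_discordant(pairs, expected_span=300, tolerance=100):
    """Return pairs whose mapped span deviates from the expected fragment length."""
    flagged = []
    for name, left_start, right_end in pairs:
        span = right_end - left_start
        if abs(span - expected_span) > tolerance:
            flagged.append((name, span))
    return flagged

# (pair name, leftmost mapped base, rightmost mapped base) on the reference
aligned_pairs = [
    ("pair1", 1000, 1310),
    ("pair2", 5000, 5290),
    ("pair3", 9000, 12000),  # spans 3,000 bases on the reference
]
discordant = flag_discordant(aligned_pairs)
print(discordant)
```

A span far larger than the fragment length, as for `pair3`, is consistent with a deletion in the sample relative to the reference; real callers require many supporting pairs before reporting a variant.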
24 / 47

Paired-end sequencing - a workaround to sequence longer fragments

  • Read one end of the molecule, flip, and read the other end
  • Generate a pair of reads separated by up to 500bp, with inward orientation
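
The "inward orientation" can be made concrete with a short sketch: the first read is taken from the start of the fragment, and the second is the reverse complement of its end, so the two reads point toward each other. The fragment and read length are arbitrary examples.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_length):
    """Read one end of the fragment, then the other end from the opposite strand."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2

fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"
read1, read2 = paired_end_reads(fragment, 8)

# Bases in the middle of the fragment that neither read covers.
unsequenced = len(fragment) - 2 * 8
print(read1, read2, unsequenced)
```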
25 / 47

Sequencing applications

NGS has a wide range of applications

  • DNA-seq: sequence genomic DNA

  • RNA-seq: sequence RNA products

  • ChIP-seq: detect protein-DNA interaction sites

  • Bisulfite sequencing (BS-seq): measure DNA methylation levels

  • Many others

NGS has largely replaced microarrays, offering better data: greater dynamic range and higher signal-to-noise ratios.

26 / 47

DNA-seq (Whole-Genome sequencing)

  • Sequence the untreated genomic DNA.

    • Obtain DNA from cells, cut into small pieces then sequence the segments.
  • Goals: Compare with the reference genome and look for genetic variants

    • Single nucleotide polymorphisms (SNPs)
    • Insertions/deletions (indels)
    • Copy number variations (CNVs)
    • Other structural variations (gene fusion, etc.)
  • De novo assembly of a new genome
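
Comparing reads with the reference can be sketched as a naive pileup: stack the aligned reads over each position and report sites where most reads disagree with the reference. This is only an illustration; the depth and fraction thresholds are arbitrary, reads are assumed to be aligned without gaps, and real variant callers model base quality and diploid genotypes.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report positions where the dominant read base differs from the reference."""
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth >= min_fraction:
            snps.append((position, reference[position], base))
    return snps

reference = "ACGTACGTAC"
# (0-based alignment start, read sequence); all reads carry A at position 5
aligned_reads = [(0, "ACGTAAGT"), (1, "CGTAAGTA"), (2, "GTAAGTAC"), (3, "TAAGT")]
snps = call_snps(reference, aligned_reads)
print(snps)
```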
27 / 47

Variations of DNA-seq

  • Targeted sequencing, e.g., exome sequencing

    • Sequence the genomic DNA at targeted genomic regions
    • Cheaper than whole-genome DNA-seq, so the savings can go toward a larger sample size (more individuals)
    • The targeted genomic regions need to be "captured" first using technologies like microarrays
  • Metagenomic sequencing

    • Sequence the DNA of a mixture of species, mostly microbes, in order to understand the microbial environments
    • The goal is to determine the number of species, their genomes, and their proportions in the population
    • De novo assembly is required, but the number and proportions of species are unknown, which makes assembly challenging
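
Once reads have been assigned to species (the hard part, which this sketch assumes is already done), proportions can be estimated by normalizing read counts by genome length, because longer genomes yield more reads per cell. The species names, counts, and genome sizes are invented for illustration.

```python
def relative_abundance(read_counts, genome_lengths):
    """Estimate species proportions from reads per base of genome."""
    depth = {species: read_counts[species] / genome_lengths[species]
             for species in read_counts}
    total = sum(depth.values())
    return {species: value / total for species, value in depth.items()}

# Hypothetical assignments: species_a has 4x the reads but also a 4x larger genome.
read_counts = {"species_a": 8000, "species_b": 2000}
genome_lengths = {"species_a": 4_000_000, "species_b": 1_000_000}

abundance = relative_abundance(read_counts, genome_lengths)
print(abundance)
```

In this example the raw read fractions (80% vs. 20%) would be misleading; after length normalization the two species appear equally abundant.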
28 / 47

RNA-seq

Sequence the "transcriptome": the set of RNA molecules

Goals

  • Catalogue RNA products
  • Determine transcriptional structures: alternative splicing, gene fusion, etc.

  • Quantify gene expression: the sequencing counterpart of gene expression microarrays
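
For the quantification goal, a common normalization is transcripts per million (TPM): divide each gene's read count by its length, then rescale so the values sum to one million. The sketch below uses made-up counts and lengths and ignores details such as effective length and multi-mapping reads.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million."""
    # Reads per kilobase removes the bias toward longer genes.
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(per_kb.values())
    return {gene: rate / scale * 1e6 for gene, rate in per_kb.items()}

counts = {"geneA": 100, "geneB": 100, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

expression = tpm(counts, lengths_bp)
for gene, value in expression.items():
    print(gene, round(value))
```

Here `geneC` has three times the reads of `geneA` but is also three times longer, so both end up with the same TPM.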

29 / 47

ChIP-seq

  • Chromatin-Immunoprecipitation (ChIP) followed by sequencing (seq): sequencing version of ChIP-chip

  • Used to detect locations of certain "events" on the genome:

    • Transcription factor binding
    • DNA methylation and histone modifications
  • A type of "captured" sequencing: the ChIP step captures the genomic regions of interest
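
Locating such "events" usually comes down to finding regions where ChIP reads pile up relative to a control. The sketch below is a deliberately crude version: it counts read start positions in fixed windows and reports windows with a high ChIP-to-control ratio. Window size, fold threshold, and pseudocount are arbitrary, and library-size normalization and statistical testing are omitted.

```python
from collections import Counter

def enriched_windows(chip_positions, control_positions,
                     window=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for windows enriched in ChIP over control."""
    chip = Counter(position // window for position in chip_positions)
    control = Counter(position // window for position in control_positions)
    hits = []
    for index in sorted(chip):
        # The pseudocount avoids division by zero in windows without control reads.
        fold = (chip[index] + pseudocount) / (control[index] + pseudocount)
        if fold >= min_fold:
            hits.append((index * window, (index + 1) * window, fold))
    return hits

# Hypothetical read start positions on one chromosome.
chip_reads = [1010, 1020, 1050, 1100, 1150, 1190, 1195, 3000, 5200]
control_reads = [1100, 3050, 5210, 7000]

peaks = enriched_windows(chip_reads, control_reads)
print(peaks)
```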

30 / 47

What matters is what you feed into the sequencing machine

https://liorpachter.wordpress.com/seq/
31 / 47

Extra

32 / 47

GENCODE – Annotation Gene Features

  • ~21,000 protein coding genes

  • PolyA+

    • Almost completely spliced before nuclear export – co-transcriptional splicing "first transcribed – first spliced"
    • Most have at least 2 dominant splice forms
    • Show allele specific expression – potential imprinting
  • PolyA-

    • Many are lncRNAs
    • Also show allele specific expression

https://www.gencodegenes.org/

33 / 47

GENCODE – Annotation Gene Features

  • Most (62%) of the genome is transcribed

    • <5% can be identified as exons
  • ~12,000 pseudogenes – results of duplications

    • 876 are transcribed – can have regulatory function as decoys
    • Infrequently spliced
  • ~10,000 lncRNA = noncoding RNAs >200bp

    • 92% are not translated
    • Show tissue-specific expression – more than protein coding genes
    • 33% are primate specific but few are human specific – most new genes are in this category
    • Poorly spliced – most are two exon transcripts
34 / 47

GENCODE – Annotation Gene Features

  • ~9,000 small RNAs - many of the lncRNA transcripts are processed into stable small RNAs

    • tRNA, miRNA, siRNA, snRNA, snoRNA
  • ~82,000 – 128,000 transcription start sites - depending on detection method

    • ~44% are near annotated transcripts
  • ~5,000 RNA edits occur post transcription

    • Mostly A-to-I conversions, read as G (ADAR pathway)
    • 94% are in transcribed repeat elements
      • Remaining are mostly in introns, 3’UTRs
      • Very few (123) in protein coding sequences
35 / 47

Ion Torrent: pH sensing of base incorporation

36 / 47

Platforms: Ion Torrent

  • Low substitution error rate, but indels are problematic; no paired-end reads
  • Inexpensive and fast turn-around for data production
  • Improved computational workflows for analysis
37 / 47

Pacific Biosciences: Long reads

  • Structural variant discovery
  • De novo genome assembly

https://www.forbes.com/forbes/2009/1005/revolutionaries-science-genomics-gene-machine.html

38 / 47

Pacific Biosciences: summary

Key Points:

  • 1 DNA molecule and 1 polymerase in each well (zero-mode waveguide)
  • 4 colors flash in real time as polymerase acts
  • Methylated cytosine has distinct pattern
  • No theoretical limit to DNA fragment length

Caveats:

  • Higher error rate (1-2%), but the errors are random
  • Lower throughput, roughly 5 gigabases per run
39 / 47

Nanopore sequencing

  • Nearly 30-year-old technology
http://www2.technologyreview.com/news/427677/nanopore-sequencing/
40 / 47

Nanopore sequencing

  • Nanopore sequencing with ONT is accurate and relatively reliable
  • Current yield per run: ~5 Gbp, 97% identity (i.e., 3% error rate)
https://www.technologyreview.com/s/600887/with-patent-suit-illumina-looks-to-tame-emerging-british-rival-oxford-nanopore/
Video of Ion Torrent chemistry, http://www.youtube.com/watch?v=yVf2295JqUg (2.5m)
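
To relate the identity figure to the quality scores used elsewhere in sequencing, an error rate can be converted to the Phred scale, Q = -10 log10(error). The sketch treats "1 - identity" as a per-base error probability, which is a simplification, and the 10 kb read length is an arbitrary example.

```python
import math

def identity_to_phred(identity):
    """Convert a read identity (0-1) to an approximate Phred quality score."""
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)

quality = identity_to_phred(0.97)

# Expected number of erroneous bases in a hypothetical 10 kb read.
expected_errors = 10_000 * (1.0 - 0.97)

print(f"97% identity is roughly Q{quality:.1f}")
print(f"About {expected_errors:.0f} errors expected in a 10 kb read")
```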
41 / 47

Nanopore sequencing

  • Key advantage - portability
Video of Nanopore DNA sequencing technology https://www.youtube.com/watch?v=CE4dW64x3Ts (4.5m)
https://phys.org/news/2016-08-nasa-dna-sequencing-space-success.html
42 / 47

Nanopore for human genome sequencing

  • Closes 12 gaps
  • Phased the entire major histocompatibility complex (MHC) region, one of the most gene-dense and highly variable regions of the genome
Jain, Miten, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, et al. “Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads.” Nature Biotechnology, January 29, 2018. https://doi.org/10.1038/nbt.4060. https://www.genengnews.com/gen-exclusives/first-nanopore-sequencing-of-human-genome/77901044
43 / 47

Nanopore technology

  • Nanopore sequencing yields raw signals reflecting modulation of the ionic current at each pore by a DNA molecule.
  • The resulting time series of nanopore translocation ‘events’ is base-called by proprietary software running as a cloud service.
https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu555
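
The notion of an "event" can be illustrated by segmenting a current trace into stretches of roughly constant level. This is a toy sketch with made-up current values and an arbitrary threshold; it is not the segmentation used by any actual base caller.

```python
def segment_events(current, threshold=5.0):
    """Split a current trace into events; return (mean level, n samples) per event."""
    if not current:
        return []
    events = []
    block = [current[0]]
    for sample in current[1:]:
        block_mean = sum(block) / len(block)
        if abs(sample - block_mean) > threshold:
            # A jump in current marks the start of a new event.
            events.append((block_mean, len(block)))
            block = [sample]
        else:
            block.append(sample)
    events.append((sum(block) / len(block), len(block)))
    return events

# Hypothetical ionic current samples (arbitrary units) with three distinct levels.
trace = [80.1, 79.8, 80.3, 95.2, 94.9, 95.1, 95.0, 70.2, 69.9]
events = segment_events(trace)
for mean_level, n_samples in events:
    print(f"{mean_level:.1f} over {n_samples} samples")
```

Each event's mean level and duration would then be passed to a base caller, which maps sequences of levels to bases.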
44 / 47

Nanopore base callers

  • Accurate base calling is paramount, as it largely determines the usable accuracy of the technology.
  • Nanonet, Albacore, Scrappie
  • Most modern basecallers use neural networks.
https://github.com/rrwick/Basecalling-comparison
45 / 47

Nanopore analysis

  • The resulting files for each sequenced read are stored in ‘FAST5’ format, an application of the HDF5 format.
  • poretools - a toolkit for analyzing nanopore sequence data.
https://github.com/arq5x/poretools https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu555
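
A typical first analysis step is summarizing read lengths. The sketch below computes a few such statistics, including N50 (the length at which reads that long or longer contain at least half of all sequenced bases). It operates on a plain list of lengths and does not use or reproduce the poretools interface; the example lengths are invented.

```python
def n50(read_lengths):
    """Return the read length at which half of all bases are in reads this long or longer."""
    ordered = sorted(read_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0

def summarize(read_lengths):
    """Collect basic read-length statistics for a sequencing run."""
    return {
        "reads": len(read_lengths),
        "total_bases": sum(read_lengths),
        "longest": max(read_lengths),
        "mean": sum(read_lengths) / len(read_lengths),
        "n50": n50(read_lengths),
    }

lengths = [100, 200, 300, 400, 500]
stats = summarize(lengths)
print(stats)
```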
46 / 47

PacBio vs. Oxford Nanopore sequencing

https://blog.genohub.com/2017/06/16/pacbio-vs-oxford-nanopore-sequencing/
47 / 47
