Genomic technologies

Mikhail Dozmorov, Virginia Commonwealth University, 02-08-2021

Age of OMICS

http://journal.frontiersin.org/article/10.3389/fpls.2014.00244/full

Genome arithmetics

~3.2 billion base pairs (haploid)
~20,000 protein coding genes
~200,000 coding transcripts (isoforms of a gene that each encode a distinct protein product)
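
As a rough illustration, the rounded figures above can be combined into two averages. This is only back-of-the-envelope arithmetic; the 3.2 billion bp genome size is a rounded assumption, and the variable names are made up for this sketch.

```python
# Back-of-the-envelope genome arithmetic using rounded figures from this slide.
HAPLOID_BP = 3.2e9              # assumed haploid genome size, ~3.2 billion bp
PROTEIN_CODING_GENES = 20_000
CODING_TRANSCRIPTS = 200_000

# Average number of coding isoforms per protein-coding gene.
isoforms_per_gene = CODING_TRANSCRIPTS / PROTEIN_CODING_GENES

# Average gene density, ignoring the uneven distribution across chromosomes.
genes_per_mb = PROTEIN_CODING_GENES / (HAPLOID_BP / 1e6)

print(f"{isoforms_per_gene:.0f} coding isoforms per gene on average")
print(f"{genes_per_mb:.2f} protein-coding genes per megabase on average")
```

These are genome-wide averages only; actual gene density varies strongly between chromosomes and regions.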

http://uswest.ensembl.org/Homo_sapiens/Location/Genome

Genes are unevenly distributed on chromosomes

Domains of highly expressed genes are positively correlated with:
- Very short introns
- High gene density
- High GC content
- High density of short interspersed nuclear element (SINE) repeats
- Low density of long interspersed nuclear element (LINE) repeats
- Both housekeeping and tissue-specific expression
The opposite is true for domains of weakly expressed genes

Versteeg, Rogier, Barbera D. C. van Schaik, Marinus F. van Batenburg, Marco Roos, Ramin Monajemi, Huib Caron, Harmen J. Bussemaker, and Antoine H. C. van Kampen. “The Human Transcriptome Map Reveals Extremes in Gene Density, Intron Length, GC Content, and Repeat Pattern for Domains of Highly and Weakly Expressed Genes.” Genome Research 13, no. 9 (September 2003): 1998–2004. https://doi.org/10.1101/gr.1649303.

Genes are unevenly distributed on chromosomes

Chromosome 19 is the most gene-dense chromosome in the human genome

Half of the human genome is low complexity

Retrotransposons - fossil records of evolution

McClintock's "jumping genes" in maize
Retrotransposons use a "copy/paste" mechanism - transcribed to RNA, then reverse transcribed into DNA that inserts at a new location
DNA transposons use a "cut/paste" mechanism - they excise themselves and insert at another place
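
The difference between the two mechanisms can be sketched on a toy string. This is a simplified illustration, not a model of the biochemistry: the function names, the toy genome, and the coordinates are all made up for the example.

```python
def copy_paste(genome: str, start: int, end: int, target: int) -> str:
    """Retrotransposon-like move: the element stays in place and a copy is inserted."""
    element = genome[start:end]
    return genome[:target] + element + genome[target:]


def cut_paste(genome: str, start: int, end: int, target: int) -> str:
    """DNA-transposon-like move: the element is excised, then inserted elsewhere.

    `target` is a position in the genome after excision.
    """
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


genome = "AAAA" + "TTGCA" + "CCCCGGGG"   # toy genome; "TTGCA" is the mobile element
copied = copy_paste(genome, 4, 9, len(genome))
moved = cut_paste(genome, 4, 9, 12)

print(copied, len(copied))   # genome grows by the element length
print(moved, len(moved))     # genome length is unchanged
```

Copy/paste increases the genome size with every event, which is one way to see why retrotransposon copies accumulate over evolutionary time.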

https://www.ncbi.nlm.nih.gov/pubmed/19763152

Genome variability

A typical human genome differs from the reference genome at 4.1 to 5.0 million sites

Over 99.9% of these variants are SNPs (single nucleotide polymorphisms) or short indels
Only 1-4% are rare (frequency <0.5% in the population)
A typical genome contains 2,100 – 2,500 structural variants, which affect more bases (~20 million bases):
- ~1,000 large deletions
- ~1,094 Alu, L1, and SVA (SINE-VNTR-Alu) insertions
- ~160 CNVs
- ~10 inversions
- ~4 NUMTs (nuclear mitochondrial DNA segments)

Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, Zilversmit M, Cartwright R, Rouleau GA, Daly M, Stone EA, Hurles ME, Awadalla P; 1000 Genomes Project. Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011 Jun 12;43(7):712-4. doi: 10.1038/ng.862. https://www.nature.com/articles/ng.862

Why sequence a reference genome?

Determine the "complete" sequence of a human haploid genome
Identify the sequence and location of every protein coding gene
Use as a "map" with which to track the location and frequency of genetic variation in the human genome
Unravel the genetic architecture of inherited and somatic human diseases
Understand genome and species evolution

DNA sequencing: Maxam-Gilbert, Sanger

Sanger: sequencing by synthesis (Maxam-Gilbert: chemical degradation)
Radioactive primers hybridize to DNA
Polymerase + dNTPs (normal nucleotides) + ddNTPs (dideoxynucleotide terminators) at low concentration
1 lane per base, visually interpret the ladder

https://en.wikipedia.org/wiki/Maxam%E2%80%93Gilbert_sequencing

https://www.youtube.com/watch?v=bEFLBf5WEtc

How to sequence a human genome: Lee Hood automation

The Human Genome project: Early days

Green, Eric D., James D. Watson, and Francis S. Collins. "Human Genome Project: Twenty-Five Years of Big Biology." Nature 526, no. 7571 (October 1, 2015): 29–31. doi:10.1038/526029a.

The competing human genome projects

A first map of the human genome

http://www.nature.com/nature/journal/v409/n6822/full/409860a0.html

Human genome is sequenced!

The Human Genome roadmap

https://www.davidstreams.com/mis-apuntes/human-genome-project/

Evolution of sequencing technologies

"Massively parallel" sequencing
"High-throughput" sequencing
"Ultra high-throughput" sequencing
"Next generation" sequencing (NGS)
"Second generation" sequencing

Evolution of sequencing technologies

2005: 454 (Roche)
2006: Solexa (Illumina)
2007: ABI/SOLiD (Life Technologies)
2010: Complete Genomics
2010: Ion Torrent (Life Technologies)
2011: Pacific Biosciences
2015: Oxford Nanopore Technologies

Sequencing in a nutshell

Cut the long DNA into smaller segments (several hundred to several thousand bases)
Sequence each segment: start from one end and sequence along the chain, base by base
The process stops after a while because the noise level becomes too high
The result is many sequence pieces of varying length: typically up to about a thousand bases from Sanger and up to several hundred from NGS
The sequence pieces are called "reads" for NGS data
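
The steps above can be sketched as a small simulation. This is a toy model under stated assumptions: fragments are sampled uniformly at random, every read has the same fixed length, and there are no sequencing errors. The function name `shotgun_reads` and the "coverage" summary are introduced here for illustration only.

```python
import random


def shotgun_reads(dna: str, n_fragments: int, frag_len: int, read_len: int,
                  seed: int = 0) -> list[str]:
    """Sample random fragments and read the first `read_len` bases of each."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_fragments):
        start = rng.randrange(0, len(dna) - frag_len + 1)
        fragment = dna[start:start + frag_len]
        # Sequencing starts at one end and stops before the fragment ends.
        reads.append(fragment[:read_len])
    return reads


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(1000))   # toy 1 kb "genome"
reads = shotgun_reads(genome, n_fragments=50, frag_len=200, read_len=50)

# Average number of times each base is sequenced (total read bases / genome size).
coverage = sum(len(r) for r in reads) / len(genome)
print(len(reads), "reads, mean coverage", coverage)
```

Each read is a short substring of the original sequence, which is why downstream analysis has to either align reads to a reference or assemble them.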

Solexa (Illumina) sequencing (2006)

PCR amplify DNA fragments
Immobilize fragments on a solid surface, amplify
Reversible terminator sequencing with 4 color dye-labelled nucleotides

Video of Illumina sequencing, http://www.youtube.com/watch?v=77r5p8IBwJk (1.5m), https://www.youtube.com/watch?v=fCd6B5HRaZ8 (5m)

Cluster amplification by "bridge" PCR

https://binf.snipcademy.com/lessons/ngs-techniques/bridge-pcr

Clonal amplification

Base calling

6 cycles with base-calling
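
A minimal sketch of the idea behind base calling with four dye colors: in each cycle, the brightest channel of a cluster determines the base. The intensities and the channel order below are made-up assumptions; production base callers also correct for effects such as cross-talk between dyes, which this sketch ignores.

```python
CHANNELS = "ACGT"   # assumed order of the four dye channels


def call_bases(intensities: list[tuple[float, float, float, float]]) -> str:
    """Call one base per cycle by picking the brightest of the four channels."""
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Six made-up cycles of (A, C, G, T) intensities for a single cluster.
cluster = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.8, 0.1, 0.0),
    (0.0, 0.1, 0.1, 0.7),
    (0.2, 0.1, 0.9, 0.1),
    (0.8, 0.2, 0.1, 0.1),
    (0.1, 0.1, 0.2, 0.9),
]
print(call_bases(cluster))   # prints ACTGAT
```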

https://www.youtube.com/watch?v=IzXQVwWYFv4 https://www.youtube.com/watch?time_continue=65&v=tuD-ST5B3QA

Illumina sequencers

Massive improvement of the cluster density - higher output
Less expensive than previous sequencers, with faster runs

https://www.illumina.com/systems/sequencing-platforms.html https://blog.genohub.com/2017/01/10/illumina-unveils-novaseq-5000-and-6000/ http://www.mrdnalab.com/illumina-novaseq.html

Solexa (Illumina) sequencing: summary

Advantages:

Best throughput, accuracy and read length for any 2nd gen. sequencer
Fast & robust library preparation

Disadvantages:

Inherent limits to read length (practically, 150bp)
Some runs are error prone

Video of Illumina sequencing https://www.youtube.com/watch?v=womKfikWlxM (5m)

Single-end vs. paired-end sequencing

Single-end sequencing: sequence one end of the DNA segment.
Paired-end sequencing: sequence both ends of a DNA segment.
- The resulting reads are "paired", separated by a certain length (the length of the DNA segment, usually a few hundred bp).
- Paired-end data can be used as single-end data, but contain extra information that is useful in some cases, e.g., detecting structural variations in the genome.
- Modeling paired-end data is more complicated.

Paired-end sequencing - a workaround to sequence longer fragments

Read one end of the molecule, flip, and read the other end
Generate a pair of reads separated by up to 500 bp, with inward orientation
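
A minimal sketch of how one fragment yields an inward-facing read pair: the first read comes from the left end of the forward strand, the second from the right end of the reverse strand, so it is the reverse complement of the fragment's last bases. The fragment and read length are toy values, and the function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str]:
    """Read `read_len` bases inward from each end of a fragment."""
    read1 = fragment[:read_len]                       # forward strand, left end
    read2 = reverse_complement(fragment[-read_len:])  # reverse strand, right end
    return read1, read2


fragment = "ACGTTGCAAGGCTTAACCGGTATC"   # toy 24-base fragment
r1, r2 = paired_end_reads(fragment, read_len=6)
print(r1, r2)   # prints ACGTTG GATACC
```

The unsequenced middle of the fragment is what separates the two reads; knowing its approximate length is the extra information paired-end data carry.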

Sequencing applications

NGS has a wide range of applications

DNA-seq: sequence genomic DNA
RNA-seq: sequence RNA products
ChIP-seq: detect protein-DNA interaction sites
Bisulfite sequencing (BS-seq): measure DNA methylation levels
Many others

NGS has largely replaced microarrays, providing better data: greater dynamic range and higher signal-to-noise ratios.

DNA-seq (Whole-Genome sequencing)

Sequence the untreated genomic DNA.
- Obtain DNA from cells, cut into small pieces, then sequence the segments.
Goals:
- Compare with the reference genome and look for genetic variants
  - Single nucleotide polymorphisms (SNPs)
  - Insertions/deletions (indels)
  - Copy number variations (CNVs)
  - Other structural variations (gene fusion, etc.)
- De novo assembly of a new genome.
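
Comparing reads with the reference can be sketched as a naive pileup: count the bases observed at each reference position and report positions where the most common read base differs from the reference. This is a deliberately simplified illustration. It assumes reads are already aligned without indels, ignores base quality and heterozygosity, and the function name and `min_depth` threshold are made up for the example.

```python
from collections import Counter


def call_snps(reference: str, aligned_reads: list[tuple[int, str]],
              min_depth: int = 3) -> list[tuple[int, str, str]]:
    """Report (position, ref_base, alt_base) where the majority read base differs."""
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue                      # too few reads to say anything
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


reference = "ACGTACGTAC"
# (0-based start, read sequence); reads are assumed to be already aligned.
reads = [(0, "ACGTAGGT"), (2, "GTAGGTAC"), (3, "TAGGTAC"), (1, "CGTACGT")]
print(call_snps(reference, reads))   # prints [(5, 'C', 'G')]
```

Three of the four reads covering position 5 carry G where the reference has C, so the position is reported; positions covered by fewer than three reads are skipped.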

Variations of DNA-seq

Targeted sequencing, e.g., exome sequencing
- Sequence the genomic DNA at targeted genomic regions
- Cheaper than whole-genome DNA-seq, so the budget can cover a larger sample size (more individuals)
- The targeted genomic regions need to be "captured" first using technologies like microarrays
Metagenomic sequencing
- Sequence the DNA of a mixture of species, mostly microbes, in order to understand the microbial environments
- The goal is to determine the number of species, their genomes, and their proportions in the population
- De novo assembly is required, but the number and proportions of species are unknown, which makes assembly challenging

RNA-seq

Sequence the "transcriptome": the set of RNA molecules

Goals

Catalogue RNA products
Determine transcriptional structures: alternative splicing, gene fusion, etc.
Quantify gene expression: the sequencing version of gene expression microarray
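
As a rough illustration of expression quantification, reads assigned to genes can be counted and scaled by the library size. The counts-per-million (CPM) scaling used here is one common normalization chosen for this sketch, not necessarily the one used in the lecture; the gene names and read assignments are hypothetical.

```python
from collections import Counter


def counts_per_million(read_to_gene: list[str]) -> dict[str, float]:
    """Turn per-read gene assignments into counts per million (CPM)."""
    counts = Counter(read_to_gene)
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical gene assignments for ten reads.
assignments = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"]
cpm = counts_per_million(assignments)
print(cpm)
```

Scaling by the total number of reads makes samples sequenced to different depths comparable; it does not account for gene length.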

ChIP-seq

Chromatin-Immunoprecipitation (ChIP) followed by sequencing (seq): sequencing version of ChIP-chip
Used to detect locations of certain "events" on the genome:
- Transcription factor binding
- DNA methylations and histone modifications
A type of "captured" sequencing: the ChIP step captures genomic regions of interest

What matters is what you feed into the sequencing machine

https://liorpachter.wordpress.com/seq/

Extra

GENCODE – Annotation Gene Features

~21,000 protein coding genes
PolyA+
- Almost completely spliced before nuclear export – co-transcriptional splicing, "first transcribed – first spliced"
- Most have at least 2 dominant splice forms
- Show allele-specific expression – potential imprinting
PolyA-
- Many are lncRNAs
- Also show allele-specific expression

https://www.gencodegenes.org/

GENCODE – Annotation Gene Features

Most (62%) of the genome is transcribed
- <5% can be identified as exons
~12,000 pseudogenes – results of duplications
- 876 are transcribed – can have regulatory function as decoys
- Infrequently spliced
~10,000 lncRNA = noncoding RNAs >200bp
- 92% are not translated
- Show tissue-specific expression – more than protein coding genes
- 33% are primate specific but few are human specific – most new genes are in this category
- Poorly spliced – most are two exon transcripts

GENCODE – Annotation Gene Features

~9000 small RNAs - many of the lncRNA transcripts are processed into stable small RNAs
- tRNA, miRNA, siRNA, snRNA, snoRNA
~82,000 – 128,000 transcription start sites - depending on detection method
- ~44% are near annotated transcripts
~5,000 RNA edits occur post transcription
- Mostly A to G(I) conversions (ADAR pathway)
- 94% are in transcribed repeat elements
  - Remaining are mostly in introns, 3’UTRs
  - Very few (123) in protein coding sequences

Ion Torrent: pH sensing of base incorporation

Platforms: Ion Torrent

Low substitution error rate, indels problematic, no paired-end reads
Inexpensive and fast turn-around for data production
Improved computational workflows for analysis

Pacific Biosciences: Long reads

Structural variant discovery
De novo genome assembly

https://www.forbes.com/forbes/2009/1005/revolutionaries-science-genomics-gene-machine.html

Pacific Biosciences: summary

Key Points:

1 DNA molecule and 1 polymerase in each well (zero-mode waveguide)
4 colors flash in real time as polymerase acts
Methylated cytosine has distinct pattern
No theoretical limit to DNA fragment length

Caveats:

Higher error rate (1-2%), but the errors are random
Lower throughput, roughly 5 gigabases per run

Nanopore sequencing

Nearly 30-year-old technology

http://www2.technologyreview.com/news/427677/nanopore-sequencing/

Nanopore sequencing

Nanopore sequencing with ONT is accurate and relatively reliable
Current yield per run: ~5 Gbp, 97% identity (i.e., 3% error rate)
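
To put the 97% identity figure in perspective, it can be converted into an expected number of errors per read and a Phred-scale quality value. The Phred conversion and the 10,000-base read length are added assumptions for illustration, not numbers from the slide.

```python
import math

identity = 0.97                    # figure quoted above
error_rate = 1 - identity          # about 0.03

# Phred scale: Q = -10 * log10(P_error). This conversion is an added illustration.
phred_q = -10 * math.log10(error_rate)

read_length = 10_000               # hypothetical read length in bases
expected_errors = error_rate * read_length

print(f"error rate {error_rate:.2%}, about Q{phred_q:.1f}")
print(f"expected errors in a {read_length}-base read: {expected_errors:.0f}")
```

At this error rate a long read still contains hundreds of errors, which is why consensus across overlapping reads matters for applications such as assembly.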

https://www.technologyreview.com/s/600887/with-patent-suit-illumina-looks-to-tame-emerging-british-rival-oxford-nanopore/ Video of Ion Torrent chemistry, http://www.youtube.com/watch?v=yVf2295JqUg (2.5m)

Nanopore sequencing

Key advantage - portability

Video of Nanopore DNA sequencing technology https://www.youtube.com/watch?v=CE4dW64x3Ts (4.5m) https://phys.org/news/2016-08-nasa-dna-sequencing-space-success.html

Nanopore for human genome sequencing

Closes 12 gaps
Phased the entire major histocompatibility complex (MHC) region, one of the most gene-dense and highly variable regions of the genome

Jain, Miten, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, et al. “Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads.” Nature Biotechnology, January 29, 2018. https://doi.org/10.1038/nbt.4060. https://www.genengnews.com/gen-exclusives/first-nanopore-sequencing-of-human-genome/77901044

Nanopore technology

Nanopore sequencing yields raw signals reflecting modulation of the ionic current at each pore by a DNA molecule.
The resulting time-series of nanopore translocation, ‘events’, are base-called by proprietary software running as a cloud service.

https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu555

Nanopore base callers

Accurate base calling is paramount: it largely determines the quality of the resulting data.
Nanonet, Albacore, Scrappie
Most modern basecallers use neural networks.

https://github.com/rrwick/Basecalling-comparison

Nanopore analysis

The resulting files for each sequenced read are stored in ‘FAST5’ format, an application of the HDF5 format.
poretools - a toolkit for analyzing nanopore sequence data.

https://github.com/arq5x/poretools https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu555

PacBio vs. Oxford Nanopore sequencing

https://blog.genohub.com/2017/06/16/pacbio-vs-oxford-nanopore-sequencing/
