Alignment introduction

Alignment introductionMikhail DozmorovVirginia Commonwealth University02-17-20211 / 30

The genomics workflow

http://www.jpathinformatics.org/viewimage.asp?img=JPatholInform_2012_3_1_40_103013_u3.jpg

Alignment goals

Alignment - the process by which we discover how or where the read sequence is similar to the reference sequence. Finding best match of the read sequence within the reference sequence.

The human reference genome is big and complex (~3.2 billion bases)
Sequencing data is big and complex (~1 billion short reads/run)
Must find a home to each short read in the reference genome

Alignment goals

Poisson discribution

y5  <- dpois(x, 5); y10 <- dpois(x, 10); y15 <- dpois(x, 15)
plot(x, y5, col = 1); lines(x, y10); lines(x, y15)

Estimating library complexity with a Poisson model

For Poisson sampling, we can write the (truncated) distribution over $x_{i}$ the times we sequence the $i^{t h}$ molecule as: $P r (x_{i} | λ) = \frac{1}{K_{L, R} (λ)} * \frac{e^{- λ} λ^{x_{i}}}{x_{i}!}$ $K_{L, R} (λ) = \sum_{x = L}^{R} P r (x_{i} | λ)$ (The probability is $0$ if $x_{i}$ is less than $L$ or greater than $R$ )
We can estimate the maximum likelihood rate parameter $λ$ from a vector of observations $x$

Maximum likelihood library size

$K_{L, R} (λ) = \sum_{x = L}^{R} P r (x_{i} | λ)$

$M$ unique sequences observed, maximum likelihood library size is

$\hat{C} = \frac{M}{K_{L, R} (λ)}$

Approximate solution

$\hat{C} = \frac{M}{1 - P o i s s o n (0, λ)}$

Problem with Poisson distribution

Poisson Library Complexity model 150 '1000 Genome' Datasets
Estimate library complexity from 10% of uniformly sampled reads vs. from all reads

[height=130px]{img/library_complexity1.png}

Poisson $λ = M e a n = V a r i a n c e$

Gamma distributed sampling rates describe the entire population (library preparation)

\bigbreak

Poisson sampling to form a smaller sample (sequencing)

\bigbreak

Negative binomial distribution characterizes the resulting occurrence histogram

[height=220px]{img/gamma_poisson_negative_binomial.png}

\end{columns}`

The gamma distribution is a "conjugate prior" for the Poission distribution

$P o i s s o n (x; λ) = \frac{λ^{x} e^{- λ}}{x!}$

$G a m m a (x, α, β) = \frac{β^{α} x^{α - 1} e^{- β x}}{Γ (α)}$

$N B (y; α, β) = \int_{0}^{\infty} P o i s s o n (y; x) G a m m a (x; α, β) d x$

Negative Binomial model for sequence occurrences

$C$ - library complexity (latent, fit to observed data)
$N$ - number of reads
$M$ - total number of unique sequences
$λ = N / C$
$k$ - dispersion (latent, fit to observed data)

$P r (x_{i}, λ, k) = N B (x_{i} | λ, k) = N B (x_{i} | n, p)$

$p = λ / (λ + 1 / k)$
$n = 1 / k$

Simulation results show that the Gamma Possion works well for non-uniform libraries

True library complexity: 1M unique molecules
Vary $k$ (controls sampling rate variance)
Given 100K reads ($\lambda=0.1$), assess estimates from both models

k=0.1\ \ \ Poisson: 0.93M\ \ \ \ GP: 0.96M
k=1\ \ \ \ \ Poisson: 0.52M\ \ \ \ GP: 1.01M
k=10\ \ \ \ Poisson: 0.12M\ \ \ \ GP: 1.10M
k=20\ \ \ \ Poisson: 0.07M\ \ \ \ GP: 0.68M

Negative Binomial Library Complexity better model 150 '1000 Genome' Datasets

[height=170px]{img/library_complexity2.png}

Data are "overdispersed" (variance greater than mean)

Marginal value of additional sequencing

$C$ – library complexity (latent – estimated)
$N$ – number of reads
$M$ – number of unique sequences

$M$ can be estimated by $(1 - N e g a t i v e B i n o m i a l (0 | λ, k)) * C$

Assume we have $r$ more reads $s = (N + r) / N$
Replace $λ$ by $s * λ$ to estimate $M^{'}$ achieved with $r$ more reads

Marginal value of additional sequencing

[height=190px]{img/marginal_sequencing_value.png}

End: MIT7_91JS14_Lecture5.pdf - BWT, FM index, https://www.youtube.com/watch?v=P3ORBMon8aw&list=PLUl4u3cNGP63uK-oWiLgO7LLJV6ZCWXac&index=5 -->

2 / 30

Genome Assembly Algorithms

Problem: Exact String Matching

Input: A text string $T$ , where $‖ T ‖ = n$ , and a pattern string $P$ , where $‖ P ‖ = m$
Output: An index $i$ such that $T_{i + j - 1} = P_{j}$ for all $1 \leq j \leq m$ , i.e. showing that $P$ is a substring of $T$

3 / 30

Analysis

This algorithm might use only $n$ steps if we are lucky, e.g. $T = a a a a a a a a a a a$ , and $P = b b b b b b b$
We might need $\sim n \times m$ steps if we are unlucky, e.g. $T = a a a a a a a a a a a$ , and $P = a a a a a a b$
We can’t say what happens "in practice", so we settle for a worst case scenario (longest possible running time)

4 / 30

Analysis

This algorithm might use only $n$ steps if we are lucky, e.g. $T = a a a a a a a a a a a$ , and $P = b b b b b b b$
We might need $\sim n \times m$ steps if we are unlucky, e.g. $T = a a a a a a a a a a a$ , and $P = a a a a a a b$
We can’t say what happens "in practice", so we settle for a worst case scenario (longest possible running time)
By being more clever, we can reduce the worst case to $O (n m)$
Certain generalizations won’t change this, like stopping after the first occurrence of the pattern
Certain other generalizations seem more complicated, like matching with gaps

4 / 30

Algorithm Complexity

We use the "Big oh" notation to state an upper bound on the number of steps that an algorithm takes in the worst case. Thus the brute force string matching algorithm is $O (n m)$ , or takes quadratic time

A linear time algorithm, i.e. $O (n + m)$ , is fast enough for almost any application
A quadratic time algorithm is usually fast enough for small problems, but not big ones, since $1000^{2} = 1, 000, 000$ steps is reasonable but $1, 000, 000^{2}$ is not
An exponential-time algorithm, i.e. $O (2^{n})$ or $O (n!)$ , can only be fast enough for tiny problems, since $2^{20}$ and $10!$ are already up to $1, 000, 000$

5 / 30

Algorithm Complexity

Unfortunately, for many alignment problems, there is no known polynomial algorithm
Even worse, most of these problems can be proven NP-complete, meaning that no such algorithm can exist!

6 / 30

String graph

Alignments that may be transitively inferred from all pairwise alignments are removed
A graph is created with a vertex for the endpoint of every read
Edges are created both for each unaligned interval of a read and for each remaining pairwise overlap
Vertices connect edges that correspond to the reads that overlap
When there is allelic variation, alternative paths in the graph are formed

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/

7 / 30

String graph

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/

8 / 30

Real-world assembly methods

OLC - Overlap-Layout-Consensus assembly
DBG - De Bruijn graph assembly
Both handle unresolvable repeats by essentially leaving them out
Unresolvable repeats break the assembly into fragments Fragments are contigs (short for contiguous)

9 / 30

Overlap‐layout‐consensus (OLC)

10 / 30

Overlap graph formulation

Treat each sequence as a "node"
Draw an edge between two nodes if there is significant overlap between the two sequences
Hopefully the contig covers all or large number of sequences, once for each sequence
In other words, we are looking for Hamiltonian path in the overlap graph
Pros: straightforward formulation
Cons: no efficient accurate algorithm; repeats

11 / 30

Overlap‐layout‐consensus (OLC)

All pairwise alignments (arrows) between reads (solid bars) are detected.
Reads are merged into contigs (below the vertical arrow) until a read at a repeat boundary (split colour bar) is detected, leading to a repeat that is unresolved and collapsed into a single copy.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/

12 / 30

de Bruijn assembly

Reads are decomposed into overlapping k‐mers
Identical k‐mers are merged and connected by an edge when appearing adjacently in reads
Contigs are formed by merging chains of k‐mers until repeat boundaries are reached
If a k‐mer appears in multiple positions (red segment) in the genome, it will fragment assemblies and additional graph operations must be applied to resolve such small repeats
The k‐mer approach is ideal for short‐read data generated by massively parallel sequencing (MPS)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/

13 / 30

de Bruijn assembly

An example of the decomposition for k = 3 nucleotides is shown, although in practice k ranges between 31 and 200 nucleotides

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/

14 / 30

de Bruijn assembly problems

Erroneous data create three types of graph structures:

"tips" due to errors at the edges of reads
"bulges" due to internal read errors or to nearby tips connecting
erroneous connections due to cloning errors or to distant merging tips

15 / 30

Velvet: de novo assembly using very short reads

16 / 30

Velvet: de novo assembly using de Bruijn graph

https://www.ebi.ac.uk/~zerbino/velvet/

Zerbino, Daniel R., and Ewan Birney. "Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs." Genome Research, (May 2008)

17 / 30

Alignment problems

The genome being sequenced contains genomic variants
Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency from 0.5 – 2%. Indels occur roughly 10 times less frequently
Strand orientation is unknown
Computers excel at finding exact matches. Errors should be explicitly handled
"Fuzzy" pattern matching is much more computationally expensive

18 / 30

Alignment problems

> 50% of human genome is repeats - a major problem for fragment assembly
Over 1 million Alu repeats (about 300 bp)
About 200,000 LINE repeats (1000 bp and longer)

19 / 30

Alignment with repeats

20 / 30

Gaps

Sequence-coverage gaps - Sequencing gaps occur, under the simplest condition, where no sequence reads have been sampled for a particular portion of the genome
Segmental duplication-associated gaps - Over one-third (206/540) of the euchromatic gaps in the human reference genome (GRCh38) are flanked by large, highly identical segmental duplications
Satellite-associated gaps - These include short and long runs of tandem repeats designated as short tandem repeats (STRs; also known as microsatellites), variable number of tandem repeats (VNTRs; also known as macrosatellites) and Mb-sized centromeric satellite repeats

21 / 30

Gaps

Muted gaps - Muted gaps are defined as regions that are inadvertently closed in an assembly but that actually show additional or different sequences in the vast majority of individuals
Allelic variation gaps - Some regions of a genome also show extraordinary patterns of allelic variation, often reflecting deep evolutionary coalescence

22 / 30

Gaps

http://www.nature.com/nrg/journal/v16/n11/full/nrg3933.html

23 / 30

Coverage

The coverage of a sequencing project is the ratio of the total sequenced fragment length to the genome length, i.e. $n l / T$
Gaps are very difficult and expensive to close in any sequencing strategy, meaning that very high coverage is necessary to use shotgun sequencing on a large genome

24 / 30

Evaluating Assemblies

Coverage is a measure of how deeply a region has been sequenced
The Lander-Waterman model predicts 8-10 fold coverage is needed to minimze the number of contigs for a 1 Mbp genome
The N50 length is a statistics in genomics defined as the shortest contig at which half of the total length of the assembly is made of contigs of that length or greater
It is commonly used as a metric to summarize the contiguity of an assembly

25 / 30

Longer sequencing to complete human genomes

Human genome is incomplete - ~160 gaps in euchromatin
~55% of them have been closed using Oxford Nanopore technology

https://www.nature.com/nature/journal/v517/n7536/full/nature13907.html

26 / 30

Improving the Human Reference Genome(s)

http://genome.wustl.edu/projects/detail/reference-genomes-improvement/

Jain, Miten, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, et al. “Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads.” Nature Biotechnology, January 29, 2018. https://doi.org/10.1038/nbt.4060. - MinION nanopore sequencing to sequence human genome. Closed 12 gaps, fully typed MHC region. PCR-free sequencing preserves epigenetic modifications. Canu genome assembler. GraphMap, Minimap2 for mapping long reads. SVTyper, LUMPY for structural variants.

https://www.genengnews.com/gen-exclusives/first-nanopore-sequencing-of-human-genome/77901044

27 / 30

Longer reads - more errors

The increased read length and error rate of single-molecule sequencing has challenged genome assembly programs originally designed for shorter, highly accurate reads
Several new approaches have been developed to address this, roughly categorized as hybrid, hierarchical, or direct

28 / 30

Longer reads - more errors

Hybrid methods use single-molecule reads to reconstruct the long-range structure of the genome, but rely on complementary short reads for accurate base calls
Hierarchical methods do not require a secondary technology and instead use multiple rounds of read overlapping (alignment) and correction to improve the quality of the single-mol- ecule reads prior to assembly
Direct methods attempt to assemble single-molecule reads from a single overlapping step without any prior correction

Hierarchical strategy is the most suitable to produce continuous de novo assembly

29 / 30

Global and local alignment

Global alignment - comparing two sequences over their entire length, identifying long insertion and deletion. Identifying every mutation in your sequences
Local alignment - comparing sequences with partial homology. Making high-quality alignment

30 / 30

The genomics workflow

http://www.jpathinformatics.org/viewimage.asp?img=JPatholInform_2012_3_1_40_103013_u3.jpg

Alignment goals

Alignment - the process by which we discover how or where the read sequence is similar to the reference sequence. Finding best match of the read sequence within the reference sequence.

The human reference genome is big and complex (~3.2 billion bases)
Sequencing data is big and complex (~1 billion short reads/run)
Must find a home to each short read in the reference genome

Alignment goals

Poisson discribution

y5  <- dpois(x, 5); y10 <- dpois(x, 10); y15 <- dpois(x, 15)
plot(x, y5, col = 1); lines(x, y10); lines(x, y15)

Estimating library complexity with a Poisson model

For Poisson sampling, we can write the (truncated) distribution over $x_{i}$ the times we sequence the $i^{t h}$ molecule as: $P r (x_{i} | λ) = \frac{1}{K_{L, R} (λ)} * \frac{e^{- λ} λ^{x_{i}}}{x_{i}!}$ $K_{L, R} (λ) = \sum_{x = L}^{R} P r (x_{i} | λ)$ (The probability is $0$ if $x_{i}$ is less than $L$ or greater than $R$ )
We can estimate the maximum likelihood rate parameter $λ$ from a vector of observations $x$

Maximum likelihood library size

$K_{L, R} (λ) = \sum_{x = L}^{R} P r (x_{i} | λ)$

$M$ unique sequences observed, maximum likelihood library size is

$\hat{C} = \frac{M}{K_{L, R} (λ)}$

Approximate solution

$\hat{C} = \frac{M}{1 - P o i s s o n (0, λ)}$

Problem with Poisson distribution

Poisson Library Complexity model 150 '1000 Genome' Datasets
Estimate library complexity from 10% of uniformly sampled reads vs. from all reads

[height=130px]{img/library_complexity1.png}

Poisson $λ = M e a n = V a r i a n c e$

Gamma distributed sampling rates describe the entire population (library preparation)

\bigbreak

Poisson sampling to form a smaller sample (sequencing)

\bigbreak

Negative binomial distribution characterizes the resulting occurrence histogram

[height=220px]{img/gamma_poisson_negative_binomial.png}

\end{columns}`

The gamma distribution is a "conjugate prior" for the Poission distribution

$P o i s s o n (x; λ) = \frac{λ^{x} e^{- λ}}{x!}$

$G a m m a (x, α, β) = \frac{β^{α} x^{α - 1} e^{- β x}}{Γ (α)}$

$N B (y; α, β) = \int_{0}^{\infty} P o i s s o n (y; x) G a m m a (x; α, β) d x$

Negative Binomial model for sequence occurrences

$C$ - library complexity (latent, fit to observed data)
$N$ - number of reads
$M$ - total number of unique sequences
$λ = N / C$
$k$ - dispersion (latent, fit to observed data)

$P r (x_{i}, λ, k) = N B (x_{i} | λ, k) = N B (x_{i} | n, p)$

$p = λ / (λ + 1 / k)$
$n = 1 / k$

Simulation results show that the Gamma Possion works well for non-uniform libraries

True library complexity: 1M unique molecules
Vary $k$ (controls sampling rate variance)
Given 100K reads ($\lambda=0.1$), assess estimates from both models

k=0.1\ \ \ Poisson: 0.93M\ \ \ \ GP: 0.96M
k=1\ \ \ \ \ Poisson: 0.52M\ \ \ \ GP: 1.01M
k=10\ \ \ \ Poisson: 0.12M\ \ \ \ GP: 1.10M
k=20\ \ \ \ Poisson: 0.07M\ \ \ \ GP: 0.68M

Negative Binomial Library Complexity better model 150 '1000 Genome' Datasets

[height=170px]{img/library_complexity2.png}

Data are "overdispersed" (variance greater than mean)

Marginal value of additional sequencing

$C$ – library complexity (latent – estimated)
$N$ – number of reads
$M$ – number of unique sequences

$M$ can be estimated by $(1 - N e g a t i v e B i n o m i a l (0 | λ, k)) * C$

Assume we have $r$ more reads $s = (N + r) / N$
Replace $λ$ by $s * λ$ to estimate $M^{'}$ achieved with $r$ more reads

Marginal value of additional sequencing

[height=190px]{img/marginal_sequencing_value.png}

End: MIT7_91JS14_Lecture5.pdf - BWT, FM index, https://www.youtube.com/watch?v=P3ORBMon8aw&list=PLUl4u3cNGP63uK-oWiLgO7LLJV6ZCWXac&index=5 -->

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Alignment introduction

Mikhail Dozmorov

Virginia Commonwealth University

02-17-2021

The genomics workflow

Alignment goals

Alignment goals

Poisson discribution

Estimating library complexity with a Poisson model

Maximum likelihood library size

Problem with Poisson distribution

The gamma distribution is a "conjugate prior" for the Poission distribution

Negative Binomial model for sequence occurrences

Simulation results show that the Gamma Possion works well for non-uniform libraries

Negative Binomial Library Complexity better model 150 '1000 Genome' Datasets

Marginal value of additional sequencing

Marginal value of additional sequencing

Genome Assembly Algorithms

Analysis

Analysis

Algorithm Complexity

Algorithm Complexity

String graph

String graph

Real-world assembly methods

Overlap‐layout‐consensus (OLC)

Overlap graph formulation

Overlap‐layout‐consensus (OLC)

de Bruijn assembly

de Bruijn assembly

de Bruijn assembly problems

Velvet: de novo assembly using very short reads

Velvet: de novo assembly using de Bruijn graph

Alignment problems

Alignment problems

Alignment with repeats

Gaps

Gaps

Gaps

Coverage

Evaluating Assemblies

Longer sequencing to complete human genomes

Improving the Human Reference Genome(s)

Longer reads - more errors

Longer reads - more errors

Global and local alignment

The genomics workflow

Alignment goals

Alignment goals

Poisson discribution

Estimating library complexity with a Poisson model

Maximum likelihood library size

Problem with Poisson distribution

The gamma distribution is a "conjugate prior" for the Poission distribution

Negative Binomial model for sequence occurrences

Simulation results show that the Gamma Possion works well for non-uniform libraries

Negative Binomial Library Complexity better model 150 '1000 Genome' Datasets

Marginal value of additional sequencing

Marginal value of additional sequencing

Help