Needleman-Wunsch global alignment

Needleman-Wunsch global alignmentMikhail DozmorovVirginia Commonwealth University02-17-20211 / 10

Alignment goals

Homology - sequence similarity - helps to infer functions of genes uncharacterized in one organism but known in another
Global sequence alighment (Needleman-Wunch)
Gapped local sequence alignment (Smith-Waterman)

2 / 10

Needleman-Wunsch algorithm

The problem of finding best possible alignment of two sequences is solved by Saul B. Needleman and Christian D. Wunsch in 1970
General goal is to obtain optimal global alignment between two sequences, allowing gaps
It refers as optimal matching problem or global alignment
Online demo: http://experiments.mostafa.io/public/needleman-wunsch/
Wikipedia: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

Needleman, S. B., and C. D. Wunsch. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins90057-4).” Journal of Molecular Biology, (March 1970)

3 / 10

Dynamic programming

The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard E. Bellman in 1953
Dynamic programming - solving complex problems by breaking them down into simpler subproblems, and remember subproblem solutions
- Guaranteed to provide the optimal alignment for a given set of scoring functions
- Slow due to the very large number of computational steps $O(n^2)$

4 / 10

Needleman-Wunsch algorithmIn its simplest form, assign a value \(A\) where the aligned pair consists of the same letters (nucleotides, amino acids) If the letters differ, subtract \(B\) - edit penalty.
If a gap needs to be made, subtract a gap penalty times the number of gaps

5 / 10

Steps

Initialization of the score matrix $D(i,j)$
Calculation of scores, such that

$$ D(i,j) = max \begin{cases} D(i-1,j-1) + s(i,j) \\ D(i-1,j) + g \\ D(i,j-1) + g \end{cases} $$

Where $s(i,j)$ is the substitution score for entries $i$ and $j$, and $g$ is the gap penalty

Once you construct the matrix, you can trace back the best score for an alignment

6 / 10

Best possible alignment for two sequences

The N-W algorithm is mathematically proven to find the best alignment of two sequences
N-W algorithm takes $O(n^2)$ to find best alignment of $n$ letters in two sequences
Accessing all possible alignments one by one would date $\binom{2n}{n}$, so $n^2$ is much smaller

7 / 10

Sequence searching and alignment

FASTA - a DNA and protein sequence alignment software package by David J. Lipman and William R. Pearson in 1985.
BLAST (Basic Local Alignment Search Tool) - an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences.
- Designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman
- Innovation: heuristic database search (speed), followed by optimal alignment (accuracy, statistics)

8 / 10

BLAST

BLAST is not used for NGS because it is too slow.
Format for command line version: blastall -d assemblyfasta -i genefasta -o output.blast -p blastn -e 1e-15
- -i indicates what is the gene file
- -o indicates what you want the output to be
- -p with ending "n" means nucleotide alignment -e statistical significance of alignments
Magic-BLAST is an alternative

https://ncbi.github.io/magicblast/

https://ncbiinsights.ncbi.nlm.nih.gov/2016/10/13/introducing-magic-blast/

9 / 10

BWA, Bowtie aligners

Bowtie employs Burrows-Wheeler transform (BWT) based on the full-text minute-space (FM) index. Index is built using bwa index, bowtie2-build

acaacg$ → $acaacg →(sort) $acaacg → gc$aaac
          g$acaac         aacg$ac
          cg$acaa         acaacg$
          acg$aca         acg$aca
          aacg$ac         caacg$a
          caacg$a         cg$acaa
          acaacg$         g$acaac

Bowtie has a memory footprint of 1.3 GB for the human genome. Very fast.
The last first mapping can transform it back to the original sequence.

10 / 10

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help