Approximate matching

Approximate matchingMikhail DozmorovVirginia Commonwealth University02-17-20211 / 13

Alignment

Measures of Sequence Similarity

For two sequences s1 and s2, we need to define their distance d(s1, s2)
The greater the distance, the less similar between these two sequences.
- d(s,s) = 0 for fully similar sequences

2 / 13

Hamming distance

For X and Y, where |X|=|Y|, hamming distance = minimum number of substitutions needed to turn one into the other

3 / 13

Edit distance

aka Levenshtein distance

For X and Y, edit distance = minimum number edits (substitutions, insertions, deletions) needed to turn one into the other

4 / 13

Example of a simple scoring matrix

   A  G  C  T
A  1 -1 -1 -1
G -1  1 -1 -1
C -1 -1  1 -1
T -1 -1 -1  1

Matches are score as 1, mismatches - as -1

5 / 13

Example of penalty matrix

Suppose your empirically defined matrix is

   A  G  C  T 
A 10 -1 -3 -4 
G -1  7 -5 -3 
C -3 -5  9  0 
T -4 -3  0  8

with gap penalty= -5

Apply it to the secong sequence:

Read: AGACTAGTTAC
Ref:  CGA---GACGT

-3+7+10+(3)(-5)+7-4+0-1+0 = 1

6 / 13

Scoring Schemes

For applications to molecular biology, recognize that certain changes are more likely to occur naturally than others

For example, amino acid substitutions tend to be conservative: the replacement of one amino acid by another with similar size or physicochemical properties is more likely to have occurred than its replacement by another amino acid with greater difference in their properties. Or, the deletion of a succession of contiguous bases or amino acids is a more probable event than the independent deletion of the same number of bases or amino acids at noncontiguous positions in the sequences

We may wish to assign variable weights to different edit operations.

7 / 13

Scoring Schemes in real life

Transition mutations (a-g and t-c) are more common than transversions ((a, g)-(t, c))
A substitution matrix that reflects this:

8 / 13

Pioneer of Comp. Biology - Margeret Dayhoff

Trained in math and quantum chemistry
Associate director of the newly-formed National Biomedical Research Foundation
Wrote seminal FORTRAN programs to derive amino acids sequences by using partial overlaps of fragmented amino acid sequences
PAM (Point accepted mutation) matrices
Realized the applications to nucleic acids and gene sequences

9 / 13

PAM250

The similarity matrix is frequently used to score aligned peptide sequences to determine the similarity of those sequences
Derived from comparing aligned sequences of proteins with known homology and determining the "point accepted mutations" (PAM) observed
The frequencies of these mutations are in this table as a "log odds-matrix" where: $M_{i j} = 10 (l o g_{10} R_{i j})$ , where $M_{i j}$ is the matrix element and $R_{i j}$ is the probability of that substitution as observed in the database, divided by the normalized frequency of occurence for amino acid $i$ .

10 / 13

PAM250

11 / 13

BLOSUM- BLOcks Substitution Matrix

Steven Henikoff and Jorja Henikoff developed the family of BLOSUM matrices for scoring substitutions in amino acid sequence comparisons. Their goal was to replace the Dayhoff matrix with one that would perform best in identifying distant relationships, making use of the much larger amount of data that had become available since Dayhoff's work

The BLOSUM matrices are based on the BLOCKS database of aligned protein sequences; hence the name BLOcks SUbstitution Matrix. From regions of closely-related proteins alignable without gaps, Henikoff calculated the ratio, of the number of observed pairs of amino acids at any position, to the number of pairs expected from the overall amino acid frequencies. As in the Dayhoff matrix, the results are expressed as log-odds

In order to avoid overweighting closely-related sequences, the Henikoffs replaced groups of proteins that have sequence identities higher than a threshold by either a single representative or a weighted average. The threshold 62% produces the commonly used BLOSUM62 substitution matrix. This is offered by all programs as an option and is the default in most

12 / 13

BLOSUM- BLOcks Substitution Matrix

Based on conserved blocks bounded in similarity (at least X% identical)

Matrices for divergent proteins are derived using appropriate X%

BLOSUM62 - sequences having at least 62% identity are merged together
BLOSUM30 - sequences having at least 30% identity are merged together
BLOSUM90 - sequences having at least 90% identity are merged together

13 / 13

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help