
Approximate matching

Mikhail Dozmorov

Virginia Commonwealth University

02-17-2021

1 / 13

Alignment

Measures of Sequence Similarity

  • For two sequences s1 and s2, we need to define their distance d(s1, s2)

  • The greater the distance, the less similar the two sequences are.

    • d(s, s) = 0: identical sequences have zero distance
2 / 13

Hamming distance

For X and Y, where |X| = |Y|, Hamming distance = minimum number of substitutions needed to turn one into the other
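A minimal sketch of the definition above. The function name `hamming_distance` and the example strings are illustrative and not part of the original slides.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires |X| == |Y|")
    # Each differing position needs exactly one substitution.
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("GATTACA", "GACTATA"))  # 2
```

Because only substitutions are allowed, the minimum is simply the number of mismatched positions.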

3 / 13

Edit distance

aka Levenshtein distance

For X and Y, edit distance = minimum number of edits (substitutions, insertions, deletions) needed to turn one into the other
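One common way to compute this is dynamic programming over prefixes of X and Y. The sketch below assumes unit cost for every edit; the function name `edit_distance` is illustrative.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit-cost substitutions, insertions, deletions."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Unlike Hamming distance, this is defined for strings of different lengths, and it is never larger than the Hamming distance when the lengths are equal.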

4 / 13

Example of a simple scoring matrix

     A    G    C    T
A    1   -1   -1   -1
G   -1    1   -1   -1
C   -1   -1    1   -1
T   -1   -1   -1    1

Matches are scored as 1, mismatches as -1

5 / 13

Example of penalty matrix

Suppose your empirically defined matrix is

     A    G    C    T
A   10   -1   -3   -4
G   -1    7   -5   -3
C   -3   -5    9    0
T   -4   -3    0    8

with gap penalty = -5

Apply it to the following alignment of a read to a reference:

Read: AGACTAGTTAC
Ref: CGA---GACGT

-3+7+10+(3)(-5)+7-4+0-1+0 = 1
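The sum above can be reproduced with a short sketch. It assumes a linear gap penalty (each gapped column costs -5, as in the slide's arithmetic); the names `SCORE`, `GAP`, and `alignment_score` are illustrative.

```python
BASES = "AGCT"
ROWS = [
    [10, -1, -3, -4],  # A
    [-1,  7, -5, -3],  # G
    [-3, -5,  9,  0],  # C
    [-4, -3,  0,  8],  # T
]
# Look-up table: SCORE[("A", "G")] == -1, and so on.
SCORE = {(a, b): ROWS[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}
GAP = -5  # linear gap penalty: charged once per gapped column


def alignment_score(read: str, ref: str, score=SCORE, gap=GAP) -> int:
    """Sum column scores of a pairwise alignment written with '-' for gaps."""
    if len(read) != len(ref):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(read, ref):
        total += gap if "-" in (a, b) else score[(a, b)]
    return total


print(alignment_score("AGACTAGTTAC", "CGA---GACGT"))  # 1
```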

6 / 13

Scoring Schemes

For applications to molecular biology, recognize that certain changes are more likely to occur naturally than others.

For example, amino acid substitutions tend to be conservative: the replacement of one amino acid by another with similar size or physicochemical properties is more likely to have occurred than its replacement by an amino acid with very different properties. Similarly, the deletion of a run of contiguous bases or amino acids is a more probable event than the independent deletion of the same number of bases or amino acids at noncontiguous positions in the sequences.

We may wish to assign variable weights to different edit operations.

7 / 13

Scoring Schemes in real life

  • Transition mutations (A↔G and C↔T, i.e., purine to purine or pyrimidine to pyrimidine) are more common than transversions (purine ↔ pyrimidine: A or G exchanged with C or T)

  • A substitution matrix that reflects this:
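As an illustration only, a matrix of this kind can be generated by penalizing transversions more heavily than transitions. The specific values (+1 match, -1 transition, -2 transversion) are hypothetical and are not taken from the original slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a: str, b: str,
                       match: int = 1,
                       transition: int = -1,
                       transversion: int = -2) -> int:
    """Score a pair of bases; the default weights are hypothetical."""
    if a == b:
        return match
    pair = {a, b}
    # A transition stays within one chemical class of bases.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


# Print the full 4x4 matrix, one row per base.
for a in "AGCT":
    print(a, [substitution_score(a, b) for b in "AGCT"])
```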

8 / 13

Pioneer of Computational Biology - Margaret Dayhoff

  • Trained in math and quantum chemistry

  • Associate director of the newly-formed National Biomedical Research Foundation

  • Wrote seminal FORTRAN programs that derive amino acid sequences from the partial overlaps of fragmented amino acid sequences

  • PAM (Point accepted mutation) matrices

  • Realized the applications to nucleic acids and gene sequences

9 / 13

PAM250

  • A similarity matrix frequently used to score aligned peptide sequences and quantify how similar they are

  • Derived from comparing aligned sequences of proteins with known homology and determining the "point accepted mutations" (PAM) observed

  • The frequencies of these mutations are reported in the table as a "log-odds matrix": $M_{ij} = 10 \log_{10} R_{ij}$, where $M_{ij}$ is the matrix element and $R_{ij}$ is the probability of that substitution as observed in the database, divided by the normalized frequency of occurrence of amino acid $i$.
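A small numeric sketch of the log-odds formula above. Rounding to the nearest integer is an assumption made here for readability, and the input ratios are made-up values rather than entries from the PAM250 data.

```python
import math


def log_odds_score(r: float) -> int:
    """M = 10 * log10(R), rounded to the nearest integer (rounding is an assumption)."""
    if r <= 0:
        raise ValueError("the odds ratio R must be positive")
    return round(10 * math.log10(r))


# R > 1: substitution seen more often than expected -> positive score.
# R < 1: substitution seen less often than expected -> negative score.
for r in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(r, log_odds_score(r))
```

On this scale a score of 0 means the substitution occurs exactly as often as expected, and every 10 points corresponds to a tenfold change in the odds ratio.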

10 / 13

PAM250

11 / 13

BLOSUM - BLOcks SUbstitution Matrix

Steven Henikoff and Jorja Henikoff developed the family of BLOSUM matrices for scoring substitutions in amino acid sequence comparisons. Their goal was to replace the Dayhoff matrix with one that would perform best in identifying distant relationships, making use of the much larger amount of data that had become available since Dayhoff's work.

The BLOSUM matrices are based on the BLOCKS database of aligned protein sequences; hence the name BLOcks SUbstitution Matrix. From regions of closely related proteins alignable without gaps, the Henikoffs calculated the ratio of the number of observed pairs of amino acids at any position to the number of pairs expected from the overall amino acid frequencies. As in the Dayhoff matrix, the results are expressed as log-odds.
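The observed-to-expected ratio can be sketched on toy data. The pair counts below are invented for illustration, and the scaling `2 * log2(...)` (half-bit units, as conventionally used for BLOSUM) is an assumption not stated in the slides.

```python
import math
from collections import Counter


def log_odds_from_pair_counts(pair_counts):
    """Map unordered residue pairs to log-odds scores (half-bit units assumed)."""
    total = sum(pair_counts.values())
    # Observed probability of each unordered pair.
    observed = Counter()
    for (a, b), n in pair_counts.items():
        observed[tuple(sorted((a, b)))] += n / total
    # Overall residue frequencies implied by the pairs.
    freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            freq[a] += q
        else:
            freq[a] += q / 2
            freq[b] += q / 2
    scores = {}
    for (a, b), q in observed.items():
        # Expected probability if residues paired independently.
        expected = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = 2 * math.log2(q / expected)
    return scores


toy_counts = {("A", "A"): 26, ("A", "S"): 8, ("S", "S"): 6}  # hypothetical
for pair, s in log_odds_from_pair_counts(toy_counts).items():
    print(pair, round(s))
```

Pairs seen more often than chance predicts get positive scores, and pairs seen less often get negative ones.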

To avoid overweighting closely related sequences, the Henikoffs replaced groups of proteins whose sequence identities exceed a threshold with either a single representative or a weighted average. A threshold of 62% produces the commonly used BLOSUM62 substitution matrix. It is widely offered as an option by alignment programs and is the default in many of them.

12 / 13

BLOSUM - BLOcks SUbstitution Matrix

Based on conserved blocks in which sequences that are at least X% identical are merged before counting substitutions

Matrices for more divergent proteins are derived using a lower X%

  • BLOSUM62 - sequences having at least 62% identity are merged together
  • BLOSUM30 - sequences having at least 30% identity are merged together
  • BLOSUM90 - sequences having at least 90% identity are merged together
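The merging step can be sketched as clustering by percent identity. The sketch assumes ungapped, equal-length block segments and single-linkage grouping (a sequence joins a cluster if it reaches the threshold against any member); both function names and the toy sequences are illustrative.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions in two ungapped, equal-length segments."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs, threshold: float):
    """Group sequences; any pair at or above the threshold links two clusters."""
    clusters = []
    for s in seqs:
        merged, kept = [s], []
        for c in clusters:
            if any(percent_identity(s, m) >= threshold for m in c):
                merged.extend(c)  # s links to this cluster: absorb it
            else:
                kept.append(c)
        clusters = kept + [merged]
    return clusters


block = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]  # toy data
print(len(cluster_by_identity(block, 90)))  # 2
print(len(cluster_by_identity(block, 95)))  # 4
```

A lower threshold merges more sequences, so fewer close pairs are counted and the resulting matrix reflects more divergent relationships.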
13 / 13
