Big Data: string similarity: mind the Gap. The Affine Gap.


In my previous blog posts we have learned that

  • Crts is similar to Cats – so our autocorrect software might choose this word from vocabulary to correct Crts (Levenstein distance)
  • Davod1ee to DavidLee – so we could do good guess that this was a mention of our professor David Lee (Needleman-Wunsch measure)

Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?

Let’s start with looking why classics is not enough:

  • EDIT DISTANCE (Levenshtein distance) = 16 (mammamia! Who said they are similar?)

Affine-0.PNG

(http://www.convertforfree.com/levenshtein-distance-calculator/)

  • Needleman-Wunsch measure = -6 (scoring match = 2; mismatch = -2; gap = -2) – mnja. PICADILLA-CROCODILE has better NW measure -4, are they really more likely to be similar?

Affine-1.PNG

(http://experiments.mostafa.io/public/needleman-wunsch/)

The Affine Gap measure solves this problem – continuing the gap has less penalty than opening a gap, and we can even modify that the longer gap is continued, the lesser penalty becomes

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension to Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved, and each symbol gap was penalised equally and summary cost of any gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x -2 = penalty -10, J to JOHAN is  4 x -2 = penalty -8 for this gap, and two more penalties for other gaps S to SS, O to ON, so the total penalty for all gaps was 11 x -2 = -22!

In real life this is a very common situation data have gaps longer than one character, and affine gap algorithm distinguishes

  • cost of opening the gap of any length (applied to the first character of gap, like A)
  • cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called extension because it keeps the match and mismatch scoring system idea from Needleman-Wunsch, and adds sophisticated rules for gaps.

NB: the affine gap algorithm result will consider this alignment better

Affine-2

than

Affine-3.PNG

because second option opens more gaps (remember, opening a gap has higher penalty)).

For example:

  • gap open score = -3
  • gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x -3 + 4 x -1 = -7, J to JOHAN is  1 x -3 + 3 x -1 = penalty -6 for this gap, and one more penalty for SO to SON -3, so the total penalty for all gaps was -7 + -6 + -3 = -16 (see, even with greater score for openo=ing a gap it is better than -22 when all the gaps were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in details – better try understand the business behind and. when needed, just follow formulas :) If I have done the full calculations, I would have for more fun considered scoring where

  • mismatch (changing one symbol to other) costs by default 2
  • mismatch – changing L to E or E to L costs 1 (less because they look similar – I could add any score for any pair of symbols, as well to bonus O and Q, as to punish Q and W)
  • match of symbols costs 5 (it is most preferrable situation for me).

Have a look on formulas used – there are three matrices involved:

Affine-formula

(source: http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf)

Affine-formula2.PNG

(source: https://www.cs.umd.edu/class/fall2010/cmsc423/lectures/Lec04-gaps.pdf)

Affine-formula3.PNG

(source: https://users.info.uvt.ro/~dzaharie/bioinfo2016/proiecte/biblio/MultipleAlignment%2BDP/MultipleAlignment%2BDP.pdf)

When you understand the business behind, the formulas will not seem so complex: a lot combinations possible when we differ gaps starting and extending costs.

a symbol can match a symbol in other string and do not start or close any gaps

A

A

a symbol can mismatch a symbol in the other string however do not start a gap in the other string

because of scoring system used (like we can set that O is the same as 0 or L mismatch to E is not as good as match, but at the same time better than starting a gap)

AO

A0

a symbol in string can start a gap in the other string

AB

A

a symbol in string can continue a gap in the other string

ABC

A – 

a symbol in string can close a gap in this string and do not start a gap the other string

Q – – D

QABD

a symbol in string can close a gap in this string and can start a gap in the other string

– – M

AB – 

NB that we are not considering gap to gap, thus it is not possible situation like

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues gap in both strings (symbol cannot be both symbol and a gap)”.

Some paragraphs above I wrote that affine gap prefers SSON to -SO-, not to S-O- – this is done by performing calculations always for both strings and both operations:

we calculate not only the score of opening a gap, but also score of ending the gap.

Then, when backtracing the optimal path, we walk through SSON to -SO- path.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: