## Big Data: string similarity: mind the Gap. The Affine Gap.

In my previous blog posts we have learned that

• Crts is similar to Cats – so our autocorrect software might choose this word from vocabulary to correct Crts (Levenstein distance)
• Davod1ee to DavidLee – so we could do good guess that this was a mention of our professor David Lee (Needleman-Wunsch measure)

Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?

• EDIT DISTANCE (Levenshtein distance) = 16 (mammamia! Who said they are similar?)

• Needleman-Wunsch measure = -6 (scoring match = 2; mismatch = -2; gap = -2) – mnja. PICADILLA-CROCODILE has better NW measure -4, are they really more likely to be similar?

### The Affine Gap measure solves this problem – continuing the gap has less penalty than opening a gap, and we can even modify that the longer gap is continued, the lesser penalty becomes

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension to Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved, and each symbol gap was penalised equally and summary cost of any gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x -2 = penalty -10, J to JOHAN is  4 x -2 = penalty -8 for this gap, and two more penalties for other gaps S to SS, O to ON, so the total penalty for all gaps was 11 x -2 = -22!

In real life this is a very common situation data have gaps longer than one character, and affine gap algorithm distinguishes

• cost of opening the gap of any length (applied to the first character of gap, like A)
• cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called extension because it keeps the match and mismatch scoring system idea from Needleman-Wunsch, and adds sophisticated rules for gaps.

### NB: the affine gap algorithm result will consider this alignment better

than

because second option opens more gaps (remember, opening a gap has higher penalty)).

For example:

• gap open score = -3
• gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x -3 + 4 x -1 = -7, J to JOHAN is  1 x -3 + 3 x -1 = penalty -6 for this gap, and one more penalty for SO to SON -3, so the total penalty for all gaps was -7 + -6 + -3 = -16 (see, even with greater score for openo=ing a gap it is better than -22 when all the gaps were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in details – better try understand the business behind and. when needed, just follow formulas :) If I have done the full calculations, I would have for more fun considered scoring where

• mismatch (changing one symbol to other) costs by default 2
• mismatch – changing L to E or E to L costs 1 (less because they look similar – I could add any score for any pair of symbols, as well to bonus O and Q, as to punish Q and W)
• match of symbols costs 5 (it is most preferrable situation for me).

Have a look on formulas used – there are three matrices involved:

(source: https://users.info.uvt.ro/~dzaharie/bioinfo2016/proiecte/biblio/MultipleAlignment%2BDP/MultipleAlignment%2BDP.pdf)

When you understand the business behind, the formulas will not seem so complex: a lot combinations possible when we differ gaps starting and extending costs.

A

A

### a symbol can mismatch a symbol in the other string however do not start a gap in the other string

because of scoring system used (like we can set that O is the same as 0 or L mismatch to E is not as good as match, but at the same time better than starting a gap)

AO

A0

AB

A

ABC

A –

Q – – D

QABD

– – M

AB –

### NB that we are not considering gap to gap, thus it is not possible situation like

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues gap in both strings (symbol cannot be both symbol and a gap)”.

Some paragraphs above I wrote that affine gap prefers SSON to -SO-, not to S-O- – this is done by performing calculations always for both strings and both operations:

we calculate not only the score of opening a gap, but also score of ending the gap.

Then, when backtracing the optimal path, we walk through SSON to -SO- path.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.