Big Data: string similarity: mind the Gap. The Affine Gap.

In my previous blog posts we have learned that

  • Crts is similar to Cats – so our autocorrect software might choose this word from vocabulary to correct Crts (Levenstein distance)
  • Davod1ee to DavidLee – so we could do good guess that this was a mention of our professor David Lee (Needleman-Wunsch measure)

Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?

Let’s start with looking why classics is not enough:

  • EDIT DISTANCE (Levenshtein distance) = 16 (mammamia! Who said they are similar?)



  • Needleman-Wunsch measure = -6 (scoring match = 2; mismatch = -2; gap = -2) – mnja. PICADILLA-CROCODILE has better NW measure -4, are they really more likely to be similar?



The Affine Gap measure solves this problem – continuing the gap has less penalty than opening a gap, and we can even modify that the longer gap is continued, the lesser penalty becomes

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension to Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved, and each symbol gap was penalised equally and summary cost of any gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x -2 = penalty -10, J to JOHAN is  4 x -2 = penalty -8 for this gap, and two more penalties for other gaps S to SS, O to ON, so the total penalty for all gaps was 11 x -2 = -22!

In real life this is a very common situation data have gaps longer than one character, and affine gap algorithm distinguishes

  • cost of opening the gap of any length (applied to the first character of gap, like A)
  • cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called extension because it keeps the match and mismatch scoring system idea from Needleman-Wunsch, and adds sophisticated rules for gaps.

NB: the affine gap algorithm result will consider this alignment better




because second option opens more gaps (remember, opening a gap has higher penalty)).

For example:

  • gap open score = -3
  • gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x -3 + 4 x -1 = -7, J to JOHAN is  1 x -3 + 3 x -1 = penalty -6 for this gap, and one more penalty for SO to SON -3, so the total penalty for all gaps was -7 + -6 + -3 = -16 (see, even with greater score for openo=ing a gap it is better than -22 when all the gaps were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in details – better try understand the business behind and. when needed, just follow formulas :) If I have done the full calculations, I would have for more fun considered scoring where

  • mismatch (changing one symbol to other) costs by default 2
  • mismatch – changing L to E or E to L costs 1 (less because they look similar – I could add any score for any pair of symbols, as well to bonus O and Q, as to punish Q and W)
  • match of symbols costs 5 (it is most preferrable situation for me).

Have a look on formulas used – there are three matrices involved:







When you understand the business behind, the formulas will not seem so complex: a lot combinations possible when we differ gaps starting and extending costs.

a symbol can match a symbol in other string and do not start or close any gaps



a symbol can mismatch a symbol in the other string however do not start a gap in the other string

because of scoring system used (like we can set that O is the same as 0 or L mismatch to E is not as good as match, but at the same time better than starting a gap)



a symbol in string can start a gap in the other string



a symbol in string can continue a gap in the other string


A – 

a symbol in string can close a gap in this string and do not start a gap the other string

Q – – D


a symbol in string can close a gap in this string and can start a gap in the other string

– – M

AB – 

NB that we are not considering gap to gap, thus it is not possible situation like

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues gap in both strings (symbol cannot be both symbol and a gap)”.

Some paragraphs above I wrote that affine gap prefers SSON to -SO-, not to S-O- – this is done by performing calculations always for both strings and both operations:

we calculate not only the score of opening a gap, but also score of ending the gap.

Then, when backtracing the optimal path, we walk through SSON to -SO- path.


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: