In my previous blog posts we have learned that

**Crts**is similar to**Cats**– so our autocorrect software might choose this word from vocabulary to correct Crts*(Levenstein distance)***Davod1ee**to**DavidLee**– so we could do good guess that this was a mention of our professor David Lee*(Needleman-Wunsch measure)*

Today’s topic is: how to instruct the computer that** Alex Jso **is similar to **Alexander Johansson**?

### Let’s start with looking why classics is not enough:

- EDIT DISTANCE (Levenshtein distance) =
**16***(mammamia! Who said they are similar?)*

(http://www.convertforfree.com/levenshtein-distance-calculator/)

- Needleman-Wunsch measure =
**-6**(scoring match = 2; mismatch = -2; gap = -2) – mnja. PICADILLA-CROCODILE has better NW measure**-4**, are they really more likely to be similar?

(http://experiments.mostafa.io/public/needleman-wunsch/)

### The Affine Gap measure solves this problem – continuing the gap has less penalty than opening a gap, and we can even modify that the longer gap is continued, the lesser penalty becomes

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension to Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved, and each symbol gap was penalised equally and summary cost of any gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x -2 = penalty -10, J to JOHAN is 4 x -2 = penalty -8 for this gap, and two more penalties for other gaps S to SS, O to ON, so the total penalty for all gaps was 11 x -2 = **-22!**

In real life this is a very common situation data have gaps longer than one character, and affine gap algorithm distinguishes

- cost of opening the gap of any length (applied to the first character of gap, like A)
- cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called extension because it keeps the match and mismatch scoring system idea from Needleman-Wunsch, and adds sophisticated rules for gaps.

### NB: the affine gap algorithm result will consider this alignment better

than

because second option opens more gaps (remember, opening a gap has higher penalty)).

For example:

- gap open score = -3
- gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x -3 + 4 x -1 = -7, J to JOHAN is 1 x -3 + 3 x -1 = penalty -6 for this gap, and one more penalty for SO to SON -3, so the total penalty for all gaps was -7 + -6 + -3 = **-16 **(see, even with greater score for openo=ing a gap it is better than -22 when all the gaps were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in details – better try understand the business behind and. when needed, just follow formulas :) If I have done the full calculations, I would have for more fun considered scoring where

- mismatch (changing one symbol to other) costs by default 2
- mismatch – changing L to E or E to L costs 1 (less because they look similar – I could add any score for any pair of symbols, as well to bonus O and Q, as to punish Q and W)
- match of symbols costs 5 (it is most preferrable situation for me).

Have a look on formulas used – there are three matrices involved:

(source: http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf)

(source: https://www.cs.umd.edu/class/fall2010/cmsc423/lectures/Lec04-gaps.pdf)

(source: https://users.info.uvt.ro/~dzaharie/bioinfo2016/proiecte/biblio/MultipleAlignment%2BDP/MultipleAlignment%2BDP.pdf)

When you understand the business behind, the formulas will not seem so complex: a lot combinations possible when we differ gaps starting and extending costs.

### a symbol can match a symbol in other string and do not start or close any gaps

**A**

**A**

### a symbol can mismatch a symbol in the other string however do not start a gap in the other string

because of scoring system used (like we can set that O is the same as 0 or L mismatch to E is not as good as match, but at the same time better than starting a gap)

A**O**

A**0**

### a symbol in string can start a gap in the other string

A**B**

A **–**

### a symbol in string can continue a gap in the other string

AB**C**

A – **–**

### a symbol in string can close a gap in this string and do not start a gap the other string

Q – – **D**

QAB**D**

### a symbol in string can close a gap in this string and can start a gap in the other string

– – **M**

AB **– **

### NB that we are not considering gap to gap, thus it is not possible situation like

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues gap in both strings (symbol cannot be both symbol and a gap)”.

Some paragraphs above I wrote that affine gap prefers SSON to -SO-, not to S-O- – this is done by performing calculations always for both strings and both operations:

**we calculate not only the score of opening a gap, but also score of ending the gap. **

Then, when backtracing the optimal path, we walk through SSON to -SO- path.

Disclaimer

This blog is solely my personal reflections.

Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.

This is neither a formal review nor requested feedback and not a complete study material.