## Big Data: set similarity : q-grams, Overlap measure, Jaccard index, Jaccard distance

Today’s topic: is John Smith similar to Smith John? As you see, Edit distance 10 will measure as not similar, so as Needleman-Wunsch (score = – 5). Smith Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find guys of which class have the most similar names to your class? He answered: well, we could take name by name and calculate Edit distance score for each and then sum results together?

Let’s have a look closer on some official methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

### First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start having names as our tokens. Set similarity operates with tokens, not “words” because words sound like are something meaningful for us. Tokens are… just tokens :) anything we have agree how to split the set – you’ll see several approaches.

Simple: if we use word a s token then string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that we during tokenisation remove common stop words like “of, the, and” or ” Mr, Ms, fon, da” etc.

Then string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

## q-grams or n-grams

This impressive name just means that we split the string in substrings of length q. I’ll use David and show you two approaches used – which approach you choose  later depends on your needs and further usage of results.

q=1 (unigram):

• {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

• first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad for the length)
• second approach: {da,av,vi,id}

q=3 (trigram):

• {##d,#da,dav,avi,vid,id#,d##}
• {dav,avi,vid}

q=4 (four-gram):

• {###d,##da,#dav,davi,avid,vid#,id##,d###}
• {davi,avid}

q=5 (five-gram):

• {####d,###da,##dav,#davi,david,#avid,##vid,###id,####d}
• {david}

I used R for some tests, see it does the second approach:

### Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects language by comparing q-grams to the statistical results of languages. This my blog entry could be split in bigrams and would have result “English” returned.

Q-gramming may be applied to word level also.

The quick brown fox jumps over a lazy dog.

bigram:

• {# the,the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog,dog #}
• {the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog}

trigram

• {# # the, # the quick,the quick brown,quick brown fox,brown fox jumps,fox jumps over,jumps over a,over a lazy,a lazy dog,lazy dog #,dog # #}

etc.

Note it is a analyst’s (my, yours) decision if the method used requires some data cleaning like removal of punctuation, whitespaces, letter de-capitalising etc.

## The overlap measure

Let’s do some simply examples for learning the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation.

Let S1 = set of tokens of string 1.

• {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

• {#c,ca,ar,ro,ol,l#}

### Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation.

• {##c,#ca,car,arl,rlo,lo#,o##}
• {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

## The Jaccard index a.k.a coefficient a.k.a measure

Well, you saw the Overlap measure. What is it? It actually has no value itself without context. We need to make it usable. Let’s have an index always 0..1 – the more to 1, the higher is set similarity. I’ll also multiply by 100 to have percents.

Jaccard(S1,S2) = count of common tokens  in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing a bit more scientific:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union are all unique, while union all are all.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)

String 1 Carlo, string 2 Carol.

### If I use bigrams for tokenisation:

• S1 = {#c,ca,ar,rl,lo,o#}
• S2 = {#c,ca,ar,ro,ol,l#}
• Overlap(S1,S2) = 3
• total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%

Visualisation from http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

### If I use trigrams for tokenisation:

• S1 = {##c,#ca,car,arl,rlo,lo#,o##}
• S2 = {##c,#ca,car,aro,rol,ol#,l##}
• Overlap(S1,S2) = 3
• total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

### Interpreting the result:

• Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
• If they have no common tokens, they are 0% similar.
• The half –  50% – means that the two sets have common half of the members.

## The Jaccard distance

Measure of how different or dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams have Jaccards distance = 100% – 25% = 75%

### So what?

Here you might ask: hey, but you showed they are not usable at all! Carlo IS similar to Carol!

Let me remind I was showing the idea on a very simplified example and we are talking about set similarity, not string similarity.

To treat as a string, let’s use Jaccard for Carlo and Carol having q=1

• {c,a,r,l,o}
• {c,a,r,o,l}

Voila! Jaccard measure = 1, Jaccard distance = 0%.

## Now back to sets and names in classes.

### Tokenisation method 1 – names are tokens:

S1={Katrina,Janis,Karlis,Martins,Anna,Karlina}

S2={Annija,Martins,Marta,Karolina,Ance}

O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: not similarity, but dissimilarity)

Conclusion: by method 1 names in these classes are not similar.

And now let’s calculate the same using bigrams :)

### Tokenisation method 2 – bigrams are tokens:

I used R for splitting to bigrams

and Excel for visualisation

Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2 names in these classes are half similar.

### Tokenisation method 3 – trigrams are tokens:

Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3 names in these classes are quite a little similar.

Have you got the idea? Now you can apply the same method on different classes and have a reliable method to measure similarity.

## Which one to choose?

The method which suits YOU the best. I can’t predict, will you consider Marta and Martins similar or dissimilar in YOUR business case.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: string similarity: dealing with typos (Jaro meausre, ups, measure)

Have you evre had a typo? How to instruct the computer that Luxmeb is most likely meant Luxembourg? (try to write in google and, see, it knows! :) let’s have look – how.

For this type of short strings like words Jaro measrue, ups, Jaro measure or Jaro distance or Jaro score helps.

The higher the Jaro distance is for two words is, the more similar these words are. The score 0 means no similarity and 1 means an exact match. Sometimes result is used in percents, I’ll show both.

The Jaro measure idea is assumption that

### if one word contains near the same letters as the other and these letters can quite easily be reordered

to match – like length and lenght – or teh/the, dacnign/dancing

### –  then words are similar.

Values used in the calculation of the Jaro Distance:

1. Length of words being compared
2. Count of common characters
3. Count of transpose (exchange) operations needed to reorder common characters to equal order

Official formula looks complex, like scientists usually do to impress us:

• s1 = Jon length 3
• s2 = John length 4
• m = 3 (three common characters J, o, n
• t = 0 (no symbol transpositions needed Jon to John, because matching symbols are already in the same order)

jaro(Jon,John) = 1 / 3 x (3 matching sumbols /3 length of Jon + 3 matching symbols / 4 length of John + ((3 matching symbols – 0 transpositions)/3 matching symbols) = 1 / 3 x (3/3 + 3/4 + 3/3) = 1/3 x (1+ 0.75 + 1) = 1/3 x 2.75 = 0.917 (quite similar) or 92%

Edit distance of Jon, John = 1. Levenstein Distance Measure is  1 – d(x,y) / [max(length(x), length(y))] : s(Jon,John) = 1 – 1 / [max(3, 4)] = 1 – 1/4 = 0.75 or 75% similarity

### You see, Jaro score 92% is better than Levenshtein’s 75%.

and also better than

• Needleman-Wunsch 75% (scoring 1;-1;-1) (three matches 1×3 + one gap penalty -1 = 2)
• Smith-Waterman 83% (Jo and Jo, n and n are the best matching substrings)

(calculator to play – https://asecuritysite.com/forensics/simstring)

### Let’s calculate Jaro now for Jon and Ojhn

• s1 = Jon length 3
• s2 = Ojhn length 4
• m = 3 (three common characters J, o, n)
• t = 1 (transpositions needed to reorder common symbols (ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-1)/3) = 0.81 or 81% similarity

### Jaro for Jon and Ohnj (I reordered symbols)

• s1 = Jon length 3
• s2 = Ohnj length 4
• m = 3 (three common characters J, o, n)
• t = 2 (transpositions needed to reorder common symbols (onj – ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-2)/3) = 0.70 or 70% similarity

Levenstein distance for JON, OHNJ is 3, measure is 1 – 3/4 = 0.25 or 25% similarity. See the difference?

### Jaro for Luxmeb and Luxembourg

• s1 = Luxmeb length 6
• s2 = Luxembourg length 10
• m = 6 (seven common characters L,u,x,e,m,b)
• t = 1 (transpositions needed to reorder common symbols:  Luxmeb – Luxemb

Jaro = 1 / 3 x (6/6 + 6/10 + (6-1)/6) = 0.81 is 81% similarity

Edit distance result is 1 – 6 / [max(6,10)] = 1 – 6/10 = 0.4 is 40% similarity.

### A variation to handle prefixes is Jaro – Winkler measure.

In real life this is a very common situation the strings have the same beginning (I bleieve you all konw tihs jkoe). Like beginnign, Luxmebogr, lenhgt, meausre. To take that into account and improve scoring, there is a measure

• dw is Jaro-Winkler measure result
• dj is Jaro distance – like we calculated before
• l is the length of longest common exact matching prefix (often referred as max possible 4)
• p is the weight given to the prefix (how important it is for us) (default use to be 0.1)

Another way to express the same formula is

• jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW

where

• PL = length of the longest common prefix
• PW is a weight given to the prefix

### Let’s calculate Jaro-Winkler for Luxmebogr and Luxembourg.

• I’ll assume that prefix weight is recomended 0.1
• Jaro distance (s1=9, s2=10, m=9, t=1) = 0.91 (91%)
• Longest common exactly matching prefix is Lux

Jaro-Winkler distance is

• 0.91 + (3×0.1x(1-0.91)) = 0.94 or 94% similarity (first formula)
• (1-3×0.1)x0.91 + 3×0.1 = 0.94 or 94% similarity (second formula – of course, the same because formulas are equal, just written in another way)

### Jaro-Winkler for Luxmeb and Luxembourg

Jaro was 0.81 or 81% similarity

0.81 + (3×0.1x(1-0.81)) = 0.87 or 87% similarity

Can’t believe it is so easy, can you?

There are many more string similarity measures and algorithms in the world, different time elapsed, different precision and a lot of adaptions for specialized cases like nucleotides.

However, if you have read all my string similarity posts, you should have a good foundation of the basic ones.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: string similarity: best matching substrings between two strings (Smith-Waterman algorithm)

Previous methods I wrote about considered matching all characters to all characters between two strings (when entire string is aligned, it is called global alignment). As you see from our new challenge, this is not suitable in some cases:

can we somehow find out that Mr. Alexander Fitzgerald, University of Canterbury is similar to Professor, Dr.sc.comp Alexander Williams Fitzgerald from Canterbury, New Zealand?

Edit distance = 54 (https://planetcalc.com/1721/) – very long.

Needleman-Wunsch: -32 (http://bioinformaticnotes.com/Needle-Water/) similarity quite low.

### Better idea: find two substrings of given strings that are the most similar one to another

(when parts of strings are compared, it is called local alignment). We can already guess the best matching one will be ‘Fitzgerald’ but computer must be provided with logic. Method I am blogging about was first proposed by Temple F. Smith and Michael S. Waterman in 1981.

The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are reset to zero, thus now a match can be restarted at any position, not only the very corner. It means if you learn to calculate Needleman-Wunsch, it is easy to adapt to Smith-Waterman. Just stay positive :)

#### The idea is simple, yet brilliant:

1. add one more condition to Needleman – Wunsch: the lowest possible value is 0, so the result will never be negative:

2.  when bactracking for the alignment, just find the highest value and go up until you reach 0. Repeat searching for the next highest as many times you want.

• Best match is AWGHE to AW_HE
• Next best: HEA to HEA

Let’s look on a sample trying to fit the picture in the screen:

• PROFESSOR WILLIAM SMITH DR.SC.COMP CANTERBURY
• DUDE WILL SMITH NEIGHBOUR

I believe, YOU see that best matching substrings are WILL and SMITH. But computer don’t have the brains as you have. By using Smith-Waterman measure, the computer can only calculate it by predefined algorithm.

You probably cannot see there but in this picture SMITH to SMITH max result in the bottom-right is 18. WILL to WILL is 10.

Here comes the full table of original example – ALEXANDER, FITZGERALD, CANTERBURY:

Let’s add some foreign language for fun – you’ll see that also Smith  and Waterman were not magicians, however we can do a good guess with their algorithm:

• DR.ALEKSANDRS VILJAMS FICDŽERALDS KENTERBERIJA
• MR. ALEXANDER FITZGERALD CANTERBURY

and we see most similar substrings:

• KENTERBERI-CANTERBURY
• ERALDSKENTERB-ERALD_CANTERB
• FI-FI
• ALEKSANDRS VI – ALEX_ANDER FI

When you google, you’ll see a lot of bioinformatics and related samples TGCTACCGTAA….. I prefer a bit more human readable examples.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: string similarity: mind the Gap. The Affine Gap.

In my previous blog posts we have learned that

• Crts is similar to Cats – so our autocorrect software might choose this word from vocabulary to correct Crts (Levenstein distance)
• Davod1ee to DavidLee – so we could do good guess that this was a mention of our professor David Lee (Needleman-Wunsch measure)

Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?

• EDIT DISTANCE (Levenshtein distance) = 16 (mammamia! Who said they are similar?)

• Needleman-Wunsch measure = -6 (scoring match = 2; mismatch = -2; gap = -2) – mnja. PICADILLA-CROCODILE has better NW measure -4, are they really more likely to be similar?

### The Affine Gap measure solves this problem – continuing the gap has less penalty than opening a gap, and we can even modify that the longer gap is continued, the lesser penalty becomes

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension to Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved, and each symbol gap was penalised equally and summary cost of any gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x -2 = penalty -10, J to JOHAN is  4 x -2 = penalty -8 for this gap, and two more penalties for other gaps S to SS, O to ON, so the total penalty for all gaps was 11 x -2 = -22!

In real life this is a very common situation data have gaps longer than one character, and affine gap algorithm distinguishes

• cost of opening the gap of any length (applied to the first character of gap, like A)
• cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called extension because it keeps the match and mismatch scoring system idea from Needleman-Wunsch, and adds sophisticated rules for gaps.

### NB: the affine gap algorithm result will consider this alignment better

than

because second option opens more gaps (remember, opening a gap has higher penalty)).

For example:

• gap open score = -3
• gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x -3 + 4 x -1 = -7, J to JOHAN is  1 x -3 + 3 x -1 = penalty -6 for this gap, and one more penalty for SO to SON -3, so the total penalty for all gaps was -7 + -6 + -3 = -16 (see, even with greater score for openo=ing a gap it is better than -22 when all the gaps were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in details – better try understand the business behind and. when needed, just follow formulas :) If I have done the full calculations, I would have for more fun considered scoring where

• mismatch (changing one symbol to other) costs by default 2
• mismatch – changing L to E or E to L costs 1 (less because they look similar – I could add any score for any pair of symbols, as well to bonus O and Q, as to punish Q and W)
• match of symbols costs 5 (it is most preferrable situation for me).

Have a look on formulas used – there are three matrices involved:

(source: https://users.info.uvt.ro/~dzaharie/bioinfo2016/proiecte/biblio/MultipleAlignment%2BDP/MultipleAlignment%2BDP.pdf)

When you understand the business behind, the formulas will not seem so complex: a lot combinations possible when we differ gaps starting and extending costs.

A

A

### a symbol can mismatch a symbol in the other string however do not start a gap in the other string

because of scoring system used (like we can set that O is the same as 0 or L mismatch to E is not as good as match, but at the same time better than starting a gap)

AO

A0

AB

A

ABC

A –

Q – – D

QABD

– – M

AB –

### NB that we are not considering gap to gap, thus it is not possible situation like

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues gap in both strings (symbol cannot be both symbol and a gap)”.

Some paragraphs above I wrote that affine gap prefers SSON to -SO-, not to S-O- – this is done by performing calculations always for both strings and both operations:

we calculate not only the score of opening a gap, but also score of ending the gap.

Then, when backtracing the optimal path, we walk through SSON to -SO- path.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: string similarity – match, mismatch and gap scores come into play (Needleman-Wunsch measure)

Imagine big data flowing in your system. Reviews, articles, blogs, comments, logfiles… day by day, unpredictable quality. And you there in the middle trying to solve:

• Is Dave Smith in review the same as David Smith in article?
• Is Davod R. Smith the same as Daivd Robert Smyth?
• Could it be Davod Snoth in comment is the same as Davidd Smith in review? (notice I and O, N and M near on the keyboard)
• Is Professor David Robert Smith-Smithsson in one article the same as David R. Smyth in another article?
• Could it be Dvae Smitt is the same as David Smith?
• Is Dav1d R0berl 5mith (it might be OCR output) the same as Daivd R. Smith?
• Is Daid Smith the same as Davids Smits (latvian article)?
• Is his job address Kr Barona street 117 the same as 117, Krisjana Barona street?

Scientists have worked quite hard to give a helppful hand, and I will briefly overview some of string similarity measurements and methods because I think it is nice to know that no magic – pure logic.

One blog post earlier I invested a time to teach myself and you, dear readers, to calculate EDIT DISTANCE. It was worth it because the next measure I will discuss about, is Needleman-Wunsch Measure developed by Saul B. Needleman and Christian D. Wunsch in 1970.

I will remind that in EDIT DISTANCE or LEVENSTEIN distance case we had a tangible result: resulting number was count of insert/delete/substitute operations needed to transform one string to another.

Here are two strings. We, humans, might think it is a high change the object is the same – David Smith. When we see EDIT DISTANCE value 8 for 14 symbols long string, it signalizes that the strings are not similar – and technically they definitely are not:

https://andrew.hedges.name/experiments/levenshtein/

Similarity measure is 1-8/max(14,18) = 1-8/18 = 0.55 – we have to have really loose boundaries to consider these strings similar using Levenstein distance :)

Let’s have a look what some other scientists have done!

Needleman-Wunsch measure you will hear mostly within biology context, comparing DNA sequences. I will make its explanation more human centred, using words.

Needleman-Wunch measure is mathematically proven to find the fewest number of mutations of two sequences, identifying matching parts and the changes needed to transfer one sequence into the other.

Please note that now we will talk about the measure. This is not tangible count of neither I/D/S operations, nor parrots, symbols or atoms. The benefit of using a measure is ability to compare. We will use the same method of comparison to different strings and will use this output measure to evaluate bad-good-better-best matching strings.

I will give also lot of examples not to just another theoretical blablabla. I can’t thank enough this site: http://experiments.mostafa.io/public/needleman-wunsch/

Usually you’ll use samples from biology like TCGAC and TCATA but I’ll use words. Let’s have a look using the same PICADILLA and CROCODILE.

The basic idea within Needleman-Wunsch measure is advanced scoring system which is used in calculation rules:

• match score. If letters match, we assign them a matching score, usually defaulted as is 1, Thus we will reward the match of symbols – we will increase their rating
• mismatch score. If letters don’t match, mismatch score is usually defaulted as -1. Thus we will punish the mismatch of symbols – we will decrease their rating
• gap penalty. The gap score or gap penalty is usually defaulted as -1. Thus we will define, what is preferred: a gap (empty space) or a mismatch (another symbol).

Let’s start with the defaults for scores in our example – I’ll change them later and show you the impact:

• match score 1 (you see, that I like match the most)
• mismatch score -1
• gap penalty -1 (you see, I have set them equal – I don’t care, is it a gap or a mismatch)

The idea of dynamic calculation of Needleman-Wunch measure is similar to EDIT DISTANCE, just prepare a matrix with both strings and follow predefined (slightly different from edit distance) rules and take the maximum (in Edit distance we took the minimum) value from one of left, diag, up.

In addition to EDIT DISTANCE, we must set initial values in matrix – add gap penalties, see the grey line and column.

To calculate the first cell, the green :

• add the gap penalty to upper cell value
• add the gap penalty to left cell value
• in diagonal, we must use match and mismatch parameters
• If letters do not match, add the mismatch score to the diag value
• If letters match, add the match score to the diag value
• When doing that, also put a note, from which cell the final value was taken as max from three. If we can take value from more than one cell, mark them all.

Fill the empty row and empty column (initialize the matrix), only after it we can do the crosses.

Well, you saw the idea, now let’s have the whole matrix done (I am using http://experiments.mostafa.io/public/needleman-wunsch/):

Now, what the result means? As we have defined, we consider mismatch score to be the same as gap score and we see that the result is calculated to have the strings aligned in the best optimal way: there are

• two gaps -O, A-
• four mismatches P-C, I-R, A-O, L-E
• four matches C-C, D-D, I-I, L-L

What happens if we change the scoring system? Let’s switch that

### I prefer gap over mismatch:

• match score 1 (you see, I still like match the most)
• mismatch score -2
• gap penalty -1 (you see, I now prefer gap better than a mismatch)

Here you the impact when I consider gap is better than mismatch:

• ten gaps
• no mismatches – because we prefer gaps!
• four matches

Let’s continue playing with the scoring system. Let’s set that

### I don’t care for gaps and they are as nice as matches:

• match score 1 (you see, I still like match)
• mismatch score -1 (I still don’t like mismatch)
• gap penalty 1 (hehe)

and – voila! – algorithm used in that webpage just gapped all the symbols because I set in scoring system I don’t care for gaps.

Let’s go further! Let’s penalize gaps very hard.

### I really do not like gaps

• match score 1 (I still like match)
• mismatch score -1 (I still don’t like mismatch)
• gap penalty -4 (so I think gap is a very, very bad)

And look, we have strings aligned and there are no gaps at all because I have scored that mismatch is better than the gap.

Let’s have a look on our professors now.

Default scores:

### Gap is equally bad as mismatch

Ha, the Needleman-Wunch measure result is better than

### Gap is better than mismatch:

Still better than

### I really do not like gaps

Heh, professors are much worse because I have no chance to avoid gaps and are having these penalties.

Here I will note that from the table with arrows we can track back alignment path, and, when we have two arrows, from algorithm point of view any path is OK. For example

Notice the score is still the same but alignment differs. Calculation result is the same, either we choose replace E and gap D or gap E and replace D. If you see calculators in web pages where only one result is shown it means their programmers have voluntarily built in additional logic, preferring something.

## Needleman – Wunsch algorithm is neutral. Programmers are not.

Now some bonus to Needleman – Wunch measure: we may add

## additional scoring matrix to reward similar symbols like O and 0.

You may notice, the letters I and O are near on the keyboard, also letter O is similar to number 0, and lowest letter l is similar to I. We may define that mismatches (I and O) and (O and 0) and (L and I) are not the same bad as, eg, (A and M) mismatch. This is done by adding one more scoring lookup table for mismatch score like

here I have set that I do not penalise I and O, O and 0, L and I mismatches at all, and penalty is bigger for A and M mismatch. In this way you will easily find out that

• Davod and David
• Dav0d and Davod
• AIA and ALA (aIa and ala)

are very similar.

As you see, I haven’t added that Dav0d and David are very similar – then I must add one more mismatch value. Now I will not penalise also I and 0 mismatches. As 0 and 0, I give them higher score because I still prefer the real match over my pseudo matches.

Hmm, I think I also should add 1 and l mismatch not to be penalized. Then I might decide also 5 and S, 6 and G and so on – it depends on my software, my data and my business.

Got the idea?

## See the example now.

Let’s compare DAVOD1 with DAVIDL.

I drafted an additional MISMATCH matrix where I define some exceptions to default mistach score. I will consider

• O and 0 the same score as match, L and 1 as match
• I and O (they are near on keyboard), I and L, I and 1 near as match

Here is the my refined result according to my additional score matrix – Needleman-Wunsch measure is 11

And here it would look like if only defaults used – the Needleman-Wunsch measure is 6.

You see my refined result is considered significantly more similar (max possible would be value of 12 and my refined result had 11)

See you in next posts, there are a lot of similarity measures yet.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: one of string similarity measures: EDIT DISTANCE (or Levenshtein distance)

Have you ever thought how does spellchecker, search engine, translation, speech recognition etc. software find replacement options for the word you entered?

User typed:

Crts

What would be YOUR software action? Correct to

• Arts?
• Cats?
• Carts?
• Cuts?
• Cryptos?
• Crots?
• Rats?
• Crtl?
• CRC?

I hope, you’ll answer me – hey, it depends! Of course, it depends :) but how does your software know it? Is there some magic? Ah, no, only logic…

Here are some examples again: Google; Word + language English, Word + language Latvian

Have you ever wondered how are the data found which refer to the same real-world objects? Schwarzenegger, Svarceneger, Švartcnegers, Švarcenegers – why is the searching software so smart and understands that we are looking for the old, good Arnold?

## Similarity measures

One of similarity measures – not the only one! – is Levenshtein distance, named after the Russian scientist Vladimir Levenshtein, often called also EDIT DISTANCE or minimum edit distance, the operation count needed to convert one phrase into the other. If you have enough patience to read this post, you will be able to calculate the Levenshtein distance yourself or at least understand jow the online calculators work.

The smaller is edit distance, the higher is the chance that words are similar.

NB: outside this blog entry there are other measures – Needleman-Wunch, Smith-Waterman, Jaro, Jaro-Winkler, Affine gap etc.

Have you ever played to see when Google switches to offering other words and trying to guess the reason?

Change a bit and have different set:

Have you been playing with spellchecker options? I enjoy guessing algorithms and logic behind.

If you work in data integration area, this type of tasks might be one or your “business as usual” tasks – recognizing that customer “David Smith” is the same as “David R. Smith” and “Davod Smit”. Are addresses “K. Barona 117”and “117, Krisjana Barona street” the same?

The general term is string matching challenge.

Look, different software, but the option offered to Picadilla is the same :)

The minimum edit distance is the minimum number of editing operations (insert, delete, replace) to transform one string into the other (doing that vice versa, the second string will be converted back to the first).

## Edit Operations

Deletion, insertion, and replacement (or sometimes called substitution) operations of characters may have assigned different weights. The usual choice is to set all three weights to 1, however different weights allow more flexible search strategies in lists of words. For example, you may consider that substitution costs 2. Or you may prefer insertion to deletion, by setting insert weights 1, delete weights 2.

• Insertion weight = 1
• Cts -> Cats : edit distance = 1
• Cts -> Cites : edit distance = 2
• Cts -> Cities : edit distance = 3
• Cts -> Citizens : edit distance = 5
• Deletion weight = 1
• Carts -> Cats : edit distance = 1
• Carts -> Car : edit distance = 2
• Carts -> as : edit distance = 3
• Carts -> a : edit distance = 4
• Substitution weight = 1
• Crts -> Cats : edit distance = 1
• Crts -> Cups : edit distance = 2
• Crts -> Avio : edit distance = 4

If we change weight – eg, by doing that we might rank up similar words where only delete required:

• Substitution weight = 2 (consider it as equal to insert + delete)
• Crts -> (insert a, delete r) Cats : edit distance = 2
• Crts -> (insert up, delete rt)Cups : edit distance = 4
• Crts -> (insert Avio, delete Crts) Avio : edit distance = 8

### Full replacement – too easy for us

What we can see quite obvious, is that full replacement works always.

• QQQ (length 3) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 3 deletes + 17 inserts = 20 operations
• PSYCHO (length 6) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 6 + 17 = 23. But I see some characters are the same!
• ROTORON (length 7) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 7 + 17 = 24. As we see, all characters are the same, is it really 24 operations necessary?
• STUDENT (length 7) -> TUDENTS (length 7) max edit distance is 7 + 7 = 14. Wait, it can’t be true, we just need to move one letter! Does the computer know this?

It means, if we need just to change strings, no worries, delete + insert (overwrite) and forget edit distance at all.

If we are going to add value, let’s understand the minimum edit distance. It is not just ANY sequence of operations to get the desired result – transformed one sting to another. This is THE MINIMUM number of operations needed. How can we find it? I’ll show you both guessing method and scientific method.

Let’s start with a very simple example. What’s the edit distance between

• ACE
• BASE

I believe you see the answer right now and say – only two operations needed, insert B and replace C to S. I just wanted to start with this very simple example because scientists for several decades were trying to find out the faster way how to find the edit distance, without calculating values for the whole matrix. They just believed – as I did – there should be a faster way instead of calculating the whole matrix! You will find the sad truth at the very end of this post.

So, maximum operations needed would be 7 (delete 3, insert 4). But we are looking for the minimum. To find it in a scientific way, we must fill a matrix, table cell by cell. This matrix size is length of phrase ACE multiplied by length of phrase BASE, in our case 3*4 = 12 cells.

By the way, here you see quadratic time concept: when the length of phrase increases a little, the number of cells in matrix increase much more lot.

If our second phrase will change by 1 symbol from BASE to BASIC, we will have 3*5=15 cells. 1 symbol, 3 new cells.

If by 4 to BASEMENT, then 3*5=24 cells. 4 symbols, 12 new cells. Imagine that now with larger phrases up to thousands… billions of symbols like in DNS sequences.

To find the minimum edit distance, we must fill the table cell by cell for three operations

• INSERT operation, we’ll consider edit cost (weight) as 1
• SUBSTITUTE (replace), we’ll consider edit cost (weight) as 0, if letters are equal, and 1, if letters are different
• DELETION, we’ll consider edit cost (weight) as 1

## Let’s fill the table

So, we draw a table and first we fill so called base columns with operations count

• Transform empty string into string 1, in our case ACE (create string from nothing)
• Transform string 2, in our case BASE into empty string (erase string)

Then we follow rules, filling the other cells:

• If letters are not equal, we pick the minimal value of three cells: upper (insert), diag (left-upper (substitution)), left (delete) and add 1 operation (see three blue cells example for green cell)
• If letters are equal, we copy over substitution cost from left-upper cell.

### Let’s look at another, more cats and zoo related example

Are these words similar? when user typed Picadilla, shall we offer to search for similar pets like Crocodile? a la ‘you might consider searching crocodile’, Did you mean pizza delivery’? Did you mean Piccadilly Circus?

• Or CACTUS? Is Cactus similar at all?

Let’s start investigation, mr. Inspector Caps, if the words PICADILLA and CROCODILE are similar at all.

• As we can set the rules ourselves, we will consider the strings are similar if minimum edit distance is less than 70% of string average length, in our case less than 70% from 9 is between 0..6. It is obvious for US the distance is not 0, because strings are not similar. But computer still needs to find it by following predefined logic.
• We will also use the official similar measure value s(x,y) = 1 – d(x,y) / [max(length(x), length(y))]

We, humans, face there many options now in front of us to transform the words. As I wrote, we always might delete one string and insert another. But we will not do this. We will try think smart. So. Hmm, should we start as

• substitute P -> C
• substitute I -> R

Or maybe better start

• Delete P
• Delete I

## Hmm. Ok. Let’s do it one way:

CROCODILE

• substitute P -> C (1 operation) CICADILLA
• substitute I -> R (1 operation) CRCADILLA
• substitute C -> O (1 operation) CROADILLA
• substitute A -> C (1 operation) CROCDILLA
• substitute D -> O (1 operation) CROCOILLA
• substitute I -> D (1 operation) CROCODLLA
• substitute L -> I (1 operation) CROCODILA
• hurrah, L = L
• substitute A -> E (1 operation) CROCODILE

Totals: 8 operations.

We see 8 is enough. But can we say 8 is the minimum edit distance?

### Let’s try another way.

CROCODILE

• delete P (1 operation) ICADILLA
• delete I (1 operation) CADILLA
• hurrah, C = C
• insert R (1 operation) CRADILLA
• substitute A -> O (1 operation) CRODILLA
• insert C (1 operation) CROCDILLA
• insert O (1 operation) CROCODILLA
• delete L (1 operation) CROCODILA
• substitute A -> E (1 operation) CROCODILE

Heh, again 8. But I am quite sure there MUST be shorter distance. You know, intuition etc.

### Let’s think.

Let’s see the letters we have:

CROCODILE

Length of words is equal (9).

I have to do something with 6 outstanding letters PIAILA or CROOIE. Why did I have distance 8 before? Because I was not thinking at all.

Now:

• substitute P -> I (1 operation) CICADILLA
• substitute I -> R (1 operation) CRCADILLA
• insert O (1 operation) CROCADILLA
• substitute A -> O (1 operation) CROCODILLA
• substitute L -> E (1 operation) CROCODILEA
• delete A (1 operation) CROCODILE

Voila! We did it in distance 6. You might ask is it that simply – edit distance equals unmatching letters? No, it is not as simple, unfortunately.

### Proof

And now let’s fill The Matrix similarly we did before to prove if we are correct.

Voila!

• By our voluntarily definition the words PICADILLA and CROCODILE are similar
• By official s(x,y) = 1 – d(x,y) / [max(length(x), length(y))] : s(PICADILLA,CROCODILE) = 1 – 6 / [max(9, 9)] = 1 – 6/9 = 0.33 , ahemm, not too similar….

As you previously saw, I found the shortest way in third guessing attempt because I did on my intuition. Computers have no intuition. Can I afford this type of guessing if I must transform long strings, eg, billions of symbols long?

I played a bit with online distance calculators.

• Dillapica -> Crocodile, distance = 8
• Alpilidac -> Crocodile, distance = 9

the more of matching letters are not in the same sequence, the longer distance is. In our case matching letters are CDL, so when I put them in opposite order I get the longer distance, in our case max long I could achieve was 6 + 3 = 9.

Hmm, it would be interesting to write a code for fun which calculates longest possible edit distance between two strings when the letters are in different order :)

### Is word CACTUS similar to PICADILLA?

1) Remember, we voluntarily defined similarity as distance is less than 70% of string average length, in this case 70% of (6 + 9)/2 = 5.25, so we will consider the words similar, if minimum edit distance is 0..5. We see the distance is 7 – not similar.

2) By official s(x,y) = 1 – d(x,y) / [max(length(x), length(y))] : s(PICADILLA,CACTUS) = 1 – 7 / [max(9, 6)] = 1 – 6/9 = 0.22 again, not similar.

## Hamming distance

One of metrics, slightly similar to Edit distance, is between two strings of equal length is the number of positions at which the corresponding symbols are different. Other words to describe, are – minimum number of substitutions (replace) to transform one string to the other. Online calculator to play: http://valjok.blogspot.com/2015/08/hamming-distance-online.html

Hamming distance for {PICADILLA,CROCODILE} is 8. (Edit distance is 6)

## Conclusions:

• words PICADILLA and CROCODILE are more similar then CACTUS and PICADILLA by any of measures we used
• we may define our rules where words PICADILLA and CROCODILE are considered similar. Following the same rules, words CACTUS and PICADILLA are not similar, but, if we have a business need, we could define even a measure boundaries where these words are considered being similar.

And now the sad truth: Arturs Backurs, absolvent of LU CS Faculty and MIT graduate proved in this thesis that For 40 years, computer scientists looked for a solution that doesn’t exist.

“For 40 years, computer scientists have tried in vain to find a faster way to do an important calculation known as “edit distance.” Thanks to groundbreaking work from two researchers at MIT, they now know the reason they’ve continually failed is because a faster method is actually impossible to create.”

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

## Big Data: HADOOP ecosystem. There is no one ‘hadoop.zip’ to install

When I was a kid, there was a saying – we say ‘the Party’ and mean Lenin, we say ‘Lenin’ and mean the Party. Today we say Big Data and mean HADOOP which is presumed to be used for processing of 50% of enterprise data all around the world – hundreds to thousands of nodes and petabytes of data is nothing special.

I heard about 8500 computers and 100 petabytes HADOOP cluster in one of training videos.

HADOOP is everywhere: social media, searching tools, government, finances etc. Yahoo, Facebook, Amazon, eBay, IBM, American Airlines etc. P.S. this also mean very high volume of job postings. HADOOP admin salary about 90k – 120k USD/year, developer/data scientist 60k – 150k USD/year.

In one sentence: HADOOP framework is the collection of open source LINUX based software to enable big data flow to be split and processed locally by using parallel, distributed (map-reduce) algorithms across many (one to thousands) low-cost computers and application layer (HDFS) handling hardware faults. HADOOP started its raise since ~Y2003.

• lots of data flowing in at a very high speed,
• large and increasing volume,
• variety of unstructured, different data – logfiles, videos, messages, records, comments, chats
• if you lose a computer, it automatically rearranges and replicates the data – no data loss

HADOOP supports running applications on Big Data like social networking portal or recipes portal or trend analytics of retail.

It took a while to understand that Hadoop is like data batch processing fabrics in backoffice and is a set of tools, each strong in a particular area. Outside the HADOOP there are frontoffice applications who needs something to be done with data – queried, stored etc, eg, searching device or TOP 10 most sold books this week genre sci-fi or retrieving Facebook message. These applications send their tasks to HADOOP as tasks or jobs to be queued and performed by map reduce and the results to be sent back to the calling applications. This is done on data stored in HDFS file system.

Examples of applications where HADOOP is behind the scenes:

• Mining of users behaviour to generate targeted advertisements and recommendations (Apache Mahout)
• Searching and grouping documents, based on certain criteria
• Searching uncommon patterns to detect fraud
• And, of course, Facebook which runs the world’s largest Hadoop cluster. Your messages, likes, shares and comments, they are there, in Facebook data centers and HADOOP. When their previous messaging platform was running our limits they spent weeks testing different frameworks, to evaluate the clusters of MySQL, Apache Cassandra, Apache HBase and other systems. Facebook selected Apache HBase, one of HADOOP family.

Programmers love HADOOP because they do not have to worry about, where data are, what if computers fails, how to split big data and how to break down tasks. If a code works for a megabyte file, it will work for hundreds of gigabytes. If it works on one computer, it will work on 8500 computers. it’s HADOOP business – scalability.

Teamwork. No superhero.

If we use our classical enterprise mindset these data would be handled by a one supercomputer. But everything has its limits, even supercomputers, shall it be processing speed limit or disk space limit.

HADOOP splits and conquers, combining “weakness” of each separate computer (low price, so called commodity hardware)  into a powerful engine. You can add more and more computers (scalability) when your data are growing or requirements changing. The bright side of HADOOP scalability is that count of computers is linear to processing speed: to double the processing speed double the number of computers.

• Data processing part of HADOOP is map-reduce.
• Data storing data is HDFS (HADOOP distributed file storage).

Slaves

Computers incorporated into HADOOP framework are called slaves. Each has processes running on it:

• Tack Tracker: responsible for performing the piece of task assigned to this computer
• Data Node: responsible for managing the piece of data given to this computer

Master (or masters)

As word slaves might have made you think, they have a master. NB: Master computer is not a supercomputer, it is one of commodity hardware like others. This node has the same processes running as each slave – tack tracker and data node, and has yet two additional processes running:

• Job Tracker to break tasks into smaller pieces, send to tack tracker processes, receive results, combine results and send the result back to the calling application
• Name Node to keep an index which data are residing on which node (computer). It tells the calling application which node stores the data required, and the application contacts this node directly, it is not depending on name node to return the dataset. Actually, data flow itself never goes through Name Node, it only points to the storing Node.

Task tracker and Job tracker are puzzle pieces of Map Reduce component.

Name Node and Data Node are puzzle pieces of HDFS.

HDFS

HDFS software is built presuming that hardware fails will definitely happen. It is called built-in fault tolerance.

By default, HADOOP maintains three copies of each data file in different computers. Once a node fails system keeps running and if a node is later restored then HADOOP maintains data copied are spread to this node.

Interesting that fault tolerance applies no only to classical “disk failure” but also to failure of task tracker processes. In case it happens, Job tracker asks another node to perform that job.

Does it sound like Master Computer is single point of failure now?

HADOOP has solution also for Masters failures. The tables masters store data indexes (which data files are on which nodes) are also backed up to different computers. HADOOP can be set to have more than one master where backup master overtakes in case main master fails.

When trying to understand that universe you feel like in the Bold and the Beautiful where everyone has married everyone else for several times and have several backup wives :D

Some of tools within HADOOP ecosystem  – be aware, list is growing

Apache HIVE: data warehouse – query, analysis, summaries. It takes a SQL like language, then converts it to Pig and then to Map-Reduce. Unline SQL, querying HIVE always inserts the results into a table.

Facebook uses this up to 90% of their computations.

I wrote a HIVEQL query which finds all Picadilla page views referred October 2017 by my cat farm main competitor BestCats.com

 ```INSERT OVERWRITE TABLE bestcats_refer_picadilla SELECT picadilla_page_views.* FROM picadilla_page_views WHERE picadilla_page_views.date >= '2017-10-01'  AND picadilla_page_views.date <= '2017-10-31'  AND picadilla_page_views.url_referrer like '%bestcats.com';```

Apache HBASE: people need to read and write their data in real-time. HBASE meets that need: it is a noSQL big data store, providing real-time read/write access to large datasets in distributed environment (as all our data are spread across HADOOP cluster).

HBASE can be accessed by Hive, by Pig and by Map Reduce, and stores the data in HDFS.

Facebook messaging is one of most famous examples of HBASE users. When you send a message to your friend in Facebook, this message actually is an object in an HBASE table. The more messages you send the more nodes Zuckerberg has to add to FB Cluster :)

HCatalog: HBASE table metadata storage, pulled out of HBASE and treated as a separate project for different data processing tools — Pig, MapReduce — to access it, read and write.

Apache Zookeeper: a centralized service for maintaining configuration information. It also stores some of HBASE metadata.

Apache Mahout: machine learning targeted advertisements and recommendations. It will scan the data in system and recommend ‘movies you might like’ or ‘places for you to explore’. It lets you write map reduce applications, focused on machine learning.

Apache Pig: tool for analysing data. It is a high-level language for programming Map-Reduce logic (Pig Latin – similar to SQL). Programmers write high level description and Pig translates it to machine code for Map Reduce and runs that map reduce.

Yahoo experience was that 50% of tasks are written by using Pig instead of direct Map Reduce coding by programmers.

I wrote a very simple script which takes all comments from my imaginary cats site and filters out mentions of my favorite cat Picadilla into a new file picadilla_mentions.

This action might be sent to HADOOP when you open my cats site page and click on Picadilla photo. Then you could have a page opened by telling what a famous cat she is:

```catnotices = LOAD 'cats_site_comments';
picanotices = FILTER catnotices BY \$0 MATCHES '.*PICADILL+.*';

Apache Oozie: workflow scheduler to manage Hadoop batch map reduce jobs run – eg, run this job on this schedule, with this interval etc. Other features, like it can trigger a map reduce job when necessary data appear

Apache Flume: collecting large amount of log data and providing data model for log analytics. Load log data into HDFS and process by map reduce.

Apache Scoop: tool for writing map reduce applications to transfer bulk data between Apache Hadoop and structured databases, like RDBMS (Oracle, for example)

Roles

• Installation
• Monitoring
• Tuning (users help to do that)

Users perform:

• Design and coding apps for certain objectives (admins help to do that)
• Import/export data
• Working with HADOOP ecosystem tools, connectors

Pricing

Since Y2006 HADOOP software is distributed under Apache Software Foundation, an American non-profit corporation formed as a decentralized open source community of developers. The software they produce is distributed under the terms of the Apache License and is free and open-source software. The idea of license is that no particular company controls the HADOOP.

ASF makes money from donations, sponsorship, running conferences,

HADOOP ecosystem vendors use different pricing models: service per node, per terabyte, subscription, hardware plus software deals, selling specialised connectors, selling cloud services and an interesting concept freemium –  free usage till a threshold limit.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.