Posts Tagged ‘Big Data’

Big Data: Phonetic Similarity: Soundex – words are similar if they sound the same


I guess you have seen surnames like Meier and Mayer, or Smith, Smyth, Smithe, Smeth, Smeeth. These might be correct as well as misspelt surnames – if you dictated them to me by phone, who knows what I'd write.

So far I have been blogging about similarity algorithms based on how strings are written. Today let's discuss finding a match when words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions, and today's topic is

Soundex for the English language.

Soundex calculates a four-character code from a word based upon its pronunciation, and considers two words similar if their codes are equal. The idea is that similar-sounding letters are assigned the same Soundex code. It is widely used in genealogy and archives when searching for ancestors, relatives, families, heirs. The rules:

  1. The first character is the starting letter of the word (a variation called “Reverse Soundex” uses the last letter instead of the first).
  2. Drop all other occurrences of a, e, i, o, u, y, h, w.
  3. Replace consonants after the first letter with digits as follows:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
    • If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, then omit all but the first.
    • Return the first four characters, padded with 0 (padding means filling the remainder with 0, so ‘Ahoi’ will have code A000 and ‘Dude’ D300 – the code is always four characters). A small Python sketch of these rules follows.
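Here is a minimal sketch of these rules in Python (my own illustration, not a reference implementation; it assumes plain ASCII input):

    def soundex(word):
        # Digit codes from the table above.
        codes = {}
        for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")]:
            for ch in letters:
                codes[ch] = digit
        word = word.lower()
        # Encode each letter: a digit for coded consonants, '' for the
        # transparent h/w, '-' for vowels and y (these break runs of digits).
        encoded = [codes.get(c, "" if c in "hw" else "-") for c in word]
        result = []
        prev = encoded[0]  # the first letter can block an adjacent equal digit
        for code in encoded[1:]:
            if code.isdigit() and code != prev:
                result.append(code)
            if code:  # h and w are transparent: they do not reset the run
                prev = code
        return (word[0].upper() + "".join(result) + "000")[:4]

    print(soundex("Meier"), soundex("Mayer"))  # M600 M600 – a phonetic match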

Let’s have an example set – surnames. I used Oracle RDBMS this time.

Soundex_table.PNG

Now let’s compare similarities by three methods: Edit distance, Jaro-WInkler, Soundex.

Soundex_select.PNG

And here are the results. Notice the combinations we have: if I set a similarity threshold for Edit distance or Jaro-Winkler at 50%, then we get several combinations, including false positives and false negatives:

  • all three methods match – like ‘Mirhe’
  • Jaro-Winkler and Soundex match, but Edit distance doesn’t – like ‘Meiyar’
  • Jaro-Winkler matches, but Edit distance and Soundex don’t – like ‘Mayes’
  • Edit distance and Jaro-Winkler match, but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’

Soundex_results.PNG
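If you would like to reproduce such a comparison outside Oracle, here is a small sketch assuming the third-party Python package jellyfish (function names as in its recent releases) and assuming ‘Meier’ as the reference surname:

    import jellyfish

    base = "Meier"
    for name in ["Mirhe", "Meiyar", "Mayes", "Mimre", "Mirfe", "Mire"]:
        print(name,
              jellyfish.levenshtein_distance(base, name),           # edit distance
              round(jellyfish.jaro_winkler_similarity(base, name), 2),
              jellyfish.soundex(name),
              jellyfish.soundex(name) == jellyfish.soundex(base))   # phonetic match?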

You see, Soundex is not a silver bullet and, as I have always been writing, we must try and test, test and try.

I’ll show you one more weakness of Soundex:

Scwarz.PNG

Of the three approaches I used, Soundex is the only one which did not find a similarity :)

Scwarz_res

Some Soundex variants

  • The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
  • Daitch–Mokotoff Soundex (D–M Soundex) is an adaptation for Jewish surnames of Germanic or Slavic origin, sometimes referred to as “Eastern European Soundex”. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999, and the calculation is much more complex than plain Soundex.
  • Metaphone, Double Metaphone, Metaphone 3 – a powerful and customisable rule set; it handles a subset of non-Latin characters and returns a primary and a secondary encoding to account for different pronunciations.

I googled an online Metaphone calculator; they say it’s more accurate than Soundex – I can hardly agree:

  • The metaphone code for Schwarzenegger is SXWRSNKR.
  • The metaphone code for Schvartzeneger is SXFRTSNJR.
  • These surnames do not have the same metaphone code.

Then I tried it on one of my Soundex similarities and – again:

  • The metaphone code for Meiyar is MYR.
  • The metaphone code for Mire is MR.
  • These surnames do not have the same metaphone code.

I was also searching for a Latvian edition of Soundex – I am quite sure it exists. I found this: http://www.lzp.gov.lv/images/stories/dokumenti/Zin_rezult_2008.pdf

In 2008, a flexible database model for a universal lexicon system was developed and refined, providing a common infrastructure (shared indexing and feedback mechanisms, etc.) and functionality (advanced entry search, configurable display, etc.) to all dictionaries placed in the database, regardless of their entry schemas. Regarding indexing and search, the Soundex algorithm was adapted to the Latvian language to enable finding words that were imprecisely entered but are similar in pronunciation. (A. Spektors; translated from Latvian)

P.S. For those who also read Latvian – here is a small and lovely forum post where someone tries to build a search-suggestion tool (“did you mean XXYZZX?”): https://exs.lv/say/16261/1441002-so-nakti-pavadiju-veidojot

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.


Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words


Some days ago I covered the topic of finding which two of these strings are most likely about the same real-world entity – by recognizing frequently used words and assigning them lower impact; thus the first and third options were found most similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

Let’s add a challenge: misspell the word

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Aple

And, as you might imagine, the TF/IDF measure is not effective anymore, because it cannot recognize that Apple is similar to Aple: the classic implementation looks for equality, not similarity. Today, similar to what we did in generalised Jaccard, we will replace the requirement for words to be equal with a requirement for them to be similar.

As I did before with TF/IDF, let’s remove commas and brackets and tokenize our collection into bags of terms – documents x, y, z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). Today I have chosen Needleman-Wunsch and Jaro-Winkler to illustrate the differences, e.g., s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro-Winkler. I used the calculator https://asecuritysite.com/forensics/simstring

Let’s choose similarity (it is my choice, it could be any other value 0..1)

  • threshold k = 0.55 for Needleman-Wunsch measure
  • threshold k = 0.45 for Jaro-Winkler measure

soft_idf_scores.PNG

Now we will find those terms which have a similar term in the other document.

Soft TF/IDF based on Needleman-Wunsch

The initial steps are the same as for classic TF/IDF. Calculate:

Term Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see, we will have six-dimensional vectors)

TF(apple,x) = 1 (document x contains apple once)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 0 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1

Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

Feature vectors (Feature = TF*IDF)

softIDF-unnorm.PNG

Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1: divide the coordinates by the length of the vector.

length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24

Normalised vectors for documents (each coordinate divided by vector’s length):

softIDF-norm.PNG
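As an illustration, here is a minimal Python sketch of these two steps (using the plain ratio N/df as IDF, exactly as in this post – no logarithm; my own illustration, not a library API):

    from collections import Counter
    from math import sqrt

    docs = {"x": ["apple", "corporation", "usa"],
            "y": ["ibm", "usa", "corporation"],
            "z": ["aple", "corp"]}
    terms = sorted({t for doc in docs.values() for t in doc})

    def tfidf_vector(doc):
        tf = Counter(doc)
        # IDF as a plain ratio: number of documents / documents containing t
        vec = [tf[t] * len(docs) / sum(t in d for d in docs.values())
               for t in terms]
        length = sqrt(sum(v * v for v in vec))  # normalise to unit length
        return {t: v / length for t, v in zip(terms, vec)}

    vectors = {name: tfidf_vector(doc) for name, doc in docs.items()}
    print(vectors["x"])  # x: apple = 0.82, corporation = 0.41, usa = 0.41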

Now the Needleman-Wunsch scores come into the game.

We compute the close terms.

Close(x,y,0.55): apple is not included, because although corporation in document y is similar to it by 0.55, there is another word in x – corporation – which bonds to it with 1 and thus blocks apple’s bond; corporation is included (it has the strongest bond with corporation in document y); usa is included (it has the strongest bond with usa in document y).

SoftIDFxy.png

  • close(x,y,0.55) = {corporation,usa}
  • close(x,z,0.55) = {apple}
  • close(y,z,0.55) = {} (no pair passed the threshold)

SoftIDFyz.png

In the final step we compute the features, but give a weight to each component of the TF/IDF formula. We are looking for the closest vectors: the narrower the angle, the larger its cosine. Thus we have to calculate the cosine of the angle between the vectors and pick the largest one. As our vectors are normalised, the cosine formula reduces to computing the dot (scalar) product. (A small Python sketch follows the calculations below.)

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
  • similarity(y,z) = 0% (remember, no word pair – including corp and corporation – passed our threshold)
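A sketch of this matching-and-weighting step, reusing docs and vectors from the sketch above (sim stands for any 0..1 string similarity function of your choice, e.g. your own Needleman-Wunsch or Jaro-Winkler implementation – an assumption, not a fixed API):

    def soft_tfidf_similarity(d1, d2, sim, threshold):
        # Score all cross-document term pairs, then greedily keep only the
        # strongest one-to-one bonds (the "blocking" described above).
        pairs = sorted(((sim(a, b), a, b) for a in docs[d1] for b in docs[d2]),
                       reverse=True)
        used1, used2, total = set(), set(), 0.0
        for score, a, b in pairs:
            if score < threshold:
                break
            if a in used1 or b in used2:
                continue
            used1.add(a)
            used2.add(b)
            total += vectors[d1][a] * vectors[d2][b] * score
        return total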

Voila! By Needleman-Wunsch the most similar strings are

  • Apple Corporation, USA
  • Corp. Aple

Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I have set it differently from Needleman-Wunsch just for more fun, to learn the differences better).

See in the pictures how we get to the closest results by keeping the strongest bonds:

SoftIDFJaroxy.png

SoftIDFJaroxz.jpg

SoftIDFJaroyz.jpg

  • close(x,y,0.45) = {corporation,usa}
  • close(x,z,0.45) = {apple,corporation}
  • close(y,z,0.45) = {corporation}

Now let’s calculate Features from normalized vectors (the same vectors we calculated)

softIDF-norm

As these vectors are normalised, I’ll remind you that the cosine formula reduces to computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
  • similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%

Voila! By Jaro-Winkler the most similar strings again are

  • Apple Corporation, USA
  • Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.

Big Data: combining string and set matching methods. One of hybrid similarity measures – generalised Jaccard index


It’s time to start to combine string and set matching methods – let’s have a look at one of Hybrid Similarity measures.

The classic Jaccard measure considers overlapping tokens (words, q-grams). To be considered overlapping, tokens must be identical. The Jaccard index works very well when names are in mixed order, like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However, pure Jaccard is too restrictive when the text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some of the names were written with errors?

The generalized Jaccard measure helps.

First of all, as usual, we convert the comparable strings into tokens. I’ll reuse the example and put some errors in the names.

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll learn also soft overlap – using the most matching pairs.

First step. Compare each pair

To compare, we need a similarity measure s which returns values 0..1 (the closer to 1, the more similar).

For more fun let’s apply two for the same pairs- Edit distance (Levenshtein distance) and Jaro-Winkler measure – see, the result differs? :) I used https://asecuritysite.com/forensics/simstring (sorry, this page has a bug in Jaro-Winkler algorithm – because it is not true (janis,martins) has 0 by JW (it should be 0.67) – but I could not find any other online calculator and for our experiment this bug is acceptable and let’s use it as an example how easy is to misuse method when we simply believe to somebody’s programmed result without understanding)

Sim_scores.PNG

Second step.

Choose a threshold and keep only the pairs that exceed it. I have chosen threshold 0.5.

Threshold for Edit Distance

Threshold_edit_distance.PNG

Threshold for Jaro-Winkler

Threshold_Jaro_Winkler.PNG

Third step. Keep only the strongest bond.

To find them, I first draw all the above-threshold bonds.

Edit_distance_bonds

Martins and Martins are of course bonded. It means no other bonds are possible from or to them.

Karlina to Karloina has the next strongest remaining bond. Again, no other bonds from/to them.

We are left with Anna to Annija, because all other bonds relate to “engaged” entities. For example, Kartina cannot bond to Karloina with 0.71, because Karlina already bonded to it with 0.88.

We then calculate the matching weight by adding up all the match scores (2.55) and dividing by (all names in class X plus all names in class Y minus the matching pairs): 2.55 / (6 + 5 − 3) = 0.319 = 32% similarity when we hybridise Jaccard with Edit distance. A Python sketch of the whole procedure follows the picture below.

Edit_distance_bond.jpg
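Here is that sketch (greedy strongest-bond matching; sim is any 0..1 similarity function of your choice – a placeholder, not a fixed API):

    def generalized_jaccard(set1, set2, sim, threshold):
        # Score all cross-set pairs, then greedily keep only the strongest
        # one-to-one bonds above the threshold.
        pairs = sorted(((sim(a, b), a, b) for a in set1 for b in set2),
                       reverse=True)
        used1, used2, scores = set(), set(), []
        for score, a, b in pairs:
            if score < threshold:
                break
            if a not in used1 and b not in used2:
                used1.add(a)
                used2.add(b)
                scores.append(score)
        # sum of kept scores / (|set1| + |set2| - number of matched pairs)
        return sum(scores) / (len(set1) + len(set2) - len(scores))

    x = ["Kartina", "Janis", "Karlis", "Martins", "Anna", "Karlina"]
    y = ["Annija", "Martins", "Matra", "Karloina", "Ance"]
    # with the edit-distance scores of this post: 2.55 / (6 + 5 - 3) = 0.32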

Now let’s do the same for Jaro-Winkler. First of all, all bonds upper than threshold:

Jaro-bonds.jpg

and keep only the strongest bonds. Again you see, for example, that Kartina cannot bond with 0.88 to Martins, because Martins bonded to Martins with 1. Kartina also cannot bond with 0.91 to Karloina, because Karlina bonded to Karloina with 0.98.

And the formula again – the matching weight divided by the sum of entities minus matching pairs – and voila! = 0.707 = 71% similarity when we hybridise Jaccard with the Jaro-Winkler measure.

Jaro_bond

I’ll remind that in my previous blog entry explaining Jaccard measure I showed that

  • similarity with correctly spelled names and requirement for name equality was 10%
  • similarity using bigrams was 50%
  • similarity with trigrams was 26%

Today we calculated (on a slightly modified set – errors added)

  • similarity with hybrid method with Edit distance was 32%
  • similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.

Big Data: set similarity : q-grams, Overlap measure, Jaccard index, Jaccard distance


Today’s topic: is John Smith similar to Smith John? As you see, Edit distance 10 will measure as not similar, so as Needleman-Wunsch (score = – 5). Smith Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find out which class has the names most similar to your class? He answered: well, we could take name by name, calculate the Edit distance score for each, and then sum the results together?

Let’s have a look closer on some official methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start having names as our tokens. Set similarity operates with tokens, not “words” because words sound like are something meaningful for us. Tokens are… just tokens :) anything we have agree how to split the set – you’ll see several approaches.

Simple: if we use a word as a token, then the string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that during tokenisation we remove common stop words like “of, the, and” or “Mr, Ms, fon, da” etc.

Then string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

q-grams or n-grams

This impressive name just means that we split the string into substrings of length q. I’ll use David and show you the two approaches used – which approach you choose later depends on your needs and the further usage of the results.

q=1 (unigram):

  • {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

  • first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad for the length)
  • second approach: {da,av,vi,id}

q=3 (trigram):

  • {##d,#da,dav,avi,vid,id#,d##}
  • {dav,avi,vid}

q=4 (four-gram):

  • {###d,##da,#dav,davi,avid,vid#,id##,d###}
  • {davi,avid}

q=5 (five-gram):

  • {####d,###da,##dav,#davi,david,avid#,vid##,id###,d####}
  • {david}
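Both approaches take only a few lines in any language; here is a sketch in Python (my illustration – the screenshots in this post use R):

    def qgrams(s, q, pad=False):
        # pad=True gives the first approach (# characters at the edges),
        # pad=False the second one (inner q-grams only)
        if pad and q > 1:
            s = "#" * (q - 1) + s + "#" * (q - 1)
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    print(qgrams("david", 2))            # ['da', 'av', 'vi', 'id']
    print(qgrams("david", 2, pad=True))  # ['#d', 'da', 'av', 'vi', 'id', 'd#']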

I used R for some tests – see, it does the second approach:

R_qgrams

Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects the language by comparing q-grams to the statistical profiles of languages. This blog entry of mine could be split into bigrams and would have the result “English” returned.

Q-gramming may also be applied at word level.

The quick brown fox jumps over a lazy dog.

bigram:

  • {# the,the quick,quick brown,brown fox,fox jumps,jumps over,over a,a lazy,lazy dog,dog #}
  • {the quick,quick brown,brown fox,fox jumps,jumps over,over a,a lazy,lazy dog}

trigram

  • {# # the,# the quick,the quick brown,quick brown fox,brown fox jumps,fox jumps over,jumps over a,over a lazy,a lazy dog,lazy dog #,dog # #}

etc.

Note it is a analyst’s (my, yours) decision if the method used requires some data cleaning like removal of punctuation, whitespaces, letter de-capitalising etc.

The overlap measure

Let’s do some simply examples for learning the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation.

Let S1 = set of tokens of string 1.

  • {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

  • {#c,ca,ar,ro,ol,l#}

Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation.

  • {##c,#ca,car,arl,rlo,lo#,o##}
  • {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

The Jaccard index a.k.a coefficient a.k.a measure

Well, you saw the Overlap measure. What is it? It actually has no value by itself, without context. We need to make it usable. Let’s have an index that is always 0..1 – the closer to 1, the higher the set similarity. I’ll also multiply by 100 to get percents.

Jaccard(S1,S2) = count of common tokens  in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing a bit more scientifically:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union are all unique, while union all are all.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)

Jaccard

String 1 Carlo, string 2 Carol.

If I use bigrams for tokenisation:

  • S1 = {#c,ca,ar,rl,lo,o#}
  • S2 = {#c,ca,ar,ro,ol,l#}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%
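Continuing the Python sketch from the q-gram section (qgrams as defined there; sets collapse duplicate q-grams, which is fine for these examples):

    def jaccard(s1, s2, q):
        g1 = set(qgrams(s1, q, pad=True))
        g2 = set(qgrams(s2, q, pad=True))
        return len(g1 & g2) / len(g1 | g2)  # overlap / count of unique tokens

    print(jaccard("carlo", "carol", 2))  # 3/9  = 0.33
    print(jaccard("carlo", "carol", 3))  # 3/11 = 0.27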

Visualisation from http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

Jaccard_online

If I use trigrams for tokenisation:

  • S1 = {##c,#ca,car,arl,rlo,lo#,o##}
  • S2 = {##c,#ca,car,aro,rol,ol#,l##}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

Interpreting the result:

  • Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
  • If they have no common tokens, they are 0% similar.
  • A half – 50% – means that the common tokens make up half of all unique tokens in the union.

The Jaccard distance

A measure of how different or dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or the percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams have a Jaccard distance of 100% – 27% = 73%.

So what?

Here you might ask: hey, but you showed they are not usable at all! Carlo IS similar to Carol!

Let me remind you that I was showing the idea on a very simplified example, and we are talking about set similarity, not string similarity.

To treat them as strings, let’s use Jaccard for Carlo and Carol with q=1:

  • {c,a,r,l,o}
  • {c,a,r,o,l}

Voila! Jaccard measure = 1 (100%), Jaccard distance = 0%.

Because THE SETS OF SYMBOLS are equal!

Jaccard_online2.PNG

Now back to sets and names in classes.

Tokenisation method 1 – names are tokens:

S1={Katrina,Janis,Karlis,Martins,Anna,Karlina}

S2={Annija,Martins,Marta,Karolina,Ance}

O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: not similarity, but dissimilarity)

Conclusion: by method 1 names in these classes are not similar.

And now let’s calculate the same using bigrams :)

Tokenisation method 2 – bigrams are tokens:

I used R for splitting into bigrams

R-trigrams.PNG

and Excel for visualisation

Bigrams.PNG

Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2 names in these classes are half similar.

Tokenisation method 3 – trigrams are tokens:

Trigrams.PNG

Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3, the names in these classes are only slightly similar.

Have you got the idea? Now you can apply the same method to different classes and have a reliable way to measure similarity.

Which one to choose?

The method which suits YOU best. I can’t predict whether you will consider Marta and Martins similar or dissimilar in YOUR business case.

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.

Big Data: string similarity: dealing with typos (Jaro meausre, ups, measure)


Have you evre had a typo? How to instruct the computer that Luxmeb most likely meant Luxembourg? (Try to write it in Google and, see, it knows! :) Let’s have a look – how.

For this type of short strings, like single words, the Jaro measrue, ups, Jaro measure (or Jaro distance, or Jaro score) helps.

The higher the Jaro distance of two words, the more similar these words are. A score of 0 means no similarity and 1 means an exact match. Sometimes the result is used in percents; I’ll show both.

The Jaro measure idea is the assumption that

if one word contains nearly the same letters as the other, and these letters can quite easily be reordered

to match – like length and lenght – or teh/the, dacnign/dancing

– then the words are similar.

Values used in the calculation of the Jaro Distance:

  1. Length of words being compared
  2. Count of common characters
  3. Count of transposition (exchange) operations needed to reorder the common characters into equal order

The official formula looks complex, as scientists usually make them to impress us:

Jaro1

(source: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)

It is really easy, let’s start with Jon and John.

  • s1 = Jon length 3
  • s2 = John length 4
  • m = 3 (three common characters J, o, n)
  • t = 0 (no symbol transpositions needed Jon to John, because matching symbols are already in the same order)

jaro(Jon,John) = 1/3 x (3 matching symbols / 3 length of Jon + 3 matching symbols / 4 length of John + (3 matching symbols – 0 transpositions) / 3 matching symbols) = 1/3 x (3/3 + 3/4 + 3/3) = 1/3 x (1 + 0.75 + 1) = 1/3 x 2.75 = 0.917 (quite similar) or 92%

The Edit distance of Jon, John = 1. The Levenshtein distance measure is 1 – d(x,y) / max(length(x), length(y)): s(Jon,John) = 1 – 1 / max(3, 4) = 1 – 1/4 = 0.75 or 75% similarity.

You see, Jaro score 92% is better than Levenshtein’s 75%.

and also better than

  • Needleman-Wunsch 75% (scoring 1;-1;-1) (three matches 1×3 + one gap penalty -1 = 2)
  • Smith-Waterman 83% (Jo and Jo, n and n are the best matching substrings)

(calculator to play – https://asecuritysite.com/forensics/simstring)
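If you want to check such scores yourself, here is a textbook Jaro sketch in Python (note: it enforces the official matching window of max(len)/2 − 1 positions, which my simplified hand counts below ignore for readability):

    def jaro(s1, s2):
        if not s1 or not s2:
            return 0.0
        window = max(0, max(len(s1), len(s2)) // 2 - 1)
        used = [False] * len(s2)
        matched1 = []
        for i, c in enumerate(s1):
            # a character only counts as common within the matching window
            for j in range(max(0, i - window), min(len(s2), i + window + 1)):
                if not used[j] and s2[j] == c:
                    used[j] = True
                    matched1.append(c)
                    break
        m = len(matched1)
        if m == 0:
            return 0.0
        matched2 = [c for j, c in enumerate(s2) if used[j]]
        t = sum(a != b for a, b in zip(matched1, matched2)) / 2  # transpositions
        return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

    print(round(jaro("jon", "john"), 3))  # 0.917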

Let’s calculate Jaro now for Jon and Ojhn

  • s1 = Jon length 3
  • s2 = Ojhn length 4
  • m = 3 (three common characters J, o, n)
  • t = 1 (transpositions needed to reorder the common symbols: ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-1)/3) = 0.81 or 81% similarity

Jaro for Jon and Ohnj (I reordered symbols)

  • s1 = Jon length 3
  • s2 = Ohnj length 4
  • m = 3 (three common characters J, o, n)
  • t = 2 (transpositions needed to reorder the common symbols: onj – ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-2)/3) = 0.70 or 70% similarity. (Strictly speaking, the windowed textbook definition would not count J as a common character here at all – it sits more than max(3,4)/2 − 1 = 1 position away – so the official score would differ; I keep the simplified count for illustration.)

The Levenshtein distance for JON, OHNJ is 3, so the measure is 1 – 3/4 = 0.25 or 25% similarity. See the difference?

Jaro for Luxmeb and Luxembourg

  • s1 = Luxmeb length 6
  • s2 = Luxembourg length 10
  • m = 6 (six common characters L,u,x,e,m,b)
  • t = 1 (transpositions needed to reorder the common symbols: Luxmeb – Luxemb)

Jaro = 1 / 3 x (6/6 + 6/10 + (6-1)/6) = 0.81 is 81% similarity

The Edit distance result is 1 – 6 / max(6,10) = 1 – 6/10 = 0.4, i.e. 40% similarity.

A variation to handle prefixes is the Jaro-Winkler measure.

In real life it is a very common situation that strings have the same beginning (I bleieve you all konw tihs jkoe). Like beginnign, Luxmebogr, lenhgt, meausre. To take that into account and improve the scoring, there is a measure:

Jaro2.PNG

  • dw is Jaro-Winkler measure result
  • dj is Jaro distance – like we calculated before
  • l is the length of the longest common exact matching prefix (usually capped at a maximum of 4)
  • p is the weight given to the prefix (how important it is for us); the default tends to be 0.1

Another way to express the same formula is

  • jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW

where

  • PL = length of the longest common prefix
  • PW is a weight given to the prefix

Let’s calculate Jaro-Winkler for Luxmebogr and Luxembourg.

  • I’ll assume that prefix weight is recomended 0.1
  • Jaro distance (s1=9, s2=10, m=9, t=1) = 0.91 (91%)
  • Longest common exactly matching prefix is Lux

Jaro-Winkler distance is

  • 0.91 + (3 × 0.1 × (1 – 0.91)) = 0.94 or 94% similarity (first formula)
  • (1 – 3 × 0.1) × 0.91 + 3 × 0.1 = 0.94 or 94% similarity (second formula – of course the same, because the formulas are equal, just written in another way)

Jaro-Winkler for Luxmeb and Luxembourg

Jaro was 0.81 or 81% similarity

0.81 + (3 × 0.1 × (1 – 0.81)) = 0.87 or 87% similarity
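And the corresponding few lines on top of the jaro sketch from above:

    def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
        d = jaro(s1, s2)
        l = 0  # length of the common prefix, capped at max_prefix
        for a, b in zip(s1, s2):
            if a != b or l == max_prefix:
                break
            l += 1
        return d + l * p * (1 - d)

    # e.g. jaro_winkler("luxmeb", "luxembourg") boosts the plain Jaro score
    # thanks to the common prefix "lux"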

Can’t believe it is so easy, can you?

There are many more string similarity measures and algorithms in the world, with different running times, different precision, and a lot of adaptations for specialised cases like nucleotides.

However, if you have read all my string similarity posts, you should have a good foundation of the basic ones.

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.

Big Data: string similarity: best matching substrings between two strings (Smith-Waterman algorithm)


The previous methods I wrote about matched all characters of one string against all characters of the other (when the entire string is aligned, it is called global alignment). As you see from our new challenge, this is not suitable in some cases:

can we somehow find out that Mr. Alexander Fitzgerald, University of Canterbury is similar to Professor, Dr.sc.comp Alexander Williams Fitzgerald from Canterbury, New Zealand?

Edit distance = 54 (https://planetcalc.com/1721/) – a very large distance.

Needleman-Wunsch: −32 (http://bioinformaticnotes.com/Needle-Water/) – similarity quite low.

A better idea: find the two substrings of the given strings that are the most similar to one another

(when parts of strings are compared, it is called local alignment). We can already guess the best matching part will be ‘Fitzgerald’, but the computer must be provided with logic. The method I am blogging about was first proposed by Temple F. Smith and Michael S. Waterman in 1981.

The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are reset to zero, so a match can now be restarted at any position, not only at the very corner. It means that if you learn to calculate Needleman-Wunsch, it is easy to adapt to Smith-Waterman. Just stay positive :)

The idea is simple, yet brilliant:

  1. add one more condition to Needleman-Wunsch: the lowest possible cell value is 0, so the result will never be negative:

Smith-Waterman4.PNG

2. when backtracking for the alignment, just find the highest value and go up until you reach 0. Repeat, searching for the next highest, as many times as you want.

  • Best match is AWGHE to AW_HE
  • Next best: HEA to HEA

Smith-Waterman5.PNG

(source http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf)
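A compact Python sketch of both steps (my illustration; the match/mismatch/gap scores are parameters you would tune):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        # Fill the matrix like Needleman-Wunsch, but floor every cell at 0.
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        # Backtrack from the maximum until a 0 cell to recover the substrings.
        i, j = best_pos
        sub_a, sub_b = [], []
        while i > 0 and j > 0 and H[i][j] > 0:
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            if H[i][j] == diag:
                sub_a.append(a[i-1]); sub_b.append(b[j-1]); i -= 1; j -= 1
            elif H[i][j] == H[i-1][j] + gap:
                sub_a.append(a[i-1]); sub_b.append("_"); i -= 1
            else:
                sub_a.append("_"); sub_b.append(b[j-1]); j -= 1
        return best, "".join(reversed(sub_a)), "".join(reversed(sub_b))

    print(smith_waterman("DUDE WILL SMITH", "PROFESSOR WILLIAM SMITH"))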

Let’s look on a sample trying to fit the picture in the screen:

  • PROFESSOR WILLIAM SMITH DR.SC.COMP CANTERBURY
  • DUDE WILL SMITH NEIGHBOUR

I believe YOU see that the best matching substrings are WILL and SMITH. But the computer doesn’t have the brains you have. Using the Smith-Waterman measure, the computer can only calculate it by the predefined algorithm.

You probably cannot see it there, but in this picture the SMITH to SMITH max result near the bottom-right is 18, and WILL to WILL is 10.

Smith-Waterman.PNG

(source: http://bioinformaticnotes.com/Needle-Water/)

Here comes the full table of the original example – ALEXANDER, FITZGERALD, CANTERBURY:

Smith-Waterman2.PNG

Let’s add some foreign language for fun – you’ll see that also Smith  and Waterman were not magicians, however we can do a good guess with their algorithm:

  • DR.ALEKSANDRS VILJAMS FICDŽERALDS KENTERBERIJA
  • MR. ALEXANDER FITZGERALD CANTERBURY

and we see most similar substrings:

  • KENTERBERI-CANTERBURY
  • ERALDSKENTERB-ERALD_CANTERB
  • FI-FI
  • ALEKSANDRS VI – ALEX_ANDER FI

Smith-Waterman3

When you google, you’ll see a lot of bioinformatics and related samples TGCTACCGTAA….. I prefer a bit more human readable examples.

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.

Big Data: string similarity: mind the Gap. The Affine Gap.


In my previous blog posts we have learned that

  • Crts is similar to Cats – so our autocorrect software might choose this word from the vocabulary to correct Crts (Levenshtein distance)
  • Davod1ee is similar to DavidLee – so we could make a good guess that this was a mention of our professor David Lee (Needleman-Wunsch measure)

Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?

Let’s start with looking why classics is not enough:

  • EDIT DISTANCE (Levenshtein distance) = 16 (mammamia! Who said they are similar?)

Affine-0.PNG

(http://www.convertforfree.com/levenshtein-distance-calculator/)

  • Needleman-Wunsch measure = −6 (scoring match = 2; mismatch = −2; gap = −2) – mnja. PICADILLA-CROCODILE has a better NW measure, −4; are they really more likely to be similar?

Affine-1.PNG

(http://experiments.mostafa.io/public/needleman-wunsch/)

The Affine Gap measure solves this problem – continuing a gap has a smaller penalty than opening one, and we can even arrange that the longer a gap continues, the smaller the penalty becomes.

Exactly what we need in our ALEX – ALEXANDER and JSO – JOHANSSON case. This method is an extension of the Needleman-Wunsch measure for handling longer gaps.

In Needleman-Wunsch there were match, mismatch and gap penalty scores involved; each gap symbol was penalised equally, and the total cost of longer gaps was too high:

ALEX to ALEXANDER is 5 x gap penalty = 5 x −2 = penalty −10; J to JOHAN is 4 x −2 = penalty −8 for this gap; and there are two more penalties for the other gaps, S to SS and O to ON, so the total penalty for all gaps was 11 x −2 = −22!

In real life it is a very common situation that data have gaps longer than one character, and the affine gap algorithm distinguishes

  • cost of opening the gap of any length (applied to the first character of gap, like A)
  • cost of continuing the gap when there is more than one _ (empty space) in a row (applied to remaining characters of a gap like NDER)

This algorithm is called an extension because it keeps the match and mismatch scoring idea from Needleman-Wunsch and adds more sophisticated rules for gaps.

NB: the affine gap algorithm result will consider this alignment better

Affine-2

than

Affine-3.PNG

because the second option opens more gaps (remember, opening a gap has a higher penalty).

For example:

  • gap open score = -3
  • gap extension score = -1

ALEX to ALEXANDER is 1 x gap opening penalty + 4 x gap continuing penalty = 1 x −3 + 4 x −1 = −7; J to JOHAN is 1 x −3 + 3 x −1 = penalty −6 for this gap; and there is one more penalty, −3, for SO to SON. So the total penalty for all gaps is −7 + −6 + −3 = −16 (see, even with a greater score for opening a gap, it is better than the −22 we got when all gap symbols were penalised equally).

If we’d use the score -2 for opening and -1 for continuing, we even have -13, near twice as good as -22.

This topic is hard to explain in detail – better to try to understand the business behind it and, when needed, just follow the formulas :) If I had done the full calculations, I would for more fun have considered a scoring where

  • a mismatch (changing one symbol to another) costs by default 2
  • a mismatch changing L to E or E to L costs 1 (less, because they look similar – I could add any score for any pair of symbols, e.g. give a bonus to O and Q, or punish Q and W)
  • a match of symbols scores 5 (it is the most preferable situation for me).

Have a look at the formulas used – there are three matrices involved:

Affine-formula

(source: http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf)

Affine-formula2.PNG

(source: https://www.cs.umd.edu/class/fall2010/cmsc423/lectures/Lec04-gaps.pdf)

Affine-formula3.PNG

(source: https://users.info.uvt.ro/~dzaharie/bioinfo2016/proiecte/biblio/MultipleAlignment%2BDP/MultipleAlignment%2BDP.pdf)
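In compact form, the standard three-matrix recurrences behind those slides look like this (my transcription: M tracks alignments ending in a match/mismatch, Ix and Iy alignments ending in a gap in one or the other string; s is the match/mismatch score, g_open and g_extend the gap costs):

    M(i,j)  = s(x_i, y_j) + max( M(i-1,j-1), Ix(i-1,j-1), Iy(i-1,j-1) )
    Ix(i,j) = max( M(i-1,j) + g_open, Ix(i-1,j) + g_extend )
    Iy(i,j) = max( M(i,j-1) + g_open, Iy(i,j-1) + g_extend )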

When you understand the business behind them, the formulas will not seem so complex: a lot of combinations are possible when we distinguish gap opening and gap extension costs.

a symbol can match a symbol in the other string and not open or close any gaps

A

A

a symbol can mismatch a symbol in the other string without starting a gap in either string,

because of the scoring system used (e.g. we can set that O is the same as 0, or that an L to E mismatch is not as good as a match but at the same time better than starting a gap)

AO

A0

a symbol in one string can start a gap in the other string

AB

A

a symbol in one string can continue a gap in the other string

ABC

A – 

a symbol can close a gap in its own string without starting a gap in the other string

Q – – D

QABD

a symbol can close a gap in its own string and start a gap in the other string

– – M

AB – 

NB: we are not considering gap-to-gap alignments, thus a situation like this is not possible:

symbol P closes the gap in one string and continues gap in the other – because gap to gap is not possible

– – P

– – –

or “symbol K continues a gap in both strings” (a position cannot be both a symbol and a gap).

Some paragraphs above I wrote that the affine gap prefers SSON to -SO-, not to S-O- – this is achieved by always performing the calculations for both strings and both operations:

we calculate not only the score of opening a gap, but also the score of ending it.

Then, when backtracking the optimal path, we walk through the SSON to -SO- path.

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.
