Big Data: Phonetic Similarity : Soundex – words are similar if they sound the same


I am sure you have seen surnames like Meier and Mayer, or Smith, Smyth, Smithe, Smeth, Smeeth. These might be correct as well as misspelt surnames – if you dictated them to me by phone, who knows what I’d write.

So far I have been blogging about similarity algorithms based on how strings are written. Today let’s discuss finding a match when the words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions, and today’s topic is

Soundex for the English language.

Soundex calculates a four-character code from a word based upon its pronunciation and considers two words similar if their codes are equal. The idea is that similar-sounding letters are assigned the same Soundex code. It is widely used in genealogy and archives for searching ancestors, relatives, families and heirs.

  1. The first character is the starting letter of the word. (A variation called “Reverse Soundex” uses the last letter instead of the first.)
  2. Drop all other occurrences of a, e, i, o, u, y, h, w.
  3. Replace consonants after the first letter with digits as follows:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
    • If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, then omit all but the first.
    • Return the first four characters, padded with 0 (padding means filling the missing places with 0: ‘Ahoi’ will have code A000, ‘Dude’ D300 – the code is always four characters). See the sketch below.
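
Here is a minimal sketch of these steps in Python – my own illustrative implementation (the actual results below come from Oracle):

# A minimal Soundex sketch, following the steps above (illustrative only).
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    word = word.lower()
    result = word[0].upper()          # step 1: keep the starting letter
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch)
        if digit:                     # step 3: consonant becomes a digit
            if digit != prev:         # adjacent equal digits: keep the first
                result += digit
            prev = digit
        elif ch in "aeiouy":          # step 2: vowels are dropped,
            prev = ""                 # but they do break a duplicate run
        # h and w are dropped and do NOT break the run
    return (result + "000")[:4]       # pad with 0 to four characters

print(soundex("Ahoi"), soundex("Dude"))          # A000 D300
print(soundex("Meier") == soundex("Mayer"))      # True (both M600)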

Let’s have an example set – surnames. I used Oracle RDBMS this time.

Soundex_table.PNG

Now let’s compare similarities by three methods: Edit distance, Jaro-Winkler, Soundex.

Soundex_select.PNG

And here are the results. Notice the combinations we get: if I set a similarity threshold of 50% for Edit distance or Jaro-Winkler, we have several combinations, including false positives and false negatives:

  • all three methods match – like ‘Mirhe’
  • Jaro-Winkler and soundex match, but Edit distance doesn’t – like ‘Meiyar’
  • only Jaro-Winkler matches – like ‘Mayes’
  • Edit distance and Jaro-Winkler match but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’

Soundex_results.PNG

You see, Soundex is not a silver bullet and, as I have always been writing, we must try and test, test and try.

I’ll show you one more weakness of Soundex:

Scwarz.PNG

Of the three approaches I used, Soundex is the only one which did not find the similarity :)

Scwarz_res

Some Soundex variants

  • The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
  • Daitch–Mokotoff Soundex (D–M Soundex) is an adaptation for Jewish surnames of Germanic or Slavic origin, sometimes referred to as “Eastern European Soundex”. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999; the calculation is much more complex than Soundex.
  • Metaphone, Double Metaphone, Metaphone 3. A powerful and customisable rule set; it handles a subset of non-Latin characters and returns a primary and a secondary encoding to account for different pronunciations.

I googled an online Metaphone calculator; they say it’s more accurate than Soundex – I can hardly agree:

  • The metaphone code for Schwarzenegger is SXWRSNKR.
  • The metaphone code for Schvartzeneger is SXFRTSNJR.
  • These surnames do not have the same metaphone code.

Then I tried one of my Soundex similarities and – again:

  • The metaphone code for Meiyar is MYR.
  • The metaphone code for Mire is MR.
  • These surnames do not have the same metaphone code.

I was also searching for Soundex Latvian edition – I am quite sure it exists. I found this: http://www.lzp.gov.lv/images/stories/dokumenti/Zin_rezult_2008.pdf

In 2008, a flexible database model for a universal lexicon system was developed and refined, providing a common infrastructure (shared indexing and feedback mechanisms, etc.) and common functionality (extended entry search, configurable display, etc.) to all dictionaries stored in the database, regardless of their entry schemas. Regarding indexing and search, the Soundex algorithm was adapted to the Latvian language to enable finding imprecisely entered words that are similar in pronunciation. (A. Spektors)

P.S. For those who also read Latvian – here is a small and sweet forum post where someone tries to build a search-suggestion tool (“did you mean XXYZZX?”): https://exs.lv/say/16261/1441002-so-nakti-pavadiju-veidojot

Disclaimer

This blog is solely my personal reflections.
Any link I share and anything I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.


Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words


Some days ago I covered finding which two of these strings are most likely about the same real-world entity – by recognizing frequently used words and assigning them lower impact, so the first and third options were found to be the most similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

Let’s add a challenge: misspell a word.

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Aple

And, as you might imagine, the TF/IDF measure is not effective anymore, because it cannot recognize that Apple is similar to Aple – the classic implementation looks for equality, not similarity. Today, similarly to what we did in generalised Jaccard, we will replace the requirement for words to be equal with a requirement for them to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). Today I have chosen Needleman-Wunsch and Jaro-Winkler to illustrate the differences, e.g. s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro-Winkler. I used the calculator at https://asecuritysite.com/forensics/simstring

Let’s choose similarity thresholds (it is my choice; they could be any other value in 0..1):

  • threshold k = 0.55 for Needleman-Wunsch measure
  • threshold k = 0.45 for Jaro-Winkler measure

soft_idf_scores.PNG

Now we will find those terms which have a similar term in the other document.

Soft TF/IDF based on Needleman-Wunsch

The initial steps are the same as for classic TF/IDF. Calculate:

Term Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see, we will have six-dimensional vectors)

TF(apple,x) = 1 (document x contains apple once)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 0 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1

Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

Feature vectors (Feature = TF*IDF)

softIDF-unnorm.PNG

Let’s normalise these vectors. Remember from vector algebra: it means getting the unit vector of length 1 – divide each coordinate by the length of the vector.

length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24

Normalised vectors for documents (each coordinate divided by vector’s length):

softIDF-norm.PNG
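
For the record, a tiny sketch of this normalisation in Python (the dimension order apple, corporation, usa, ibm, aple, corp is my assumption for illustration):

import math

# Normalising the feature vectors above to unit length.
def normalise(v):
    length = math.sqrt(sum(c * c for c in v))   # e.g. 3.67 for x and y
    return [round(c / length, 2) for c in v]

x = [3, 1.5, 1.5, 0, 0, 0]
y = [0, 1.5, 1.5, 3, 0, 0]
z = [0, 0, 0, 0, 3, 3]
print(normalise(x))   # [0.82, 0.41, 0.41, 0.0, 0.0, 0.0]
print(normalise(y))   # [0.0, 0.41, 0.41, 0.82, 0.0, 0.0]
print(normalise(z))   # [0.0, 0.0, 0.0, 0.0, 0.71, 0.71]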

Now the Needleman-Wunsch scores come into the game.

We compute the close terms.

Close(x,y,0.55): apple is not included – although it is similar to corporation in document y by 0.55, there is another word in x, corporation, which bonds to it with 1 and thus blocks apple’s match. corporation is included (it has the strongest bond with corporation in document y), and so is usa (it has the strongest bond with usa in document y).

SoftIDFxy.png

  • close(x,y,0.55) = {corporation,usa}
  • close(x,z,0.55) = {apple}
  • close(y,z,0.55) = {} (no pair passed the threshold)

SoftIDFyz.png

In the final step we compute the features, giving a weight to each component of the TF/IDF formula. We are looking for the closest vectors. The narrower the angle, the larger its cosine. Thus we have to calculate the cosine of the angle between the vectors and pick the largest one. As our vectors are normalised, the cosine formula reduces to computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
  • similarity(y,z) = 0% (remember, no word pair – including corp and corporation – passed our threshold)

Voila! By Needleman-Wunsch the most similar strings are

  • Apple Corporation, USA
  • Corp. Aple

Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I set it different from Needleman-Wunsch just for fun, to learn the differences better).

See in the pictures how we get to the closest results by keeping the strongest bonds:

SoftIDFJaroxy.png

SoftIDFJaroxz.jpg

SoftIDFJaroyz.jpg

  • close(x,y,0.45) = {corporation,usa}
  • close(x,z,0.45) = {apple,corporation}
  • close(y,z,0.45) = {corporation}

Now let’s calculate the features from the normalised vectors (the same vectors we calculated before).

softIDF-norm

As these vectors are normalised, I’ll remind you that the cosine formula is simply the dot (scalar) product (see the sketch after the list below).

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
  • similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%
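
Here is a sketch of this whole Soft TF/IDF step in Python. The coordinates are the normalised TF/IDF vectors from above; the little jw table is just my stand-in for a real Jaro-Winkler function, holding only the scores used in this post:

x = {"apple": 0.82, "corporation": 0.41, "usa": 0.41}
y = {"ibm": 0.82, "corporation": 0.41, "usa": 0.41}
z = {"aple": 0.71, "corp": 0.71}

jw = {("apple", "aple"): 0.93, ("corporation", "corp"): 0.87}
def sim(a, b):                       # stand-in for Jaro-Winkler
    if a == b:
        return 1.0
    return jw.get((a, b), jw.get((b, a), 0.0))

def soft_tfidf(u, v, k=0.45):
    total = 0.0
    for tu in u:                     # each term keeps its strongest bond...
        tv = max(v, key=lambda t: sim(tu, t))
        if sim(tu, tv) >= k:         # ...if it passes the threshold
            total += u[tu] * v[tv] * sim(tu, tv)
    return total

print(round(soft_tfidf(x, y), 2))    # 0.34
print(round(soft_tfidf(x, z), 2))    # 0.79
print(round(soft_tfidf(y, z), 2))    # 0.25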

Voila! By Jaro-Winkler the most similar strings again are

  • Apple Corporation, USA
  • Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?


Big Data: combining string and set matching methods. One of hybrid similarity measures – generalised Jaccard index


It’s time to start combining string and set matching methods – let’s have a look at one of the hybrid similarity measures.

The classic Jaccard measure considers overlapping tokens (words, q-grams). To be considered overlapping, the tokens must be identical. The Jaccard index works very well when names are in mixed order, like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However, pure Jaccard is too restrictive when the text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some of the names were written with errors?

Generalized Jaccard measure helps.

First of all, as usual, we convert the comparable strings into tokens. I’ll reuse the example and put some errors in the names.

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll also learn soft overlap – using the best matching pairs.

First step: compare each pair

To compare, we need a similarity measure s which returns values in 0..1 (the closer to 1, the more similar).

For more fun let’s apply two measures to the same pairs – Edit distance (Levenshtein distance) and the Jaro-Winkler measure – see how the results differ? :) I used https://asecuritysite.com/forensics/simstring (sorry, this page has a bug in the Jaro-Winkler algorithm – it is not true that (janis,martins) scores 0 by JW, it should be 0.67 – but I could not find any other online calculator, and for our experiment this bug is acceptable. Let’s use it as an example of how easy it is to misuse a method when we simply believe somebody’s programmed result without understanding it.)

Sim_scores.PNG

Second step.

Choose a threshold and keep only those pairs which exceed it. I have chosen threshold 0.5.

Threshold for Edit Distance

Threshold_edit_distance.PNG

Threshold for Jaro-Winkler

Threshold_Jaro_Winkler.PNG

Third step. Keep only the strongest bonds.

To find them, I first draw all the bonds above the threshold.

Edit_distance_bonds

Martins and Martins are of course bonded. That means no other bonds are possible from or to them.

Karlina to Karloina is the next strongest remaining bond. Again, no other bonds from/to these two.

We are left with Anna to Annija, because all other bonds relate to “engaged” entities. For example, Kartina cannot bond to Karloina with 0.71 because Karlina bonded to it with 0.88.

We then calculate the weight of the matching by adding up all the match scores (2.55) and dividing by (all names in class X plus all names in class Y minus the matching pairs): 2.55 / (6 + 5 − 3) = 0.319 = 32% similarity when we hybridise Jaccard with Edit distance.

Edit_distance_bond.jpg

Now let’s do the same for Jaro-Winkler. First of all, all the bonds above the threshold:

Jaro-bonds.jpg

and keep only the strongest bonds. Again you see, for example, that Kartina cannot bond to Martins with 0.88 because Martins bonded to Martins with 1. Kartina also cannot bond to Karloina with 0.91 because Karlina bonded to Karloina with 0.98.

And the formula again – the matching weight divided by the sum of entities minus the matching pairs – voila! = 0.707 = 71% similarity when we hybridise Jaccard with the Jaro-Winkler measure.

Jaro_bond
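
The whole procedure fits into a few lines of Python. This is a sketch of the steps above – sim() can be any 0..1 similarity function (Edit-distance-based, Jaro-Winkler, …), and the greedy loop keeps the strongest free bond, as in the pictures:

def generalised_jaccard(X, Y, sim, k=0.5):
    # all pairs, strongest bond first
    bonds = sorted(((sim(a, b), a, b) for a in X for b in Y), reverse=True)
    used_x, used_y = set(), set()
    total, matched = 0.0, 0
    for s, a, b in bonds:
        if s < k:
            break                                # below the threshold
        if a not in used_x and b not in used_y:  # both still "free"
            used_x.add(a); used_y.add(b)
            total += s
            matched += 1
    # sum of match scores / (|X| + |Y| - matching pairs),
    # e.g. 2.55 / (6 + 5 - 3) = 0.319 with the Edit distance scores above
    return total / (len(X) + len(Y) - matched)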

I’ll remind that in my previous blog entry explaining Jaccard measure I showed that

  • similarity with correctly spelled names and requirement for name equality was 10%
  • similarity using bigrams was 50%
  • similarity with trigrams was 26%

Today we calculated (on a slightly modified set – errors added)

  • similarity with hybrid method with Edit distance was 32%
  • similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.


Big Data: set similarity : TF/IDF scores and Feature vectors to devaluate terms common in other documents


Term Frequency, Inverse Document Frequency and Feature Vectors. Another great concept which is easy when you start doing it.

Which two of these strings are most likely about the same real world entity?

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

I believe you see: first and third. Why do you think so? Because of Apple there? Don’t you see the Corporation, Corp and USA? You somehow understand these are ancillary words, less meaningful.

How do we teach a computer to recognize that context? If we look at string level – Edit distance or Jaro-Winkler, Affine gap or Smith-Waterman – none of them will solve it; they will all consider these two more similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation

Well, Jaccard distance would spot the answer we are looking for but it is nice to learn another sophisticated method with added value – recognizing frequently used terms.

Today’s topic: two sets are similar if they share many frequent terms unless these terms are common in other strings.

The TF score will show us how frequently a term is used.

The IDF score will show us whether the term is common in other strings.

Feature vectors will reveal the most similar strings. A feature vector is a vector that contains information describing an object’s important characteristics – in our case these characteristics (measurable properties) will be TF and IDF scores.

Let’s start with some data cleaning.

1) remove commas and brackets (I might also remove a, the, and, semicolons etc)

  • Apple Corporation USA
  • IBM USA Corporation
  • Apple Corp

2) convert all to lowercase

3) tokenize (split strings into words) and apply Smith-Waterman and/or Jaro-Winkler to calculate that Corp is similar to Corporation – I’ll assume that these words are equal (I might have set Corporation in my example but I wanted to make this example more challenging to show that data cleaning and applying rules is a normal step)

4) convert ‘corp’ to ‘corporation’ so that I do not have to explain each time that I have a function which recognizes the similarity of these terms

Let’s tokenize our collection into a bag of terms and call them documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {apple,corporation}

Term set T is {apple,corporation,usa,ibm}

Term Frequency (TF)

is the number of times each term appears in each document.

TF(apple,x) = 1 (see, document x contains the term 'apple' once)
TF(apple,y) = 0 (this document does not contain 'apple')
TF(apple,z) = 1
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 1 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0

From this we can also read that if a document contains the term, the score is > 0, otherwise 0. A TF score can’t be negative.

Inverse Document Frequency (IDF)

is the number of documents in the collection (N) divided by the number of documents containing the term (Nd).

IDF_formula.PNG

IDF(apple) = 3/2 (two of three contain apple) = 1.5
IDF(corporation) = 3/3 (all three contain corporation) = 1
IDF(usa) = 3/2 (two of three contain usa) = 1.5
IDF(ibm) = 3/1 (only one of three contains ibm) = 3
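
A quick sketch of these TF and IDF computations in plain Python (my illustrative code, using the token bags above):

from collections import Counter

docs = {"x": ["apple", "corporation", "usa"],
        "y": ["ibm", "usa", "corporation"],
        "z": ["apple", "corporation"]}
tf = {d: Counter(bag) for d, bag in docs.items()}          # term counts
terms = {t for bag in docs.values() for t in bag}
idf = {t: len(docs) / sum(t in bag for bag in docs.values())
       for t in terms}                                     # N / Nd

print(tf["x"]["apple"], tf["y"]["apple"])   # 1 0
print(idf["apple"], idf["corporation"])     # 1.5 1.0
print(idf["ibm"])                           # 3.0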

Well, now we have numbers. And?

Feature vectors

Do you remember vectors from school? It was that something like an arrow. A plane flying north at 850 km/h is like one vector, and the wind blowing north-west is like another vector. If the wind’s velocity is high enough, from the ground the plane might seem to be slipping sideways a little.

Vectors are used in computing very, very often. Example: v = [R; G; B]; is a feature vector containing color components of a pixel or an object.

In our case

  • documents x,y,z will be the vectors
  • the term set T = {apple,corporation,usa,ibm} represents the dimensions (yeah, if a thousand terms, then we have a thousand-dimensional vector, uhh!)
  • features will be calculated from the TF and IDF scores
  • we will find whether the vectors point in the same direction (the angle between them is small). If yes – we will consider these documents similar

The formula to calculate the feature for each document and each term is:

Featureformula.PNG

Now let’s calculate the features to draw a table (matrix). Note that all our features are greater than or equal to 0 (because negatives are not possible).

F(apple, x) = TF(apple, x) * IDF(apple) = 1 * 1.5 = 1.5
F(apple,y) =  0 * 1.5 = 0
F(apple,z) = 1 * 1.5 = 1.5
F(corporation,x) = 1 * 1 = 1
F(corporation,y) = 1 * 1 = 1
F(corporation,z) = 1 * 1 = 1
F(usa,x) = 1 * 1.5 = 1.5
F(usa,y) = 1 * 1.5 = 1.5
F(usa,z) = 0 * 1.5 = 0
F(ibm,x) = 0 * 3 = 0
F(ibm,y) = 1 * 3 = 3
F(ibm,z) = 0 * 3 = 0

Feature_matrix.PNG

These are the coordinates of our vectors in four-dimensional space (as we have four terms).

X (1.5, 1, 1.5, 0)
Y (0, 1, 1.5, 3)
Z (1.5, 1, 0, 0)

Our vectors can only have features greater than or equal to 0 (because we can’t have negative term frequencies in a document), which means the angle between them will never be greater than 90° and the cosine will never be less than 0. See my example visualisation for three three-dimensional vectors (https://academo.org/demos/3d-vector-plotter/).

Vectors.png

We have to find which pair of our three documents’ feature vectors x,y,z makes the closest angle. You remember trigonometry, don’t you? The narrower the angle, the larger its cosine. Thus we have to calculate the cosine of the angle between the vectors and pick the largest one.

The official formula – uh, looks impressive, doesn’t it? (I wrote it using http://math.typeit.org/)

Cosine_formula.PNG

It is not that complex: the sum of the coordinate products (the dot product) divided by the product of the vectors’ lengths (magnitudes).

  • vector x = {apple,corporation,usa} coordinates (1.5,1,1.5,0)
  • vector y = {ibm,usa,corporation} (0,1,1.5,3)
  • vector z = {apple,corporation} (1.5,1,0,0)
  • Term set T is {apple,corporation,usa,ibm}

vector pair x,y scalar product and magnitude

Vectorxy

Cosine of angle between vectors x,y is

cosalfa.png

vector pair xz

Vectorxz.png

vector pair y,z

Vectorsyz
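
Before announcing the winner, a tiny cross-check of these dot products in Python (vectors as listed above):

import math

x = [1.5, 1, 1.5, 0]    # dimensions: apple, corporation, usa, ibm
y = [0, 1, 1.5, 3]
z = [1.5, 1, 0, 0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    mag = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / mag

print(round(cosine(x, y), 2))   # 0.4
print(round(cosine(x, z), 2))   # 0.77  <- the closest pair
print(round(cosine(y, z), 2))   # 0.16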

The larger the cosine, the closer the angle. And the winner – the most similar pair, having the closest angle – is…

  • Apple Corporation, USA
  • Corp. Apple

Voila!

P.S. You might also play with this feature formula:

logr.PNG


Big Data: set similarity : q-grams, Overlap measure, Jaccard index, Jaccard distance


Today’s topic: is John Smith similar to Smith John? As you see, Edit distance 10 would measure them as not similar, and so would Needleman-Wunsch (score = −5). Smith-Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find out which class has the names most similar to your class? He answered: well, we could take the names one by one, calculate an Edit distance score for each, and then sum the results together?

Let’s have a closer look at some official methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start by having names as our tokens. Set similarity operates with tokens, not “words”, because words sound like something meaningful to us. Tokens are… just tokens :) – anything we have agreed on when splitting the set; you’ll see several approaches.

Simple: if we use a word as a token, then the string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that during tokenisation we remove common stop words like “of, the, and” or “Mr, Ms, fon, da” etc.

Then the string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

q-grams or n-grams

This impressive name just means that we split the string into substrings of length q. I’ll use David and show you the two approaches used – which one you choose depends on your needs and the further usage of the results. (A small code sketch follows the examples below.)

q=1 (unigram):

  • {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

  • first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad the ends)
  • second approach: {da,av,vi,id}

q=3 (trigram):

  • {##d,#da,dav,avi,vid,id#,d##}
  • {dav,avi,vid}

q=4 (four-gram):

  • {###d,##da,#dav,davi,avid,vid#,id##,d###}
  • {davi,avid}

q=5 (five-gram):

  • {####d,###da,##dav,#davi,david,avid#,vid##,id###,d####}
  • {david}
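
Both approaches fit into one small Python function – a sketch:

def qgrams(s, q, pad=True):
    s = s.lower()
    if pad:
        s = "#" * (q - 1) + s + "#" * (q - 1)   # first approach: pad the ends
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("David", 2))              # ['#d', 'da', 'av', 'vi', 'id', 'd#']
print(qgrams("David", 2, pad=False))   # ['da', 'av', 'vi', 'id']
print(qgrams("David", 3, pad=False))   # ['dav', 'avi', 'vid']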

I used R for some tests – see, it does the second approach:

R_qgrams

Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects the language by comparing q-grams to the statistics of languages. This blog entry of mine could be split into bigrams and the result “English” would be returned.

Q-gramming may also be applied at word level.

The quick brown fox jumps over a lazy dog.

bigram:

  • {# the, the quick, quick brown, brown fox, fox jumps, jumps over, over a, a lazy, lazy dog, dog #}
  • {the quick, quick brown, brown fox, fox jumps, jumps over, over a, a lazy, lazy dog}

trigram

  • {# # the, # the quick, the quick brown, quick brown fox, brown fox jumps, fox jumps over, jumps over a, over a lazy, a lazy dog, lazy dog #, dog # #}

etc.

Note it is the analyst’s (my, your) decision whether the method used requires some data cleaning, like removal of punctuation, whitespace, de-capitalising letters, etc.

The overlap measure

Let’s do some simple examples to learn the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation:

Let S1 = set of tokens of string 1.

  • {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

  • {#c,ca,ar,ro,ol,l#}

Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation:

  • {##c,#ca,car,arl,rlo,lo#,o##}
  • {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

The Jaccard index a.k.a coefficient a.k.a measure

Well, you saw the Overlap measure. What is it? It actually has no value by itself, without context. We need to make it usable. Let’s have an index that is always in 0..1 – the closer to 1, the higher the set similarity. I’ll also multiply by 100 to get percent.

Jaccard(S1,S2) = count of common tokens in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing it a bit more scientifically:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union keeps only the unique tokens, while union all keeps everything.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)

Jaccard

String 1 Carlo, string 2 Carol.

If I use bigrams for tokenisation:

  • S1 = {#c,ca,ar,rl,lo,o#}
  • S2 = {#c,ca,ar,ro,ol,l#}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%
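In Python the whole measure is a couple of set operations – a minimal sketch with the bigram sets above:

def jaccard(tokens1, tokens2):
    s1, s2 = set(tokens1), set(tokens2)
    return len(s1 & s2) / len(s1 | s2)   # overlap / unique tokens

s1 = {"#c", "ca", "ar", "rl", "lo", "o#"}   # Carlo, bigrams
s2 = {"#c", "ca", "ar", "ro", "ol", "l#"}   # Carol, bigrams
print(round(jaccard(s1, s2), 2))            # 0.33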

Visualisation from http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

Jaccard_online

If I use trigrams for tokenisation:

  • S1 = {##c,#ca,car,arl,rlo,lo#,o##}
  • S2 = {##c,#ca,car,aro,rol,ol#,l##}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

Interpreting the result:

  • Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
  • If they have no common tokens, they are 0% similar.
  • Half – 50% – means that the two sets have half of their members in common.

The Jaccard distance

A measure of how different or dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or the percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams have Jaccard distance = 100% – 27% = 73%.

So what?

Here you might ask: hey, but you showed they are not usable at all! Carlo IS similar to Carol!

Let me remind you that I was showing the idea on a very simplified example, and that we are talking about set similarity, not string similarity.

To treat them as strings, let’s use Jaccard for Carlo and Carol with q=1:

  • {c,a,r,l,o}
  • {c,a,r,o,l}

Voila! Jaccard measure = 1 (100%), Jaccard distance = 0%.

Because THE SETS OF SYMBOLS are equal!

Jaccard_online2.PNG

Now back to sets and names in classes.

Tokenisation method 1 – names are tokens:

S1={Katrina,Janis,Karlis,Martins,Anna,Karlina}

S2={Annija,Martins,Marta,Karolina,Ance}

O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: not similarity, but dissimilarity)

Conclusion: by method 1 names in these classes are not similar.

And now let’s calculate the same using bigrams :)

Tokenisation method 2 – bigrams are tokens:

I used R to split into bigrams

R-trigrams.PNG

and Excel for visualisation

Bigrams.PNG

Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2 names in these classes are half similar.

Tokenisation method 3 – trigrams are tokens:

Trigrams.PNG

Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3, names in these classes are only a little similar.

Have you got the idea? Now you can apply the same method to different classes and have a reliable way to measure similarity.

Which one to choose?

The method which suits YOU best. I can’t predict whether you will consider Marta and Martins similar or dissimilar in YOUR business case.


Big Data: string similarity: dealing with typos (Jaro meausre, ups, measure)


Have you evre had a typo? How do we instruct the computer that Luxmeb most likely means Luxembourg? (Try to write it in Google and, see, it knows! :) Let’s have a look – how.

For this type of short strings, like words, the Jaro measrue, ups, Jaro measure (or Jaro distance, or Jaro score) helps.

The higher the Jaro distance for two words is, the more similar these words are. A score of 0 means no similarity and 1 means an exact match. Sometimes the result is used in percent; I’ll show both.

The idea of the Jaro measure is the assumption that

if one word contains nearly the same letters as the other and these letters can quite easily be reordered

to match – like length and lenght – or teh/the, dacnign/dancing

– then the words are similar.

Values used in the calculation of the Jaro Distance:

  1. Length of words being compared
  2. Count of common characters
  3. Count of transpose (exchange) operations needed to reorder the common characters into the same order

The official formula looks complex, as scientists usually make them to impress us:

Jaro1

(source: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)

It is really easy, let’s start with Jon and John.

  • s1 = Jon length 3
  • s2 = John length 4
  • m = 3 (three common characters: J, o, n)
  • t = 0 (no symbol transpositions needed from Jon to John, because the matching symbols are already in the same order)

jaro(Jon,John) = 1/3 × (3 matching symbols / 3 length of Jon + 3 matching symbols / 4 length of John + (3 matching symbols − 0 transpositions) / 3 matching symbols) = 1/3 × (3/3 + 3/4 + 3/3) = 1/3 × (1 + 0.75 + 1) = 1/3 × 2.75 = 0.917 (quite similar), or 92%

Edit distance of Jon, John = 1. The Levenshtein distance measure is 1 − d(x,y) / max(length(x), length(y)): s(Jon,John) = 1 − 1 / max(3, 4) = 1 − 1/4 = 0.75, or 75% similarity.

You see, Jaro score 92% is better than Levenshtein’s 75%.

and also better than

  • Needleman-Wunsch 75% (scoring 1;-1;-1) (three matches 1×3 + one gap penalty -1 = 2)
  • Smith-Waterman 83% (Jo and Jo, n and n are the best matching substrings)

(calculator to play – https://asecuritysite.com/forensics/simstring)

Let’s calculate Jaro now for Jon and Ojhn

  • s1 = Jon length 3
  • s2 = Ojhn length 4
  • m = 3 (three common characters J, o, n)
  • t = 1 (one transposition needed to reorder the common symbols: ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-1)/3) = 0.81 or 81% similarity

Jaro for Jon and Ohnj (I reordered symbols)

  • s1 = Jon length 3
  • s2 = Ohnj length 4
  • m = 3 (three common characters J, o, n)
  • t = 2 (two transpositions needed to reorder the common symbols: onj – ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-2)/3) = 0.70 or 70% similarity

The Levenshtein distance for JON, OHNJ is 3; the measure is 1 − 3/4 = 0.25, or 25% similarity. See the difference?

Jaro for Luxmeb and Luxembourg

  • s1 = Luxmeb length 6
  • s2 = Luxembourg length 10
  • m = 6 (six common characters: L, u, x, e, m, b)
  • t = 1 (one transposition needed to reorder the common symbols: Luxmeb – Luxemb)

Jaro = 1/3 × (6/6 + 6/10 + (6−1)/6) = 0.81, or 81% similarity.

The Edit distance result is 1 − 6 / max(6,10) = 1 − 6/10 = 0.4, or 40% similarity.
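
For reference, here is a compact Python sketch of the Jaro measure. Note that the standard definition only counts a character as common if it occurs within a matching window of max(len)/2 − 1 positions – the hand-worked examples above count common characters a bit more loosely, but for these pairs the results agree:

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    m1, idx = [], []
    for i, c in enumerate(s1):                      # find common characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                idx.append(j)
                break
    if not m1:
        return 0.0
    m2 = [s2[j] for j in sorted(idx)]               # their order in s2
    t = sum(a != b for a, b in zip(m1, m2)) / 2     # transpositions
    m = len(m1)
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("Jon", "John"), 3))            # 0.917
print(round(jaro("Luxmeb", "Luxembourg"), 2))   # 0.81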

A variation to handle prefixes is the Jaro-Winkler measure.

In real life it is a very common situation that the strings have the same beginning (I bleieve you all konw tihs jkoe) – like beginnign, Luxmebogr, lenhgt, meausre. To take that into account and improve the scoring, there is a measure:

Jaro2.PNG

  • dw is Jaro-Winkler measure result
  • dj is Jaro distance – like we calculated before
  • l is the length of the longest common exactly matching prefix (often capped at a maximum of 4)
  • p is the weight given to the prefix (how important it is for us); the default tends to be 0.1

Another way to express the same formula is

  • jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW

where

  • PL = length of the longest common prefix
  • PW is a weight given to the prefix

Let’s calculate Jaro-Winkler for Luxmebogr and Luxembourg.

  • I’ll use the recommended prefix weight 0.1
  • Jaro distance (s1=9, s2=10, m=9, t=1) = 0.91 (91%)
  • Longest common exactly matching prefix is Lux

Jaro-Winkler distance is

  • 0.91 + (3 × 0.1 × (1 − 0.91)) = 0.94, or 94% similarity (first formula)
  • (1 − 3 × 0.1) × 0.91 + 3 × 0.1 = 0.94, or 94% similarity (second formula – of course the same, because the formulas are equal, just written in another way)

Jaro-Winkler for Luxmeb and Luxembourg

Jaro was 0.81 or 81% similarity

0.81 + (3 × 0.1 × (1 − 0.81)) = 0.87, or 87% similarity
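
And the prefix bonus on top of it – a sketch reusing jaro() from above:

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    d = jaro(s1, s2)
    pl = 0                             # common prefix length, capped at 4
    for a, b in zip(s1, s2):
        if a != b or pl == max_prefix:
            break
        pl += 1
    return d + pl * p * (1 - d)

print(round(jaro_winkler("Luxmeb", "Luxembourg"), 2))   # 0.87 (prefix 'Lux')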

Can’t believe it is so easy, can you?

There are many more string similarity measures and algorithms in the world, with different running times, different precision and a lot of adaptations for specialised cases like nucleotides.

However, if you have read all my string similarity posts, you should have a good foundation of the basic ones.


Big Data: string similarity: best matching substrings between two strings (Smith-Waterman algorithm)


The previous methods I wrote about considered matching all characters of one string to all characters of the other (when the entire string is aligned, it is called global alignment). As you will see from our new challenge, this is not suitable in some cases:

can we somehow find out that Mr. Alexander Fitzgerald, University of Canterbury is similar to Professor, Dr.sc.comp Alexander Williams Fitzgerald from Canterbury, New Zealand?

Edit distance = 54 (https://planetcalc.com/1721/) – very long.

Needleman-Wunsch: −32 (http://bioinformaticnotes.com/Needle-Water/) – similarity quite low.

A better idea: find two substrings of the given strings that are the most similar to one another

(when parts of strings are compared, it is called local alignment). We can already guess the best matching one will be ‘Fitzgerald’, but the computer must be provided with the logic. The method I am blogging about was first proposed by Temple F. Smith and Michael S. Waterman in 1981.

The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are reset to zero, so a match can now be restarted at any position, not only at the very corner. It means that if you have learned to calculate Needleman-Wunsch, it is easy to adapt to Smith-Waterman. Just stay positive :)

The idea is simple, yet brilliant:

  1. add one more condition to Needleman – Wunsch: the lowest possible value is 0, so the result will never be negative:

Smith-Waterman4.PNG

2. When backtracking for the alignment, just find the highest value and trace back until you reach 0. Repeat with the next highest value as many times as you want.

  • Best match is AWGHE to AW_HE
  • Next best: HEA to HEA

Smith-Waterman5.PNG

(source http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf)
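
Here is a small sketch of the scoring matrix in Python. The weights match = 2, mismatch = −1, gap = −1 are my illustrative choice (the cited pages use their own); note the max(0, …) that distinguishes it from Needleman-Wunsch:

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:                      # remember the peak cell;
                best, best_pos = H[i][j], (i, j)    # backtracking starts here
    return best, best_pos

# 'WILL' + ' SMITH' aligned with three gaps: score 4*2 - 3 + 6*2 = 17
print(smith_waterman("WILL SMITH", "WILLIAM SMITH"))   # (17, (10, 13))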

Let’s look at a sample that tries to fit the picture on the screen:

  • PROFESSOR WILLIAM SMITH DR.SC.COMP CANTERBURY
  • DUDE WILL SMITH NEIGHBOUR

I believe YOU see that the best matching substrings are WILL and SMITH. But the computer doesn’t have the brains you have. Using the Smith-Waterman measure, the computer can only calculate it by a predefined algorithm.

You probably cannot see it there, but in this picture the SMITH-to-SMITH max result in the bottom right is 18; WILL to WILL is 10.

Smith-Waterman.PNG

(source: http://bioinformaticnotes.com/Needle-Water/)

Here comes the full table of the original example – ALEXANDER, FITZGERALD, CANTERBURY:

Smith-Waterman2.PNG

Let’s add some foreign language for fun – you’ll see that Smith and Waterman were not magicians either; however, we can make a good guess with their algorithm:

  • DR.ALEKSANDRS VILJAMS FICDŽERALDS KENTERBERIJA
  • MR. ALEXANDER FITZGERALD CANTERBURY

and we see the most similar substrings:

  • KENTERBERI-CANTERBURY
  • ERALDSKENTERB-ERALD_CANTERB
  • FI-FI
  • ALEKSANDRS VI – ALEX_ANDER FI

Smith-Waterman3

When you google, you’ll see a lot of bioinformatics and related samples like TGCTACCGTAA… I prefer a bit more human-readable examples.

