Big Data: set similarity : q-grams, Overlap measure, Jaccard index, Jaccard distance

Today’s topic: is John Smith similar to Smith John? As you see, Edit distance 10 will measure as not similar, so as Needleman-Wunsch (score = – 5). Smith Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find guys of which class have the most similar names to your class? He answered: well, we could take name by name and calculate Edit distance score for each and then sum results together?

Let’s have a look closer on some official methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start having names as our tokens. Set similarity operates with tokens, not “words” because words sound like are something meaningful for us. Tokens are… just tokens :) anything we have agree how to split the set – you’ll see several approaches.

Simple: if we use word a s token then string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that we during tokenisation remove common stop words like “of, the, and” or ” Mr, Ms, fon, da” etc.

Then string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

q-grams or n-grams

This impressive name just means that we split the string in substrings of length q. I’ll use David and show you two approaches used – which approach you choose  later depends on your needs and further usage of results.

q=1 (unigram):

• {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

• first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad for the length)
• second approach: {da,av,vi,id}

q=3 (trigram):

• {##d,#da,dav,avi,vid,id#,d##}
• {dav,avi,vid}

q=4 (four-gram):

• {###d,##da,#dav,davi,avid,vid#,id##,d###}
• {davi,avid}

q=5 (five-gram):

• {####d,###da,##dav,#davi,david,#avid,##vid,###id,####d}
• {david}

I used R for some tests, see it does the second approach:

Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects language by comparing q-grams to the statistical results of languages. This my blog entry could be split in bigrams and would have result “English” returned.

Q-gramming may be applied to word level also.

The quick brown fox jumps over a lazy dog.

bigram:

• {# the,the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog,dog #}
• {the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog}

trigram

• {# # the, # the quick,the quick brown,quick brown fox,brown fox jumps,fox jumps over,jumps over a,over a lazy,a lazy dog,lazy dog #,dog # #}

etc.

Note it is a analyst’s (my, yours) decision if the method used requires some data cleaning like removal of punctuation, whitespaces, letter de-capitalising etc.

The overlap measure

Let’s do some simply examples for learning the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation.

Let S1 = set of tokens of string 1.

• {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

• {#c,ca,ar,ro,ol,l#}

Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation.

• {##c,#ca,car,arl,rlo,lo#,o##}
• {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

The Jaccard index a.k.a coefficient a.k.a measure

Well, you saw the Overlap measure. What is it? It actually has no value itself without context. We need to make it usable. Let’s have an index always 0..1 – the more to 1, the higher is set similarity. I’ll also multiply by 100 to have percents.

Jaccard(S1,S2) = count of common tokens  in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing a bit more scientific:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union are all unique, while union all are all.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)

String 1 Carlo, string 2 Carol.

If I use bigrams for tokenisation:

• S1 = {#c,ca,ar,rl,lo,o#}
• S2 = {#c,ca,ar,ro,ol,l#}
• Overlap(S1,S2) = 3
• total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%

Visualisation from http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

If I use trigrams for tokenisation:

• S1 = {##c,#ca,car,arl,rlo,lo#,o##}
• S2 = {##c,#ca,car,aro,rol,ol#,l##}
• Overlap(S1,S2) = 3
• total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

Interpreting the result:

• Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
• If they have no common tokens, they are 0% similar.
• The half –  50% – means that the two sets have common half of the members.

The Jaccard distance

Measure of how different or dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams have Jaccards distance = 100% – 25% = 75%

So what?

Here you might ask: hey, but you showed they are not usable at all! Carlo IS similar to Carol!

Let me remind I was showing the idea on a very simplified example and we are talking about set similarity, not string similarity.

To treat as a string, let’s use Jaccard for Carlo and Carol having q=1

• {c,a,r,l,o}
• {c,a,r,o,l}

Voila! Jaccard measure = 1, Jaccard distance = 0%.

Now back to sets and names in classes.

Tokenisation method 1 – names are tokens:

S1={Katrina,Janis,Karlis,Martins,Anna,Karlina}

S2={Annija,Martins,Marta,Karolina,Ance}

O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: not similarity, but dissimilarity)

Conclusion: by method 1 names in these classes are not similar.

And now let’s calculate the same using bigrams :)

Tokenisation method 2 – bigrams are tokens:

I used R for splitting to bigrams

and Excel for visualisation

Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2 names in these classes are half similar.

Tokenisation method 3 – trigrams are tokens:

Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3 names in these classes are quite a little similar.

Have you got the idea? Now you can apply the same method on different classes and have a reliable method to measure similarity.

Which one to choose?

The method which suits YOU the best. I can’t predict, will you consider Marta and Martins similar or dissimilar in YOUR business case.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

One response to this post.

1. […] had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some […]

Patīk

Atbildēt