Big Data: combining string and set matching methods. One of hybrid similarity measures – generalised Jaccard index


It’s time to start to combine string and set matching methods – let’s have a look at one of Hybrid Similarity measures.

Classic Jaccard measure considers overlapping tokens (words, q-grams). To be considered as overlapped, the token must be identical. Jaccard index works very well when names are in mixed order like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However pure Jaccard is too restrictive when text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some of names were written with errors?

Generalized Jaccard measure helps.

First of all, we as usual convert the comparable string into tokens. I’ll reuse the example and put some errors in names

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll learn also soft overlap – using the most matching pairs.

First step. compare each pair

To compare we need a similarity measure s which returns values 0..1 (the closer to 1 the more similar).

For more fun let’s apply two for the same pairs- Edit distance (Levenshtein distance) and Jaro-Winkler measure – see, the result differs? :) I used https://asecuritysite.com/forensics/simstring (sorry, this page has a bug in Jaro-Winkler algorithm – because it is not true (janis,martins) has 0 by JW (it should be 0.67) – but I could not find any other online calculator and for our experiment this bug is acceptable and let’s use it as an example how easy is to misuse method when we simply believe to somebody’s programmed result without understanding)

Sim_scores.PNG

Second step.

Choose threshold and keep only those who exceed. I have chosen threshold 0.5.

Threshold for Edit Distance

Threshold_edit_distance.PNG

Threshold for Jaro-Winkler

Threshold_Jaro_Winkler.PNG

Third step. Keep only the strongest bond.

To find that I draw all the upper threshold bonds at first.

Edit_distance_bonds

Martins and Martins are of course bonded. It means no any other bonds possible from or to.

Karlina to Karloina has the next strongest remaining bond. Again, no other bonds from/to.

We have left Anna to Annija because all other bonds relate to “engaged” entities. For examle, Kartina cannot bond to Karloina with 0.71 because Karlina bonded with 0.88.

We calculate then weight of matching by adding all the match scores (2.55) and divide by the (all name in class X plus all names in class B minus matchinmg pairs) = 0.319 = 32% similarity when we hybridise Jaccard with Edit Distance.

Edit_distance_bond.jpg

Now let’s do the same for Jaro-Winkler. First of all, all bonds upper than threshold:

Jaro-bonds.jpg

and keep only the strongest bonds. again you see, for example, Kartina cannot bond 0.88 with Martins because Martins bonded to Martins with 1. Kartina cannot also bond with 0.91 to Karloina because Karlina bonded to Karloina with 0.98.

And formula again – matching weight divided by sum of entities minues mathing pairs and – voila! = 0.707 = 71% similarity when we hybridise Jaccard with Jaro-Winkler measure.

Jaro_bond

I’ll remind that in my previous blog entry explaining Jaccard measure I showed that

  • similarity with correctly spelled names and requirement for name equality was 10%
  • similarity using bigrams was 50%
  • similarity with trigrams was 26%

Today we calculated (on a slightly modified set – errors added)

  • similarity with hybrid method with Edit distance was 32%
  • similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Advertisements

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google+ photo

You are commenting using your Google+ account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: