Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words


Some days ago I covered the topic about finding which two of these strings are most likely about the same real world entity – by recognizing requently used words and assigning them lower impact, thus first and third options were found as most similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

Let’s add a challenge: misspell the word

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Aple

And, as you might imagine, TF/IDF measure is not effective anymore because it cannot recognize Apple is similar to Aple as in classic implementation is looking for equality, not similarity. Today, similar like we did in generalised Jaccard, we will replace the requirement for words to be equal with requirement to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). I have chosen today Needleman-Wunsch and Jaro-Winkler to illustrate the differences, eg, s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro Winkler. I used calculator https://asecuritysite.com/forensics/simstring

Let’s choose similarity (it is my choice, it could be any other value 0..1)

  • threshold k = 0.55 for Needleman-Wunsch measure
  • threshold k = 0.45 for Jaro-Winkler measure

soft_idf_scores.PNG

Now we will reveal these terms which have similar term in the other document.

Soft TF/IDF based on Needleman-Wunsch

Initial steps are the same as per classic TF/IDEf. Calculate:

Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see we will have six dimension vectors)

TF(apple,x) = 1 (document x has one time apple)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 0 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1

Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

Feature vectors (Feature = TF*IDF)

softIDF-unnorm.PNG

Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1: divide the coordinates by the length of the vector.

length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24

Normalised vectors for documents (each coordinate divided by vector’s length):

softIDF-norm.PNG

Now Needleman-Wunsch scores come into a game.

We compute the close terms.

Close(x,y,0.55) = {(apple is not because its corporation in Document y is similar by 0.55 but there is another word – corporation which bonds by 1 and thus blocks apple’s similarity), corporation (because it has strongest bond with word corporation in document y), usa (because it has strongest bond to usa in document y)}

SoftIDFxy.png

  • close(x,y,0.55) = {corporation,usa}
  • close(x,z,0.55) = {apple}
  • close(y,z,0.55) = {} (noone pair passed threshold)

SoftIDFyz.png

In the final step we compute features but giving a weight to each component to the TF/IDF formula. We are looking for the most closest vectors. The more narrow the angle is, the larger is its cosine. Thus we have to calculate the cosine of the angle between vectors and pick the largest one. As our vectors are normalised, cosine formula now is simple computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
  • similarity(y,z) = 0% (remember, that words pairs, incl. corp and corporation did not pass our threshhold)

Voila! By Needleman-Wunsch the most similar strings are

  • Apple Corporation, USA
  • Corp. Aple

Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I have set different from Needleman-Wunsch just for more fun to learn differences better).

See in pictures how do we get to the closest results by keeping the strongest bonds:

SoftIDFJaroxy.png

SoftIDFJaroxz.jpg

SoftIDFJaroyz.jpg

  • close(x,y,0.45) = {corporation,usa}
  • close(x,z,0.45) = {apple,corporation}
  • close(y,z,0.45) = {corporation}

Now let’s calculate Features from normalized vectors (the same vectors we calculated)

softIDF-norm

As these vectors are normalised, I’ll remind cosine formula is simple computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
  • similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%

Voila! By Jaro-Winkler the most similar strings again are

  • Apple Corporation, USA
  • Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Advertisements

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google+ photo

You are commenting using your Google+ account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: