## Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words

Some days ago I covered the topic about finding which two of these strings are most likely about the same real world entity – by recognizing requently used words and assigning them lower impact, thus first and third options were found as most similar:

• Apple Corporation, USA
• IBM (USA) Corporation
• Corp. Apple

Let’s add a challenge: misspell the word

• Apple Corporation, USA
• IBM (USA) Corporation
• Corp. Aple

And, as you might imagine, TF/IDF measure is not effective anymore because it cannot recognize Apple is similar to Aple as in classic implementation is looking for equality, not similarity. Today, similar like we did in generalised Jaccard, we will replace the requirement for words to be equal with requirement to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

• x = {apple,corporation,usa}
• y = {ibm,usa,corporation}
• z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). I have chosen today Needleman-Wunsch and Jaro-Winkler to illustrate the differences, eg, s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro Winkler. I used calculator https://asecuritysite.com/forensics/simstring

Let’s choose similarity (it is my choice, it could be any other value 0..1)

• threshold k = 0.55 for Needleman-Wunsch measure
• threshold k = 0.45 for Jaro-Winkler measure Now we will reveal these terms which have similar term in the other document.

## Soft TF/IDF based on Needleman-Wunsch

Initial steps are the same as per classic TF/IDEf. Calculate:

### Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see we will have six dimension vectors)

```TF(apple,x) = 1 (document x has one time apple)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1
TF(corporation,y) = 1
TF(corporation,z) = 0
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1
```

### Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

### Feature vectors (Feature = TF*IDF) Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1: divide the coordinates by the length of the vector.

```length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24```

Normalised vectors for documents (each coordinate divided by vector’s length): ### Now Needleman-Wunsch scores come into a game.

We compute the close terms.

Close(x,y,0.55) = {(apple is not because its corporation in Document y is similar by 0.55 but there is another word – corporation which bonds by 1 and thus blocks apple’s similarity), corporation (because it has strongest bond with word corporation in document y), usa (because it has strongest bond to usa in document y)} • close(x,y,0.55) = {corporation,usa}
• close(x,z,0.55) = {apple}
• close(y,z,0.55) = {} (noone pair passed threshold) In the final step we compute features but giving a weight to each component to the TF/IDF formula. We are looking for the most closest vectors. The more narrow the angle is, the larger is its cosine. Thus we have to calculate the cosine of the angle between vectors and pick the largest one. As our vectors are normalised, cosine formula now is simple computing the dot (scalar) product.

• similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
• similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
• similarity(y,z) = 0% (remember, that words pairs, incl. corp and corporation did not pass our threshhold)

Voila! By Needleman-Wunsch the most similar strings are

• Apple Corporation, USA
• Corp. Aple

## Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I have set different from Needleman-Wunsch just for more fun to learn differences better).

See in pictures how do we get to the closest results by keeping the strongest bonds:   • close(x,y,0.45) = {corporation,usa}
• close(x,z,0.45) = {apple,corporation}
• close(y,z,0.45) = {corporation}

Now let’s calculate Features from normalized vectors (the same vectors we calculated) As these vectors are normalised, I’ll remind cosine formula is simple computing the dot (scalar) product.

• similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
• similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
• similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%

Voila! By Jaro-Winkler the most similar strings again are

• Apple Corporation, USA
• Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.