Big Data: string similarity: dealing with typos (Jaro meausre, ups, measure)


Have you evre had a typo? How to instruct the computer that Luxmeb is most likely meant Luxembourg? (try to write in google and, see, it knows! :) let’s have look – how.

For this type of short strings like words Jaro measrue, ups, Jaro measure or Jaro distance or Jaro score helps.

The higher the Jaro distance is for two words is, the more similar these words are. The score 0 means no similarity and 1 means an exact match. Sometimes result is used in percents, I’ll show both.

The Jaro measure idea is assumption that

if one word contains near the same letters as the other and these letters can quite easily be reordered

to match – like length and lenght – or teh/the, dacnign/dancing

–  then words are similar.

Values used in the calculation of the Jaro Distance:

  1. Length of words being compared
  2. Count of common characters
  3. Count of transpose (exchange) operations needed to reorder common characters to equal order

Official formula looks complex, like scientists usually do to impress us:

Jaro1

(source: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)

It is really easy, let’s start with Jon and John.

  • s1 = Jon length 3
  • s2 = John length 4
  • m = 3 (three common characters J, o, n
  • t = 0 (no symbol transpositions needed Jon to John, because matching symbols are already in the same order)

jaro(Jon,John) = 1 / 3 x (3 matching sumbols /3 length of Jon + 3 matching symbols / 4 length of John + ((3 matching symbols – 0 transpositions)/3 matching symbols) = 1 / 3 x (3/3 + 3/4 + 3/3) = 1/3 x (1+ 0.75 + 1) = 1/3 x 2.75 = 0.917 (quite similar) or 92%

Edit distance of Jon, John = 1. Levenstein Distance Measure is  1 – d(x,y) / [max(length(x), length(y))] : s(Jon,John) = 1 – 1 / [max(3, 4)] = 1 – 1/4 = 0.75 or 75% similarity

You see, Jaro score 92% is better than Levenshtein’s 75%.

and also better than

  • Needleman-Wunsch 75% (scoring 1;-1;-1) (three matches 1×3 + one gap penalty -1 = 2)
  • Smith-Waterman 83% (Jo and Jo, n and n are the best matching substrings)

(calculator to play – https://asecuritysite.com/forensics/simstring)

Let’s calculate Jaro now for Jon and Ojhn

  • s1 = Jon length 3
  • s2 = Ojhn length 4
  • m = 3 (three common characters J, o, n)
  • t = 1 (transpositions needed to reorder common symbols (ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-1)/3) = 0.81 or 81% similarity

Jaro for Jon and Ohnj (I reordered symbols)

  • s1 = Jon length 3
  • s2 = Ohnj length 4
  • m = 3 (three common characters J, o, n)
  • t = 2 (transpositions needed to reorder common symbols (onj – ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-2)/3) = 0.70 or 70% similarity

Levenstein distance for JON, OHNJ is 3, measure is 1 – 3/4 = 0.25 or 25% similarity. See the difference?

Jaro for Luxmeb and Luxembourg

  • s1 = Luxmeb length 6
  • s2 = Luxembourg length 10
  • m = 6 (seven common characters L,u,x,e,m,b)
  • t = 1 (transpositions needed to reorder common symbols:  Luxmeb – Luxemb

Jaro = 1 / 3 x (6/6 + 6/10 + (6-1)/6) = 0.81 is 81% similarity

Edit distance result is 1 – 6 / [max(6,10)] = 1 – 6/10 = 0.4 is 40% similarity.

A variation to handle prefixes is Jaro – Winkler measure.

In real life this is a very common situation the strings have the same beginning (I bleieve you all konw tihs jkoe). Like beginnign, Luxmebogr, lenhgt, meausre. To take that into account and improve scoring, there is a measure

Jaro2.PNG

  • dw is Jaro-Winkler measure result
  • dj is Jaro distance – like we calculated before
  • l is the length of longest common exact matching prefix (often referred as max possible 4)
  • p is the weight given to the prefix (how important it is for us) (default use to be 0.1)

Another way to express the same formula is

  • jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW

where

  • PL = length of the longest common prefix
  • PW is a weight given to the prefix

Let’s calculate Jaro-Winkler for Luxmebogr and Luxembourg.

  • I’ll assume that prefix weight is recomended 0.1
  • Jaro distance (s1=9, s2=10, m=9, t=1) = 0.91 (91%)
  • Longest common exactly matching prefix is Lux

Jaro-Winkler distance is

  • 0.91 + (3×0.1x(1-0.91)) = 0.94 or 94% similarity (first formula)
  • (1-3×0.1)x0.91 + 3×0.1 = 0.94 or 94% similarity (second formula – of course, the same because formulas are equal, just written in another way)

Jaro-Winkler for Luxmeb and Luxembourg

Jaro was 0.81 or 81% similarity

0.81 + (3×0.1x(1-0.81)) = 0.87 or 87% similarity

Can’t believe it is so easy, can you?

There are many more string similarity measures and algorithms in the world, different time elapsed, different precision and a lot of adaptions for specialized cases like nucleotides.

However, if you have read all my string similarity posts, you should have a good foundation of the basic ones.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: