Big Data: one of string similarity measures: EDIT DISTANCE (or Levenshtein distance)


Have you ever thought how does spellchecker, search engine, translation, speech recognition etc. software find replacement options for the word you entered?

User typed:

Crts

What would be YOUR software action? Correct to

  • Arts?
  • Cats?
  • Carts?
  • Cuts?
  • Cryptos?
  • Crots?
  • Rats?
  • Crtl?
  • CRC?

I hope, you’ll answer me – hey, it depends! Of course, it depends :) but how does your software know it? Is there some magic? Ah, no, only logic…

Here are some examples again: Google; Word + language English, Word + language Latvian

Crts-googleCrts-word-lvCrts-word-en

Have you ever wondered how are the data found which refer to the same real-world objects? Schwarzenegger, Svarceneger, Švartcnegers, Švarcenegers – why is the searching software so smart and understands that we are looking for the old, good Arnold?

Svarcene-google.png

Similarity measures

One of similarity measures – not the only one! – is Levenshtein distance, named after the Russian scientist Vladimir Levenshtein, often called also EDIT DISTANCE or minimum edit distance, the operation count needed to convert one phrase into the other. If you have enough patience to read this post, you will be able to calculate the Levenshtein distance yourself or at least understand jow the online calculators work.

The smaller is edit distance, the higher is the chance that words are similar.

NB: outside this blog entry there are other measures – Needleman-Wunch, Smith-Waterman, Jaro, Jaro-Winkler, Affine gap etc.

Have you ever played to see when Google switches to offering other words and trying to guess the reason?

Arnold_google

Change a bit and have different set:

Ceneger-google

Have you been playing with spellchecker options? I enjoy guessing algorithms and logic behind.

Despacito-spellcheck

Espacito-spellcheck

If you work in data integration area, this type of tasks might be one or your “business as usual” tasks – recognizing that customer “David Smith” is the same as “David R. Smith” and “Davod Smit”. Are addresses “K. Barona 117”and “117, Krisjana Barona street” the same?

The general term is string matching challenge.

Look, different software, but the option offered to Picadilla is the same :)

The minimum edit distance is the minimum number of editing operations (insert, delete, replace) to transform one string into the other (doing that vice versa, the second string will be converted back to the first).

Edit Operations

Deletion, insertion, and replacement (or sometimes called substitution) operations of characters may have assigned different weights. The usual choice is to set all three weights to 1, however different weights allow more flexible search strategies in lists of words. For example, you may consider that substitution costs 2. Or you may prefer insertion to deletion, by setting insert weights 1, delete weights 2.

  • Insertion weight = 1
    • Cts -> Cats : edit distance = 1
    • Cts -> Cites : edit distance = 2
    • Cts -> Cities : edit distance = 3
    • Cts -> Citizens : edit distance = 5
  • Deletion weight = 1
    • Carts -> Cats : edit distance = 1
    • Carts -> Car : edit distance = 2
    • Carts -> as : edit distance = 3
    • Carts -> a : edit distance = 4
  • Substitution weight = 1
    • Crts -> Cats : edit distance = 1
    • Crts -> Cups : edit distance = 2
    • Crts -> Avio : edit distance = 4

If we change weight – eg, by doing that we might rank up similar words where only delete required:

  • Substitution weight = 2 (consider it as equal to insert + delete)
    • Crts -> (insert a, delete r) Cats : edit distance = 2
    • Crts -> (insert up, delete rt)Cups : edit distance = 4
    • Crts -> (insert Avio, delete Crts) Avio : edit distance = 8

Full replacement – too easy for us

What we can see quite obvious, is that full replacement works always.

  • QQQ (length 3) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 3 deletes + 17 inserts = 20 operations
  • PSYCHO (length 6) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 6 + 17 = 23. But I see some characters are the same!
  • ROTORON (length 7) -> SYNCHRO PHAZOTRON (length 17) max edit distance is 7 + 17 = 24. As we see, all characters are the same, is it really 24 operations necessary?
  • STUDENT (length 7) -> TUDENTS (length 7) max edit distance is 7 + 7 = 14. Wait, it can’t be true, we just need to move one letter! Does the computer know this?

It means, if we need just to change strings, no worries, delete + insert (overwrite) and forget edit distance at all.

Add value by understanding

If we are going to add value, let’s understand the minimum edit distance. It is not just ANY sequence of operations to get the desired result – transformed one sting to another. This is THE MINIMUM number of operations needed. How can we find it? I’ll show you both guessing method and scientific method.

Let’s start with a very simple example. What’s the edit distance between

  • ACE
  • BASE

I believe you see the answer right now and say – only two operations needed, insert B and replace C to S. I just wanted to start with this very simple example because scientists for several decades were trying to find out the faster way how to find the edit distance, without calculating values for the whole matrix. They just believed – as I did – there should be a faster way instead of calculating the whole matrix! You will find the sad truth at the very end of this post.

So, maximum operations needed would be 7 (delete 3, insert 4). But we are looking for the minimum. To find it in a scientific way, we must fill a matrix, table cell by cell. This matrix size is length of phrase ACE multiplied by length of phrase BASE, in our case 3*4 = 12 cells.

By the way, here you see quadratic time concept: when the length of phrase increases a little, the number of cells in matrix increase much more lot.

If our second phrase will change by 1 symbol from BASE to BASIC, we will have 3*5=15 cells. 1 symbol, 3 new cells.

If by 4 to BASEMENT, then 3*5=24 cells. 4 symbols, 12 new cells. Imagine that now with larger phrases up to thousands… billions of symbols like in DNS sequences.

To find the minimum edit distance, we must fill the table cell by cell for three operations

  • INSERT operation, we’ll consider edit cost (weight) as 1
  • SUBSTITUTE (replace), we’ll consider edit cost (weight) as 0, if letters are equal, and 1, if letters are different
  • DELETION, we’ll consider edit cost (weight) as 1

Let’s fill the table

So, we draw a table and first we fill so called base columns with operations count

  • Transform empty string into string 1, in our case ACE (create string from nothing)
  • Transform string 2, in our case BASE into empty string (erase string)

Then we follow rules, filling the other cells:

  • If letters are not equal, we pick the minimal value of three cells: upper (insert), diag (left-upper (substitution)), left (delete) and add 1 operation (see three blue cells example for green cell)
  • If letters are equal, we copy over substitution cost from left-upper cell.

 

ACE-BASE-Table1

ACE-BASE-Table2.PNG

Let’s look at another, more cats and zoo related example

Are these words similar? when user typed Picadilla, shall we offer to search for similar pets like Crocodile? a la ‘you might consider searching crocodile’, Did you mean pizza delivery’? Did you mean Piccadilly Circus?

Which is more like PICADILLA:

  • CROCODILE or PICCADILLY?
  • Or CACTUS? Is Cactus similar at all?

Let’s start investigation, mr. Inspector Caps, if the words PICADILLA and CROCODILE are similar at all.

  • As we can set the rules ourselves, we will consider the strings are similar if minimum edit distance is less than 70% of string average length, in our case less than 70% from 9 is between 0..6. It is obvious for US the distance is not 0, because strings are not similar. But computer still needs to find it by following predefined logic.
  • We will also use the official similar measure value s(x,y) = 1 – d(x,y) / [max(length(x), length(y))]

We, humans, face there many options now in front of us to transform the words. As I wrote, we always might delete one string and insert another. But we will not do this. We will try think smart. So. Hmm, should we start as

  • substitute P -> C
  • substitute I -> R
    • CRCADILLA

Or maybe better start

  • Delete P
  • Delete I
    • CADILLA

Hmm. Ok. Let’s do it one way:

PICADILLA

CROCODILE

  • substitute P -> C (1 operation) CICADILLA
  • substitute I -> R (1 operation) CRCADILLA
  • substitute C -> O (1 operation) CROADILLA
  • substitute A -> C (1 operation) CROCDILLA
  • substitute D -> O (1 operation) CROCOILLA
  • substitute I -> D (1 operation) CROCODLLA
  • substitute L -> I (1 operation) CROCODILA
  • hurrah, L = L
  • substitute A -> E (1 operation) CROCODILE

Totals: 8 operations.

We see 8 is enough. But can we say 8 is the minimum edit distance?

Let’s try another way.

PICADILLA

CROCODILE

  • delete P (1 operation) ICADILLA
  • delete I (1 operation) CADILLA
  • hurrah, C = C
  • insert R (1 operation) CRADILLA
  • substitute A -> O (1 operation) CRODILLA
  • insert C (1 operation) CROCDILLA
  • insert O (1 operation) CROCODILLA
  • delete L (1 operation) CROCODILA
  • substitute A -> E (1 operation) CROCODILE

Heh, again 8. But I am quite sure there MUST be shorter distance. You know, intuition etc.

Let’s think.

Let’s see the letters we have:

PICADILLA

CROCODILE

Length of words is equal (9).

I have to do something with 6 outstanding letters PIAILA or CROOIE. Why did I have distance 8 before? Because I was not thinking at all.

Now:

  • substitute P -> I (1 operation) CICADILLA
  • substitute I -> R (1 operation) CRCADILLA
  • insert O (1 operation) CROCADILLA
  • substitute A -> O (1 operation) CROCODILLA
  • substitute L -> E (1 operation) CROCODILEA
  • delete A (1 operation) CROCODILE

Voila! We did it in distance 6. You might ask is it that simply – edit distance equals unmatching letters? No, it is not as simple, unfortunately.

Proof

And now let’s fill The Matrix similarly we did before to prove if we are correct.

Pica-croco-table1.PNG

Pica-croco-table2.PNG

Voila!

  • By our voluntarily definition the words PICADILLA and CROCODILE are similar
  • By official s(x,y) = 1 – d(x,y) / [max(length(x), length(y))] : s(PICADILLA,CROCODILE) = 1 – 6 / [max(9, 9)] = 1 – 6/9 = 0.33 , ahemm, not too similar….

As you previously saw, I found the shortest way in third guessing attempt because I did on my intuition. Computers have no intuition. Can I afford this type of guessing if I must transform long strings, eg, billions of symbols long?

I played a bit with online distance calculators.

  • Dillapica -> Crocodile, distance = 8
  • Alpilidac -> Crocodile, distance = 9

the more of matching letters are not in the same sequence, the longer distance is. In our case matching letters are CDL, so when I put them in opposite order I get the longer distance, in our case max long I could achieve was 6 + 3 = 9.

Hmm, it would be interesting to write a code for fun which calculates longest possible edit distance between two strings when the letters are in different order :)

Is word CACTUS similar to PICADILLA?

Cactus-Picadilla.PNG

1) Remember, we voluntarily defined similarity as distance is less than 70% of string average length, in this case 70% of (6 + 9)/2 = 5.25, so we will consider the words similar, if minimum edit distance is 0..5. We see the distance is 7 – not similar.

2) By official s(x,y) = 1 – d(x,y) / [max(length(x), length(y))] : s(PICADILLA,CACTUS) = 1 – 7 / [max(9, 6)] = 1 – 6/9 = 0.22 again, not similar.

Hamming distance

One of metrics, slightly similar to Edit distance, is between two strings of equal length is the number of positions at which the corresponding symbols are different. Other words to describe, are – minimum number of substitutions (replace) to transform one string to the other. Online calculator to play: http://valjok.blogspot.com/2015/08/hamming-distance-online.html

Hamming distance for {PICADILLA,CROCODILE} is 8. (Edit distance is 6)

Conclusions:

  • words PICADILLA and CROCODILE are more similar then CACTUS and PICADILLA by any of measures we used
  • we may define our rules where words PICADILLA and CROCODILE are considered similar. Following the same rules, words CACTUS and PICADILLA are not similar, but, if we have a business need, we could define even a measure boundaries where these words are considered being similar.

And now the sad truth: Arturs Backurs, absolvent of LU CS Faculty and MIT graduate proved in this thesis that For 40 years, computer scientists looked for a solution that doesn’t exist.

“For 40 years, computer scientists have tried in vain to find a faster way to do an important calculation known as “edit distance.” Thanks to groundbreaking work from two researchers at MIT, they now know the reason they’ve continually failed is because a faster method is actually impossible to create.”

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

One response to this post.

  1. […] « Big Data: one of string similarity measures: EDIT DISTANCE (or Levenshtein distance) […]

    Patīk

    Atbildēt

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: