Big Data: Phonetic Similarity: Soundex – words are similar if they sound the same


I guess you have seen surnames like Meier and Mayer, or Smith, Smyth, Smithe, Smeth, Smeeth. Any of these might be a correct surname or a misspelt one: if you dictated them to me over the phone, who knows what I’d write.

So far I have been blogging about similarity algorithms based on how strings are written. Today let’s discuss finding a match when words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions, and today’s topic is

Soundex for the English language.

Soundex calculates a four-character code from a word based on its pronunciation and considers two words similar if their codes are equal. The idea is that similar-sounding letters are assigned the same Soundex code. It is widely used in genealogy and in archives, for searching for ancestors, relatives, families and heirs.

  1. The first character is the starting letter of the word. (A variation called “Reverse Soundex” prefixes the last letter instead of the first.)
  2. Drop all other occurrences of a, e, i, o, u, y, h, w.
  3. Replace consonants after the first letter with digits as follows:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  4. If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, then omit all but the first.
  5. Return the first four characters, padded with 0 if the code is shorter (for example, ‘Ahoi’ gets the code A000 and ‘Dude’ gets D300; the code is always four characters long).
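
The steps above can be sketched in a few lines of Python. This is only a minimal illustration of the rules as listed here, not Oracle's built-in implementation; the function name `soundex` and the choice to ignore anything that is not a letter a–z are my assumptions.

```python
# Letter groups from step 3, mapped to their digits.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return a four-character Soundex code, or "" if the word has no letters a-z."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()          # step 1: keep the first letter
    prev = CODES.get(letters[0], "")   # its digit still counts for the adjacency rule
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != prev:          # step 4: skip repeats of the same digit
                code += digit
            prev = digit
        elif ch not in "hw":           # a vowel or y separates equal digits,
            prev = ""                  # while h and w do not
    return (code + "000")[:4]          # step 5: pad with 0, keep four characters


for name in ("Ahoi", "Dude", "Meier", "Mayer", "Smith", "Smyth"):
    print(name, soundex(name))
```

With these rules Meier and Mayer both get M600, and Smith and Smyth both get S530, so each pair counts as a match.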

Let’s have an example set – surnames. I used Oracle RDBMS this time.

[Image: Soundex_table.PNG]

Now let’s compare similarities using three methods: Edit distance, Jaro-Winkler and Soundex.

[Image: Soundex_select.PNG]

And here are the results. Notice the combinations we get: if I set the similarity threshold for Edit distance and Jaro-Winkler to 50%, the three methods agree in some cases and disagree in others, including false positives and false negatives:

  • all three methods match – like ‘Mirhe’
  • Jaro-Winkler and Soundex match, but Edit distance doesn’t – like ‘Meiyar’
  • Jaro-Winkler matches but Soundex doesn’t – like ‘Mayes’
  • Edit distance and Jaro-Winkler match but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’

[Image: Soundex_results.PNG]
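
To make the 50% threshold concrete, here is a minimal Python sketch of an edit-distance similarity. The function names and the normalisation (dividing by the length of the longer word) are my assumptions for illustration. The Oracle functions behind the screenshots may normalise or round differently, so do not expect identical percentages.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))        # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity in percent; normalising by the longer word is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - edit_distance(a, b)) / longest


THRESHOLD = 50.0  # the threshold used in this post
for a, b in (("Meier", "Mayer"), ("Smith", "Smyth")):
    score = edit_similarity(a, b)
    print(a, b, score, score >= THRESHOLD)
```

Under this normalisation Meier and Mayer differ by two substitutions out of five letters, which gives 60%, so the pair passes a 50% threshold.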

You see, Soundex is not a silver bullet and, as I keep writing, we must try and test, test and try.

I’ll show you one more weakness of Soundex:

[Image: Scwarz.PNG]

Of the three approaches I used, Soundex is the only one that did not find a similarity :)

[Image: Scwarz_res]
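
To see where the codes drift apart, here is a small Python sketch that traces what happens to each letter. I am assuming the two spellings quoted in the Metaphone check further down; the screenshots may have used different ones. `soundex_trace` is a hypothetical helper written for this illustration, not an Oracle function.

```python
# Digit for each consonant group, as in the Soundex rules listed earlier.
DIGITS = {ch: d
          for d, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
          for ch in group}


def soundex_trace(word):
    """Return (code, steps); steps says what happened to every letter."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return "", []
    code = letters[0].upper()
    prev = DIGITS.get(letters[0])
    steps = [(letters[0], "kept as first letter")]
    for ch in letters[1:]:
        digit = DIGITS.get(ch)
        if digit is None:
            if ch not in "hw":     # vowels and y reset the "same digit" memory
                prev = None
            steps.append((ch, "dropped"))
        elif digit == prev:
            steps.append((ch, f"skipped, repeats {digit}"))
        else:
            code += str(digit)
            prev = digit
            steps.append((ch, f"coded {digit}"))
    return (code + "000")[:4], steps


for name in ("Schwarzenegger", "Schvartzeneger"):
    code, steps = soundex_trace(name)
    print(name, code, steps[:7])
```

The fourth letter decides it: w is simply dropped, while v is coded 1, so the codes become S625 and S163 and can no longer be equal.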

Some Soundex variants

  • The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
  • Daitch–Mokotoff Soundex (D–M Soundex) is an adaptation for Jewish surnames of Germanic or Slavic origin, sometimes referred to as “Eastern European Soundex”. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999, and the calculation is much more complex than Soundex.
  • Metaphone, Double Metaphone, Metaphone 3. Powerful and customisable rule set, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations.

I googled an online Metaphone calculator. They say it’s more accurate than Soundex, but I can hardly agree:

  • The metaphone code for Schwarzenegger is SXWRSNKR.
  • The metaphone code for Schvartzeneger is SXFRTSNJR.
  • These surnames do not have the same metaphone code.

Then I tried one of my Soundex matches and, again, no luck:

  • The metaphone code for Meiyar is MYR.
  • The metaphone code for Mire is MR.
  • These surnames do not have the same metaphone code.

I was also searching for a Latvian edition of Soundex, as I am quite sure one exists. I found this: http://www.lzp.gov.lv/images/stories/dokumenti/Zin_rezult_2008.pdf

In 2008 a flexible database model for a universal lexicon system was developed and refined. It provides a unified infrastructure (a shared indexing and feedback mechanism, etc.) and functionality (advanced entry search, configurable display, etc.) for all dictionaries hosted in the database, regardless of their entry schemas. With regard to indexing and search, the Soundex algorithm was adapted to the Latvian language so that words entered inexactly but similar in pronunciation can be found. (A. Spektors)

P.S. For those who also read Latvian: here is a small and lovely forum post about someone trying to build a search suggestion tool (“did you mean XXYZZX?”) https://exs.lv/say/16261/1441002-so-nakti-pavadiju-veidojot

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.
