Big Data: string similarity: best matching substrings between two strings (Smith-Waterman algorithm)


Previous methods I wrote about considered matching all characters to all characters between two strings (when entire string is aligned, it is called global alignment). As you see from our new challenge, this is not suitable in some cases:

can we somehow find out that Mr. Alexander Fitzgerald, University of Canterbury is similar to Professor, Dr.sc.comp Alexander Williams Fitzgerald from Canterbury, New Zealand?

Edit distance = 54 (https://planetcalc.com/1721/) – very long.

Needleman-Wunsch: -32 (http://bioinformaticnotes.com/Needle-Water/) similarity quite low.

Better idea: find two substrings of given strings that are the most similar one to another

(when parts of strings are compared, it is called local alignment). We can already guess the best matching one will be ‘Fitzgerald’ but computer must be provided with logic. Method I am blogging about was first proposed by Temple F. Smith and Michael S. Waterman in 1981.

The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are reset to zero, thus now a match can be restarted at any position, not only the very corner. It means if you learn to calculate Needleman-Wunsch, it is easy to adapt to Smith-Waterman. Just stay positive :)

The idea is simple, yet brilliant:

  1. add one more condition to Needleman – Wunsch: the lowest possible value is 0, so the result will never be negative:

Smith-Waterman4.PNG

2.  when bactracking for the alignment, just find the highest value and go up until you reach 0. Repeat searching for the next highest as many times you want.

  • Best match is AWGHE to AW_HE
  • Next best: HEA to HEA

Smith-Waterman5.PNG

(source http://docencia.ac.upc.edu/master/AMPP/slides/ampp_sw_presentation.pdf)

Let’s look on a sample trying to fit the picture in the screen:

  • PROFESSOR WILLIAM SMITH DR.SC.COMP CANTERBURY
  • DUDE WILL SMITH NEIGHBOUR

I believe, YOU see that best matching substrings are WILL and SMITH. But computer don’t have the brains as you have. By using Smith-Waterman measure, the computer can only calculate it by predefined algorithm.

You probably cannot see there but in this picture SMITH to SMITH max result in the bottom-right is 18. WILL to WILL is 10.

Smith-Waterman.PNG

(source: http://bioinformaticnotes.com/Needle-Water/)

Here comes the full table of original example – ALEXANDER, FITZGERALD, CANTERBURY:

Smith-Waterman2.PNG

Let’s add some foreign language for fun – you’ll see that also Smith  and Waterman were not magicians, however we can do a good guess with their algorithm:

  • DR.ALEKSANDRS VILJAMS FICDŽERALDS KENTERBERIJA
  • MR. ALEXANDER FITZGERALD CANTERBURY

and we see most similar substrings:

  • KENTERBERI-CANTERBURY
  • ERALDSKENTERB-ERALD_CANTERB
  • FI-FI
  • ALEKSANDRS VI – ALEX_ANDER FI

Smith-Waterman3

When you google, you’ll see a lot of bioinformatics and related samples TGCTACCGTAA….. I prefer a bit more human readable examples.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: