<%@ Page MasterPageFile="~/Blankenthoughts.master" Title="Blankenthoughts: Obscure Knowledge: The Levenshtein Distance" %>

Monday, August 11, 2008

Recently while trying to complete my goal of reading the entire Interwebs, I stumbled upon something I thought the rest of you might be intrigued by: The Levenshtein Distance.

According to Wikipedia, it is "a metric for measuring the amount of difference between two sequences. The Levenshtien distance between two strings is given by the minimum number of operations needed to transform one string into the other."

An operation, in this case, is an insertion, deletion, or substitution of a single character.

It's named after Vladimir Iosifovich Levenshtein, the Russian scientist who developed it in 1965.

Jeff, why should I care?

The reason you should be interested in this is because this is how spell-checkers generally recommend words for you. Have you ever wondered how they figure out what you meant?

So how's it work?

Let's look at example to see how this measurement works. Let's figure out the distance from the word "developer" to the word "development."

1) developer --> developen (substituted the 'r' for the 'n')
2) developen --> developmen (inserted the 'm')
3) developmen --> development (inserted the 't')

So, the Levenshtein distance for those two words is: 3

Are there any other special rules?

Why yes, there are. Here's a couple of things to keep in mind about this measurement:

1) It can't be less than the difference between the length of the two strings.
2) It can't be greater than the longer of the two strings.
3) It can only be zero if the strings are identical.

So what does this look like in code?

I've put together a quick little C# console application to demonstrate this logic. You can download it here. Just change out the two strings I used above for anything.

So there you go!

Now you have a simple way to find words that are close matches to a user-entered string. Just zip through your term dictionary, picking out the words with the smallest distances. You'll probably find their match. Certainly there are many applications for this other than spell checking. Think about recommending customer names as a user tries to search for one. Or account numbers.

Let me know how YOU would use this!

kick it on DotNetKicks.com

Labels: ,

posted by Jeff Blankenburg, 3:51 PM

5 Comments:

I wrote an app several years ago that uses n-grams and computes the Euclidean distance to determine if database records were duplicates. I only had strings to compare - no concrete information like an ID, address, etc. that would help.

It worked quite well, and it's fairly simple to implement. I used tri-grams. Assuming you already know how to tokenize strings into a tri-gram, the algorithm is simple to implement.

The numbers I used that worked the best for my data set were from Jeremy Hylton's thesis: www.python.org/~jeremy/pubs/thesis/MIT-LCS-TR-678.ps.gz

Cheers,
Dave
commented by Blogger David, 6:54 PM  
Jeff,

We recently used a variant of the Levenshtein distance for strings and a Soundex implementation to help prevent the duplicate entry of patient information in a medical system. It is kind of a "fuzzy" match that can be tuned to be quite helpful.
commented by Blogger Bruce, 8:59 PM  
Soundex is not appropriate for matching character based data. It is only useful for matching sounds.
The Levenshtein Distance is more appropriate. I've had a lot of success with an n-gram method I wrote 10 years ago:
http://sqlblindman.googlepages.com/fuzzysearchalgorithm

--sqlblindman
commented by Anonymous Anonymous, 9:44 PM  
Great post Jeff! I always wondered how a spell checker determined what words to recommend.
commented by Blogger Shawn Domer, 8:32 AM  
I stumbled across this a while ago and had a short-lived obsession with it. This algorithm is WAY cool and I keep looking for a spot to use it but just haven't been able to find the right situation. Great post.
commented by Blogger Darrell Hawley, 9:43 AM  

Add a comment

January 2006
February 2006
March 2006
April 2006
May 2006
June 2006
July 2006
November 2006
December 2006
January 2007
February 2007
April 2007
May 2007
June 2007
July 2007
August 2007
September 2007
October 2007
November 2007
December 2007
January 2008
February 2008
March 2008
April 2008
May 2008
June 2008
July 2008
August 2008
September 2008
Codestock 2008 - Photos In Review
Streaming Live From Codestock!
NBC Olympics - In Silverlight 2!
Top 10 Things New Twitter Users Need To Know To Ge...
I hate Vista. I'm moving to Windows Mojave.
RSS is not pronounced Arse...
Mash Up Your Data...err...That's "Mesh"
Extraordinary Nut Snack
Live Messenger Punishment
CAPTCHA The Flag