
Recently while trying to complete my goal of reading the entire Interwebs, I stumbled upon something I thought the rest of you might be intrigued by:
The Levenshtein Distance.
According to Wikipedia, it is "a metric for measuring the amount of difference between two sequences. The Levenshtien distance between two strings is given by the minimum number of operations needed to transform one string into the other."
An operation, in this case, is an insertion, deletion, or substitution of a single character.
It's named after
Vladimir Iosifovich Levenshtein, the Russian scientist who developed it in 1965.
Jeff, why should I care?
The reason you should be interested in this is because this is how spell-checkers generally recommend words for you. Have you ever wondered how they figure out what you meant?
So how's it work?
Let's look at example to see how this measurement works. Let's figure out the distance from the word "developer" to the word "development."
1) developer --> developen (substituted the 'r' for the 'n')
2) developen --> developmen (inserted the 'm')
3) developmen --> development (inserted the 't')
So, the Levenshtein distance for those two words is: 3
Are there any other special rules?
Why yes, there are. Here's a couple of things to keep in mind about this measurement:
1) It can't be less than the difference between the length of the two strings.
2) It can't be greater than the longer of the two strings.
3) It can only be zero if the strings are identical.
So what does this look like in code?
I've put together a quick little C# console application to demonstrate this logic.
You can download it here. Just change out the two strings I used above for anything.
So there you go!
Now you have a simple way to find words that are close matches to a user-entered string. Just zip through your term dictionary, picking out the words with the smallest distances. You'll probably find their match. Certainly there are many applications for this other than spell checking. Think about recommending customer names as a user tries to search for one. Or account numbers.
Let me know how YOU would use this!

Labels: jeff blankenburg, levenshtein distance
It worked quite well, and it's fairly simple to implement. I used tri-grams. Assuming you already know how to tokenize strings into a tri-gram, the algorithm is simple to implement.
The numbers I used that worked the best for my data set were from Jeremy Hylton's thesis: www.python.org/~jeremy/pubs/thesis/MIT-LCS-TR-678.ps.gz
Cheers,
Dave