Friday 13 October 2017

Levenshtein Prefilter

This code will take a word and create a feature vector from it by simply using the letter frequency. It then finds the nearest matches (using euclidean distance dropping the square root) prior to performing a levenshtein distance calculation. The original plan was to completely replace the levenshtein distance but it did not work out, the words it matched were not a good selection. I tried various feature vectors (number of syllable, length etc..) but the letter frequency was as goods as anything else and took far less computation. The code is hardwired to a levenshtein distance of 2. In terms of performance the longer the search term the more time the prefilter saves. Doing some ad hoc testing I found that a four character term it is slightly faster (maybe 50% quicker) and for a length over 10 it starts being around 11 times. One thing to note is that the prefilter can sometimes remove too many terms. You can increase the maximum distance from 6 to 8 to reduce the filter rate (the parameter passed to IsCloseEnough()) but there is an obvious performance penalty.

I've uploaded the code to git hub, it requires CMake and GNU C++ (min version 4.8).

https://github.com/eskeels/fuzzywordsearch

No comments:

Post a Comment