Python – How to do fuzzy string search without a heavy database


I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that would find misspelled, mixed names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.

I achieved something similar with Lucene, but as PyLucene only wraps Java that would complicate deployment to end-users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings are like C++ and have some learning curve.

Whoosh is not yet well-documented but includes an abusable spell-checker.

What else is there?

Best Solution

Apparently the only way to make fuzzy comparisons fast is to do less of them ;)

Instead of writing another n-gram search or improving the one in Whoosh we now keep a word index, retrieve all entries that have at least one (correctly spelled) word in common with the query, and use difflib to rank those. Works well enough in this case.