I have been working on doing some fuzzy searching on PostgreSQL for a while, and thought I'd share some of my experiences here. Many people are not yet aware of the incredible fuzzy search capabilities of PostgreSQL, especially that were added in the 9.1 release.
Most people use an external engine such as Lucene to do fuzzy searches. While, as you will see, there are still reasons to do so, they are much fewer now than they used to be.
While PostgreSQL has lots of different fuzzy search options (soundex, levenshtein, etc.), the one which has the most support is trigram searching. What is a trigram? Trigrams break a word up into 3-letter sequences, which can then be used to do similarity matches. Think of the word "hello". It has the following trigrams: "h", "he", "hel", "ell", "llo", "lo", and "o". Now look at the word "hallo". It has the following trigrams: "h", "ha", "hal", "all", "llo", "lo", and "o". You can see that there are several trigrams that match, and several that don't match. Similarity is computed via cosine similarity (which I don't totally understand), but, basically, the more trigrams you have in common the closer the match. In the present case, 'hallo' and 'hello' have a similarity of 0.333333. You can see this in PostgreSQL by saying:
select similarity('hello', 'hallo')
PostgreSQL also has two other relevant operators that use similarity - the "<->" operator and the "%" operator. "<->" is the "distance" operator, which is simply one minus the similarity (i.e. if the similarity is 0.2 the distance will be 0.8). For text searching, "%" is the "similar" operator, which returns true if two strings are similar, or false if they are not. Two strings are defined as "similar" if their similarity is 0.3 or greater. This can be set with set_limit(), but it is not usually useful to do so (we will deal with custom similarities later). The "%" operator is important, because of its heavy use in indexing.
The trigram module (pg_trgm) is not installed by default in PostgreSQL, but is one of their standard (and supported!) extensions. To install, you need to make sure you build the module in the source code. To do this, go into the contrib/pg_trgm directory of the source code and do a "make; make install". This will install the plugin, but will not activate it. Then, in any database you want to use this plugin, put in the command:
CREATE EXTENSION pg_trgm;
This will load the necessary SQL to install the functions and operators needed to make trigrams work. If you type that command in the template1 database, then all future databases created on that system will have the extension installed.
The simplest use of trigrams would be to have a table with a name column, and do a search like this:
SELECT * FROM people WHERE lastname % 'Schmoe'
This will return a list of people where the trigram similarity is greater than the default threshold. If you want to specify the similarity you are looking for, you can change that to:
SELECT * FROM people WHERE similarity(lastname, 'Schmoe') > 0.5
This will give a less-tolerant match than the default.
The problem with doing this is that it takes a lot of processing power. The reason people usually use databases is because they have a huge number of records and need to be able to search them quickly. If we have 10,000 records, manually doing trigram matches will take quite a bit of time.
So, one of the best features of PostgreSQL 9.1 is the ability for PostgreSQL to incorporate trigrams into an index. So, to make an index to speed up our search, do:
CREATE INDEX people_lastname_trigram_idx ON people USING gist(lastname gist_trgm_ops);
This will create a "gist" index, which is a special type of index which is often used in spacial and other vector-oriented matching. The "gist_trgm_ops" is a little bit of magic that tells gist how to use a text field as a vector of trigrams.
Now, your query will use the index, provided you have enough rows to make it worthwhile. HOWEVER, it is not yet using the index efficiently. While PostgreSQL can (and will) use the index for this query, it doesn't actually check the condition until a later step (I believe, but am not sure, that it is using the index to get everything in the right order, and then at a later stage doing filtering). For large tables, this can still be several orders of magnitude worse than you want. To fix this, you MUST use the "%" operator. This will cause the index to have an "index condition", which it will use to limit the results that it puts out. So, if you have:
SELECT * FROM people WHERE lastname % 'Schmoe';
This will use the index AND the index condition, and will be super-fast (except in extremely large datasets - see next section). However, this uses the built-in cutoff point of 0.3 similarity. What if I want my results to be tighter than that? The best way is to combine % with similarity(), like this:
SELECT * FROM people WHERE lastname % 'Schmoe' AND similarity(lastname, 'Schmoe') > 0.5;
In this case, the database will use the "%" condition for the index, which will be incredibly fast, and the "similarity()" call as a post-index filter, which, since there is a much smaller dataset, will still be pretty fast.
I had trouble working with trigrams because my main dataset is about 14 million titles. At this size, trigram searching is really slow. With the whole database loaded in-memory (either in PG's cache or in the filesystem cache - I have many, many gigs of memory), a trigram search of the database USING THE INDEX took between 3 and 12 seconds, depending on the string. Poking around, I came out with the following basic results for a database on an Amazon EC2 4XL instance without any other load:
So, as you can see, the useful limit to these queries is about 1 million records. Therefore, if you have more than 14 million rows, if you want to do trigrams, you need to find a way to boost performance. I've found there are two basic methods you can use to speed up trigrams on large datasets:
#1 is much preferred if you can do it, because it is an entirely database-oriented method. Basically, what I did, is I figured out that on my queries I only needed to trigram search on about 1 million of my 14 million records. Therefore, I specialized the index by adding a "WHERE" clause to the index itself. On my database, I have a "type" field, and I'm only searching certain types of records. Therefore, I changed my index to the following:
CREATE INDEX people_lastname_restricted_trigram_idx ON people USING gist(lastname gist_trgm_ops) WHERE type IN ('my_type1', 'my_type2');
This meant that only about a million records were being stored in the index, which brought down my search time considerably. Another search that I was doing was based on the person's "popularity", in which case, I added another index for popularity being greater than 20. This restricted the set down to about 100,000 records, which made it blazingly fast.
The other alternative (#2) I mentioned is to create a separate table of words which is searched. The idea is that while you may have 14 million records, there probably aren't 14 million words. Therefore, you create a table using unique words from the database, and then add a trigram index on that. The problem, though, is that this takes either (a) a lot of manual work, or (b) a lot of triggers, none of which is the way I like to administer databases. Nonetheless, it can work if you are merely wanting to use trigrams to fuzzy-match individual words. It can be especially helpful in conjunction with PostgreSQL's full-text search (previously known as tsearch2), which still operates blazingly-fast on 14 million records. Fuzzy string matching on words can be combined with full-text search on records to present a powerful and fast method of searching. The wordlist can be fuzzy-matched to provide better hints at what words are in the fulltext.
To create a wordlist, do something like this:
CREATE TABLE search_words (word text);
CREATE INDEX search_words_word_trigram_idx ON search_words USING gist(word gist_trgm_ops);
Then, every time you need to regenerate your wordlist, do the following:
DELETE FROM search_words;
INSERT INTO search_words (word) SELECT DISTINCT lower(word) FROM (SELECT unnest(regexp_split_to_array(lastname, E'\\\\W+')) AS word FROM people) words WHERE word IS NOT NULL AND length(word) > 3;
What that does is explode the lastname field out into words (in case a lastname is multi-word), where each word is its own record. It also filters out excessively short words (optional). It then inserts these into the search_words table, which has a trigram index. So, now, if you want to fuzzy-match individual words, you can use the search_words table!
Anyway, additional fun can be had by using unaccent to remove accents before all of this processing, or other fun to massage your data however you feel is best. The important thing is that, for all but the most strenuous data sets, we can now keep all of the searching located within the database itself, managed by database tools, without having to rely on external search engines such as Lucene, which take time, management, machine power, and their own sets of headaches. Lucene itself often has to be hand-rigged for large datasets even. I am a big believer in pushing as much data work to the database as possible, and PostgreSQL keeps pushing the bar higher!