F.23. pg_trgm

The pg_trgm module provides functions and index classes for determining the similarity of text based on trigram matching.

F.23.1. Trigram (or Trigraph)

A trigram is a set of three consecutive characters taken from a string. A string is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams that comprise the string.

eg. The set of trigrams in the word "cat" is " c", " ca", "at " and "cat".

F.23.2. Public Functions

Table F-20. pg_trgm functions

FunctionDescription
real similarity(text, text)

Returns a number that indicates how closely matches the two arguments are. A zero result indicates that the two words are completely dissimilar, and a result of one indicates that the two words are identical.

real show_limit()

Returns the current similarity threshold used by the '%' operator. This in effect sets the minimum similarity between two words in order that they be considered similar enough to be misspellings of each other, for example.

real set_limit(real)

Sets the current similarity threshold that is used by the '%' operator, and is returned by the show_limit() function.

text[] show_trgm(text)

Returns an array of all the trigrams of the supplied text parameter.

Operator: text % text (returns boolean)

The '%' operator returns TRUE if its two arguments have a similarity that is greater than the similarity threshold set by set_limit(). It will return FALSE if the similarity is less than the current threshold.

F.23.3. Public Index Operator Class

The pg_trgm module comes with the gist_trgm_ops index operator class that allows a developer to create an index over a text column for the purpose of very fast similarity searches.

To use this index, the '%' operator must be used and an appropriate similarity threshold for the application must be set. Example:

CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
  

At this point, you will have an index on the t text column that you can use for similarity searching. Example:

SELECT
        t,
        similarity(t, 'word') AS sml
FROM
        test_trgm
WHERE
        t % 'word'
ORDER BY
        sml DESC, t;
  

This will return all values in the text column that are sufficiently similar to 'word', sorted from best match to worst. The index will be used to make this a fast operation over very large data sets.

F.23.4. Text Search Integration

Trigram matching is a very useful tool when used in conjunction with a full text index.

The first step is to generate an auxiliary table containing all the unique words in the documents:

CREATE TABLE words AS SELECT word FROM 
        stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
  

where documents is a table that has a text field bodytext that we wish to search. The use of the simple configuration with the to_tsvector function, instead of just using the already existing vector is to avoid creating a list of already stemmed words. This way, only the original, unstemmed words are added to the word list.

Next, create a trigram index on the word column:

CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
  

or

CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
  

Now, a SELECT query similar to the example above can be used to suggest spellings for misspelled words in user search terms. A useful extra clause is to ensure that the similar words are also of similar length to the misspelled word.

Note: Since the words table has been generated as a separate, static table, it will need to be periodically regenerated so that it remains up to date with the document collection.

F.23.5. References

GiST Development Site http://www.sai.msu.su/~megera/postgres/gist/

Tsearch2 Development Site http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

F.23.6. Authors

Oleg Bartunov , Moscow, Moscow University, Russia

Teodor Sigaev , Moscow, Delta-Soft Ltd.,Russia

Documentation: Christopher Kings-Lynne

This module is sponsored by Delta-Soft Ltd., Moscow, Russia.