Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:
Linguistic - ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
Identical URL locations are identified and canonicalized:
http://www.pgsql.ru/db/mw/index.html
http://www.pgsql.ru/db/mw/
http://www.pgsql.ru/db/../db/mw/index.html
Colour names are substituted by their hexadecimal values, e.g., red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
If indexing numbers, remove some fractional digits to reduce the range of possible values, so that 3.14159265359, 3.1415926, and 3.14 are the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program which accepts lexemes as input and returns:
an array of lexemes if the input lexeme is known to the dictionary
a void array if the dictionary knows the lexeme, but it is a stop word
NULL if the dictionary does not recognize the input lexeme
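For instance, using the ts_lexize function (described later in this section), the stop-word and unknown-word outcomes look like this. Here english_ispell is assumed to be installed as described further below; Ispell dictionaries are not available unless their files are present:

SELECT ts_lexize('english_stem', 'a');       -- known, but a stop word: void array
 ts_lexize
-----------
 {}

SELECT ts_lexize('english_ispell', 'qwzx');  -- unknown word: NULL (displayed blank)
 ts_lexize
-----------
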
Full text searching provides predefined dictionaries for many languages, and SQL commands to manipulate them. There are also several predefined template dictionaries that can be used to create new dictionaries by overriding their default parameters. Besides this, it is possible to develop custom dictionaries using an API; see the dictionary for integers (Section 12.9) as an example.
The ALTER TEXT SEARCH CONFIGURATION ADD MAPPING command binds specific types of lexemes to a set of dictionaries that process them. (Mappings can also be specified as part of configuration creation.) A lexeme is passed through a stack of dictionaries until some dictionary identifies it as a known word or it turns out to be a stop word. If no dictionary recognizes a lexeme, it will be discarded and not indexed. A general rule for configuring a stack of dictionaries is to place the narrowest, most specific dictionary first, then the more general dictionaries, and to finish with a very general dictionary, like the Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind lword (latin word) to a synonym dictionary of astronomical terms, a general English dictionary, and a Snowball English stemmer:
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR lword WITH astrosyn, english_ispell, english_stem;
The function ts_lexize can be used to test dictionaries, for example:
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}
(1 row)
Also, the ts_debug function (Section 12.8) is helpful for testing.
Stop words are words which are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
SELECT to_tsvector('english','in the list of stop words');
         to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
The missing positions 1, 2, and 4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:
SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','in the list of stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.5

SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','list stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
          1
It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while stemmers first check the list of stop words. The reason for the different behaviour is an attempt to decrease possible noise.
Here is an example of a dictionary that returns the input word as lowercase or NULL if it is a stop word; it also specifies the name of a file of stop words. It uses the simple dictionary as a template:
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
Now we can test our dictionary:
SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}
Caution: Most types of dictionaries rely on configuration files, such as files of stop words. These files must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.
This dictionary template is used to create dictionaries which replace a word with a synonym. Phrases are not supported (use the thesaurus dictionary (Section 12.4.3) for that). A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word 'Paris' to 'pari'. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary:
SELECT * FROM ts_debug('english','Paris');
 Alias | Description | Token |  Dictionaries  |    Lexized token
-------+-------------+-------+----------------+----------------------
 lword | Latin word  | Paris | {english_stem} | english_stem: {pari}
(1 row)

CREATE TEXT SEARCH DICTIONARY synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR lword WITH synonym, english_stem;

SELECT * FROM ts_debug('english','Paris');
 Alias | Description | Token |      Dictionaries      |  Lexized token
-------+-------------+-------+------------------------+------------------
 lword | Latin word  | Paris | {synonym,english_stem} | synonym: {paris}
(1 row)
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words which includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves them for indexing. Thesauruses are used during indexing so any change in the thesaurus requires reindexing. The current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus dictionary requires a configuration file of the following format:
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
where the colon (:) symbol acts as a delimiter between a phrase and its replacement.
A thesaurus dictionary uses a subdictionary (which is defined in the dictionary's configuration) to normalize the input text before checking for phrase matches. It is only possible to select one subdictionary. An error is reported if the subdictionary fails to recognize a word. In that case, you should remove the use of the word or teach the subdictionary about it. Use an asterisk (*) at the beginning of an indexed word to skip the subdictionary. It is still required that sample words are known.
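For example, a hypothetical entry whose replacement should bypass subdictionary normalization (the sample words on the left are still normalized) would look like this:

supernovae stars : *sn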
The thesaurus dictionary looks for the longest match.
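For example, given the hypothetical entries below (sne is an invented replacement used only for illustration), the input supernovae stars matches the longer, two-word sample and is replaced by sn, not sne:

supernovae : sne
supernovae stars : sn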
Stop words recognized by the subdictionary are replaced by a 'stop word placeholder' to record their position. To break possible ties the thesaurus uses the last definition. To illustrate this, consider a thesaurus (with a simple subdictionary) with pattern swsw, where s designates any stop word and w, any known word:
a one the two : swsw
the one a two : swsw2
Words a and the are stop words defined in the configuration of the subdictionary. The thesaurus considers the texts the one the two and that one then two as equal and will use definition swsw2.
Like any normal dictionary, a thesaurus dictionary can be assigned to specific lexeme types. Since a thesaurus dictionary has the capability to recognize phrases, it must remember its state and interact with the parser; it uses these assignments to check whether it should handle the next word or stop accumulation. The thesaurus dictionary must therefore be configured carefully. For example, if the thesaurus dictionary is assigned to handle only the lword lexeme, then a thesaurus dictionary definition like one 7 will not work, since lexeme type digit is not assigned to the thesaurus dictionary.
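As a minimal sketch of the fix, assuming a configuration mycfg and a thesaurus dictionary mythesaurus (both hypothetical names), the digit token type can be mapped as well, with simple as a fallback so digits that match no phrase are still indexed:

ALTER TEXT SEARCH CONFIGURATION mycfg
    ADD MAPPING FOR digit WITH mythesaurus, simple;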
To define a new thesaurus dictionary one can use the thesaurus template. For example:
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
Here:
thesaurus_simple is the thesaurus dictionary name
mythesaurus is the base name of the thesaurus file (its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory, often /usr/local/share).
pg_catalog.english_stem is the dictionary (Snowball English stemmer) to use for thesaurus normalization. Notice that the english_stem dictionary has its own configuration (for example, stop words), which is not shown here.
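To find the actual $SHAREDIR of a particular installation, the pg_config utility can be asked (the path shown here is just an example; it varies by installation):

$ pg_config --sharedir
/usr/local/share/postgresql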
Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:
ALTER TEXT SEARCH CONFIGURATION russian
    ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_simple;
Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:
supernovae stars : sn
crab nebulae : crab
Below we create a dictionary and bind some token types to the astronomical thesaurus and English stemmer:
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_astro, english_stem;
Now we can see how it works. Note that ts_lexize cannot be used for testing the thesaurus (see the description of ts_lexize below), but we can use plainto_tsquery and to_tsvector, which accept text arguments, not lexemes:
SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1
In principle, one can use to_tsquery if the argument is quoted:
SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'
Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition.
To keep an original phrase in full text indexing just add it to the right part of the definition:
supernovae stars : sn supernovae stars

SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'
The Ispell template dictionary for full text search allows the creation of morphological dictionaries based on Ispell, which supports a large number of languages. This dictionary tries to reduce an input word to its normalized form. More modern spelling dictionary formats are also supported: MySpell (OpenOffice.org versions before 2.0.1) and Hunspell (OpenOffice.org 2.0.2 and later). A large list of dictionaries is available on the OpenOffice Wiki.
The Ispell dictionary allows searches without bothering about different linguistic forms of a word. For example, a search on bank would return hits for all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.
SELECT ts_lexize('english_ispell','banking');
 ts_lexize
-----------
 {bank}

SELECT ts_lexize('english_ispell','bank''s');
 ts_lexize
-----------
 {bank}

SELECT ts_lexize('english_ispell','banked');
 ts_lexize
-----------
 {bank}
To create an Ispell dictionary one should use the built-in ispell template and specify several parameters:
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and stop-words files, which the server looks for in $SHAREDIR/tsearch_data/.
Ispell dictionaries usually recognize a restricted set of words, so they should be used in conjunction with another, broader dictionary; for example, a Snowball stemming dictionary, which recognizes everything.
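A sketch of such a stack, reusing the english_ispell and english_stem dictionaries defined elsewhere in this section:

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR lword WITH english_ispell, english_stem;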
Ispell dictionaries support splitting compound words, based on flags in the Ispell dictionary itself. Notice that the affix file should specify a special flag using the compoundwords controlled statement that marks dictionary words that can participate in compound formation:
compoundwords controlled z
Several examples for the Norwegian language:
SELECT ts_lexize('norwegian_ispell','overbuljongterningpakkmesterassistent');
 {over,buljong,terning,pakk,mester,assistent}

SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk');
 {sjokoladefabrikk,sjokolade,fabrikk}
Note: MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, full text searching implements only the basic compound word operations of Hunspell.
The Snowball dictionary template is based on the project of Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. Snowball now provides stemming algorithms for many languages (see the Snowball site for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate. (PostgreSQL's standard stopword lists are also provided by the Snowball project.) For example, there is a built-in definition equivalent to
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);
The Snowball dictionary recognizes everything, so it is best to place it at the end of the dictionary stack. It is useless to place it before any other dictionary because a lexeme will never pass through it to the next dictionary.
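This behavior is easy to observe with ts_lexize: a nonsense word for which no stemming rule applies is passed through rather than rejected (output as expected for the built-in english_stem):

SELECT ts_lexize('english_stem', 'zzzz');
 ts_lexize
-----------
 {zzzz}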
The ts_lexize function facilitates dictionary testing:
ts_lexize(dict_name text, lexeme text) returns text[]
It returns an array of lexemes if the input lexeme is known to the dictionary dict_name, a void array if the lexeme is known to the dictionary but is a stop word, or NULL if it is an unknown word.
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
Note: The ts_lexize function expects a lexeme, not text. Below is an example:

SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
 ?column?
----------
 t

The thesaurus dictionary thesaurus_astro does know supernovae stars, but ts_lexize fails since it does not parse the input text and considers it a single lexeme. Use plainto_tsquery and to_tsvector to test thesaurus dictionaries:

SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
A full text configuration specifies all options necessary to transform a document into a tsvector: the parser breaks text into tokens, and the dictionaries transform each token into a lexeme. Every call to to_tsvector() and to_tsquery() needs a configuration to perform its processing. To facilitate management of full text searching objects, a set of SQL commands is available, and there are several psql commands which display information about full text searching objects (Section 12.7).
The configuration parameter default_text_search_config specifies the name of the current default configuration, which is the one used by text search functions when an explicit configuration parameter is omitted. It can be set in postgresql.conf, or set for an individual session using the SET command.
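For example, it can be set cluster-wide in postgresql.conf, or per session with SET (pg_catalog.english is one of the predefined configurations):

# in postgresql.conf:
default_text_search_config = 'pg_catalog.english'

-- or per session:
SET default_text_search_config = 'pg_catalog.english';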
Several predefined text searching configurations are available in the pg_catalog schema. If you need a custom configuration you can create a new text searching configuration and modify it using SQL commands.
New text searching objects are created in the current schema by default (usually the public schema), but a schema-qualified name can be used to create objects in the specified schema.
As an example, we will create a configuration pg which starts as a duplicate of the english configuration. To be safe, we do this in a transaction:
BEGIN;

CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = english );
We will use a PostgreSQL-specific synonym list and store it in share/tsearch_data/pg_dict.syn. The file contents look like:
postgres    pg
pgsql       pg
postgresql  pg
We define the dictionary like this:
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict
);
Then register the ispell dictionary english_ispell using the ispell template:
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
Now modify mappings for Latin words for configuration pg:
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR lword, lhword, lpart_hword
    WITH pg_dict, english_ispell, english_stem;
We choose not to index or search some token types:
ALTER TEXT SEARCH CONFIGURATION pg
    DROP MAPPING FOR email, url, sfloat, uri, float;
Now, we can test our configuration:
SELECT * FROM ts_debug('public.pg', '
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
database management system, is now undergoing beta testing of the next
version of our software: PostgreSQL 8.3.
');

COMMIT;
With the dictionaries and mappings set up, suppose we have a table pgweb which contains 11239 documents from the PostgreSQL web site. Only relevant columns are shown:
=> \d pgweb
        Table "public.pgweb"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 tid    | integer           | not null
 path   | character varying | not null
 body   | character varying |
 title  | character varying |
 dlm    | date              |
The next step is to set the session to use the new configuration, which was created in the public schema:
=> \dF
 List of text search configurations
 Schema | Name | Description
--------+------+-------------
 public | pg   |

SET default_text_search_config = 'public.pg';
SET

SHOW default_text_search_config;
 default_text_search_config
----------------------------
 public.pg
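With the session now defaulting to public.pg, searches against pgweb can omit the configuration argument, for example (the query terms are chosen only for illustration):

SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('create & table')
ORDER BY dlm DESC LIMIT 10;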