To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the tsquery. Full text searching in PostgreSQL provides support for all of these functions.
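Once both exist, a tsquery is matched against a tsvector with the @@ match operator, which is used throughout the examples in this chapter; a minimal sketch:

SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'cat & mat');
 ?column?
----------
 t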
Full text searching in PostgreSQL provides the function to_tsvector, which converts a document to the tsvector data type. More details are available in Section 9.13.2, but for now consider a simple example:
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                      to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain the words a, on, or it; that the word rats became rat; and that the punctuation sign - was ignored.
The to_tsvector function internally calls a parser which breaks the document (a fat cat sat on a mat - it ate a fat rats) into words and corresponding types. The default parser recognizes 23 types. Each word, depending on its type, passes through a group of dictionaries (Section 12.4). At the end of this step we obtain lexemes. For example, rats became rat because one of the dictionaries recognized that the word rats is the plural form of rat. Some words are treated as "stop words" (Section 12.4.1) and ignored, since they occur too frequently and have little informational value; in our example these are a, on, and it. The punctuation sign - was also ignored, because its type (Space symbols) is not indexed. Which parser and dictionaries to use, and which types of lexemes to index, is determined by the full text configuration (Section 12.4.9). It is possible to have several different configurations in the same database, and many predefined system configurations are available for different languages. In our example we used the default configuration english for the English language.
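Different configurations can produce quite different results for the same text. For instance, the predefined simple configuration only lowercases words, applying no stemming and no stop-word list (output shown for illustration):

SELECT to_tsvector('simple', 'The rats ate');
       to_tsvector
--------------------------
 'ate':3 'rats':2 'the':1

SELECT to_tsvector('english', 'The rats ate');
   to_tsvector
-----------------
 'ate':3 'rat':2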
As another example, below is the output from the ts_debug function (Section 12.8), which shows all details of the full text machinery:
SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
 Alias |  Description  | Token | Dictionaries |  Lexized token
-------+---------------+-------+--------------+----------------
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | cat   | {english}    | english: {cat}
 blank | Space symbols |       |              |
 lword | Latin word    | sat   | {english}    | english: {sat}
 blank | Space symbols |       |              |
 lword | Latin word    | on    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | mat   | {english}    | english: {mat}
 blank | Space symbols |       |              |
 blank | Space symbols | -     |              |
 lword | Latin word    | it    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | ate   | {english}    | english: {ate}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | rats  | {english}    | english: {rat}
(24 rows)
The setweight() function is used to label the entries of a tsvector with a given weight. The typical usage of this is to mark out the different parts of a document, perhaps by importance. Later, the weights can be used for ranking of search results in addition to positional information (distance between query terms). If no ranking is required, positional information can be removed from a tsvector with the strip() function to save space.
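A minimal sketch of both functions (output shown for illustration):

SELECT setweight(to_tsvector('english', 'fat cats'), 'A');
     setweight
-------------------
 'cat':2A 'fat':1A

SELECT strip(to_tsvector('english', 'fat cats'));
     strip
---------------
 'cat' 'fat'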
Because to_tsvector(NULL) can return NULL, it is recommended to use coalesce. Here is the safe method for creating a tsvector from a structured document:
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    || ' ' ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  || ' ' ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') || ' ' ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
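Once the ti column is populated this way, searches use the weights implicitly; a hypothetical follow-up query against the same tt table (the ranking functions are described below):

SELECT title, ts_rank(ti, to_tsquery('star')) AS rnk
FROM tt
WHERE ti @@ to_tsquery('star')
ORDER BY rnk DESC;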
The following functions allow manual parsing control:
ts_parse(parser, document text, OUT tokid integer, OUT token text) returns SETOF RECORD
Parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid giving its type and a token which gives its content:
SELECT * FROM ts_parse('default','123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
ts_token_type(parser, OUT tokid integer, OUT alias text, OUT description text) returns SETOF RECORD
Returns a table which describes each kind of token the parser might produce as output. For each token type the table gives the tokid which the parser uses to label each token of that type, the alias which names the token type, and a short description:
SELECT * FROM ts_token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
Ranking attempts to measure how relevant documents are to a particular query by inspecting the number of times each search word appears in the document, and whether different search terms occur near each other. Full text searching provides two predefined ranking functions which attempt to produce a measure of how relevant a document is to the query. However, the concept of relevancy is vague and very application-specific. These functions try to take into account lexical, proximity, and structural information. Different applications might require additional information for ranking, e.g., document modification time.
The lexical part of ranking reflects how often the query terms appear in the document, how close together they occur, and in which part of the document they occur. Note that ranking functions that use positional information will only work on unstripped tsvectors, because stripped tsvectors lack positional information.
The two ranking functions currently available are:
ts_rank([ weights float4[], ] vector TSVECTOR, query TSQUERY, [ normalization int4 ]) returns float4
This ranking function offers the ability to weigh word instances more heavily depending on how you have classified them. The weights specify how heavily to weigh each category of word:
{D-weight, C-weight, B-weight, A-weight}
If no weights are provided, then these defaults are used:
{0.1, 0.2, 0.4, 1.0}
Often weights are used to mark words from special areas of the document, like the title or an initial abstract, and make them more or less important than words in the document body.
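For instance, to make title words (labeled A by setweight above) dominate even more strongly, one could pass custom weights; a sketch against the apod table used in the examples below:

SELECT title, ts_rank('{0.05, 0.1, 0.2, 2.0}', textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;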
ts_rank_cd([ weights float4[], ] vector TSVECTOR, query TSQUERY, [ normalization int4 ]) returns float4
This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries", Information Processing and Management, 1999.
Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account; e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether a document's length should impact its rank. The option is a bit mask, so several behaviors can be selected at once by OR-ing values together with | (for example, 2|4; see the sketch after this list):
0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the length itself
4 divides the rank by the mean harmonic distance between extents
8 divides the rank by the number of unique words in the document
16 divides the rank by 1 + the logarithm of the number of unique words in the document
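For example, a sketch combining options 2 and 4, again using the apod table from the examples below:

SELECT title, ts_rank_cd(textsearch, query, 2|4) AS rnk
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 5;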
It is important to note that ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100%, as is sometimes required. However, a simple technique like rank/(rank+1) can be applied. Of course, this is just a cosmetic change, i.e., the ordering of the search results will not change.
Several examples are shown below; note that the second example uses normalized ranking:
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;

                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query) /
       (ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query) + 1) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;

                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
The first argument in ts_rank_cd ('{0.1, 0.2, 0.4, 1.0}') is an optional parameter which specifies the weights for labels D, C, B, and A used in the setweight function. These default values show that lexemes labeled A are ten times more important than ones labeled D.
Ranking can be expensive since it requires consulting the tsvector of every matching document, which can be I/O bound and therefore slow. Unfortunately, this is almost impossible to avoid, since full text searching in a database must work without indexes. Moreover, an index can be lossy (a GiST index, for example), so the actual documents must be checked to avoid false hits.
Note that the ranking functions above are only examples. You can write your own ranking functions and/or combine additional factors to fit your specific needs.
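For example, an application could fold document freshness into the ordering; a sketch assuming a hypothetical mod_date timestamp column on apod:

SELECT title,
       ts_rank_cd(textsearch, query) /
       (1 + extract(epoch FROM now() - mod_date) / 86400.0) AS rnk  -- mod_date is hypothetical
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;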
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL full text searching provides the function ts_headline that implements this functionality.
ts_headline([ config_name text, ] document text, query TSQUERY, [ options text ]) returns text
The ts_headline function accepts a document along with a query, and returns one or more ellipsis-separated excerpts from the document in which terms from the query are highlighted. The configuration used to parse the document can be specified by its config_name; if none is specified, the current configuration is used.
If an options string is specified it should consist of a comma-separated list of one or more 'option=value' pairs. The available options are:
StartSel, StopSel: the strings with which query words appearing in the document should be delimited to distinguish them from other excerpted words.
MaxWords, MinWords: these determine the longest and shortest headlines to output
ShortWord: this prevents your headline from beginning or ending with a word which has this many characters or less. The default value of three eliminates the English articles.
HighlightAll: boolean flag; if true the whole document will be highlighted
Any unspecified options receive these defaults:
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
For example:
SELECT ts_headline('a b c', 'c'::tsquery);
   headline
--------------
 a b <b>c</b>

SELECT ts_headline('a b c', 'c'::tsquery, 'StartSel=<,StopSel=>');
 ts_headline
-------------
 a b <c>
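The remaining options can be combined in the same string; a sketch using MaxWords, MinWords, and ShortWord against the apod table from the earlier examples:

SELECT ts_headline(body, to_tsquery('stars'),
                   'StartSel=**,StopSel=**,MaxWords=20,MinWords=5,ShortWord=4')
FROM apod
WHERE to_tsvector(body) @@ to_tsquery('stars')
LIMIT 3;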
ts_headline uses the original document, not the tsvector, so it can be slow and should be used with care. A typical mistake is to call ts_headline for every matching document when only ten documents are shown. SQL subselects can help here; below is an example:
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC LIMIT 10) AS foo;
Note that a cascaded drop of the parser function also drops the ts_headline support used by the full text search configuration config_name.