PostgreSQL 8.3beta1 Documentation | ||||
---|---|---|---|---|
Prev | Fast Backward | Chapter 9. Functions and Operators | Fast Forward | Next |
This section outlines all the functions and operators that are available for full text searching.
Full text search vectors and queries both use lexemes, but for different purposes. A tsvector represents the lexemes (tokens) parsed out of a document, with an optional position. A tsquery specifies a boolean condition using lexemes.
All of the following functions that accept a configuration argument can use a textual configuration name to select a configuration. If the option is omitted the configuration specified by default_text_search_config is used. For more information on configuration, see Section 12.4.9.
The operator @@ is used to perform full text searches:
TSVECTOR @@ TSQUERY TSQUERY @@ TSVECTOR
Returns true if TSQUERY is contained in TSVECTOR, and false if not:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery; ?column? ---------- t SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'fat & cow'::tsquery; ?column? ---------- f
text @@ tsquery
Returns true if TSQUERY is contained in TEXT, and false if not:
SELECT 'a fat cat sat on a mat and ate a fat rat'::text @@ 'cat & rat'::tsquery; ?column? ---------- t SELECT 'a fat cat sat on a mat and ate a fat rat'::text @@ 'cat & cow'::tsquery; ?column? ---------- f
text @@ text
Returns true if the right argument (the query) is contained in the left argument, and false otherwise:
SELECT 'a fat cat sat on a mat and ate a fat rat' @@ 'cat rat'; ?column? ---------- t SELECT 'a fat cat sat on a mat and ate a fat rat' @@ 'cat cow'; ?column? ---------- f
For index support of full text operators consult Section 12.5.
to_tsvector([config_name], document TEXT) returns TSVECTOR
Parses a document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document in lexicographic order.
strip(vector TSVECTOR) returns TSVECTOR
Returns a vector which lists the same lexemes as the given vector, but which lacks any information about where in the document each lexeme appeared. While the returned vector is useless for relevance ranking it will usually be much smaller.
setweight(vector TSVECTOR, letter) returns TSVECTOR
This function returns a copy of the input vector in which every location has been labeled with either the letter A, B, or C, or the default label D (which is the default for new vectors and as such is usually not displayed). These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.
vector1 || vector2 tsvector_concat(vector1 TSVECTOR, vector2 TSVECTOR) returns TSVECTOR
Returns a vector which combines the lexemes and positional information of
the two vectors given as arguments. Positional weight labels (described
in the previous paragraph) are retained during the concatenation. This
has at least two uses. First, if some sections of your document need to be
parsed with different configurations than others, you can parse them
separately and then concatenate the resulting vectors. Second, you can
weigh words from one section of your document differently than the others
by parsing the sections into separate vectors and assigning each vector
a different position label with the setweight()
function. You can then concatenate them into a single vector and provide
a weights argument to the ts_rank()
function that assigns
different weights to positions with different labels.
length(vector TSVECTOR) returns INT4
Returns the number of lexemes stored in the vector.
text::TSVECTOR returns TSVECTOR
Directly casting text to a tsvector allows you to directly inject lexemes into a vector with whatever positions and positional weights you choose to specify. The text should be formatted to match the way a vector is displayed by SELECT.
tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ]) tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
Two built-in trigger functions are available to automatically update a tsvector column from one or more textual columns. An example of their use is:
CREATE TABLE tblMessages ( strMessage text, tsv tsvector ); CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tblMessages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', strMessage);
Having created this trigger, any change in strMessage will be automatically reflected into tsv.
Both triggers require you to specify the text search configuration to
be used to perform the conversion. For
tsvector_update_trigger
, the configuration name is simply
given as the second trigger argument. It must be schema-qualified as
shown above, so that the trigger behavior will not change with changes
in search_path. For
tsvector_update_trigger_column
, the second trigger argument
is the name of another table column, which must be of type
regconfig. This allows a per-row selection of configuration
to be made.
ts_stat(sqlquery text [, weights text ]) returns SETOF statinfo
Here statinfo is a type, defined as:
CREATE TYPE statinfo AS (word text, ndoc integer, nentry integer);
and sqlquery is a text value containing a SQL query
which returns a single tsvector column. ts_stat
executes the query and returns statistics about the resulting
tsvector data, i.e., the number of documents, ndoc,
and the total number of words in the collection, nentry. It is
useful for checking your configuration and to find stop word candidates. For
example, to find the ten most frequent words:
SELECT * FROM ts_stat('SELECT vector from apod') ORDER BY ndoc DESC, nentry DESC, word LIMIT 10;
Optionally, one can specify weights to obtain statistics about words with a specific weight:
SELECT * FROM ts_stat('SELECT vector FROM apod','a') ORDER BY ndoc DESC, nentry DESC, word LIMIT 10;
TSVECTOR < TSVECTOR TSVECTOR <= TSVECTOR TSVECTOR = TSVECTOR TSVECTOR >= TSVECTOR TSVECTOR > TSVECTOR
All btree operations are defined for the tsvector type. tsvectors are compared with each other using lexicographical ordering.
to_tsquery([config_name], querytext text) returns TSQUERY
Accepts querytext, which should consist of single tokens
separated by the boolean operators & (and), |
(or) and ! (not), which can be grouped using parentheses.
In other words, to_tsquery
expects already parsed text.
Each token is reduced to a lexeme using the specified or current configuration.
A weight class can be assigned to each lexeme entry to restrict the search region
(see setweight
for an explanation). For example:
'fat:a & rats'
The to_tsquery
function can also accept a text
string. In this case querytext should
be quoted. This may be useful, for example, to use with a thesaurus
dictionary. In the example below, a thesaurus contains rule supernovae
stars : sn:
SELECT to_tsquery('''supernovae stars'' & !crab'); to_tsquery --------------- 'sn' & !'crab'
Without quotes to_tsquery
will generate a syntax error.
plainto_tsquery([config_name], querytext text) returns TSQUERY
Transforms unformatted text querytext to tsquery.
It is the same as to_tsquery
but accepts text
without quotes and will call the parser to break it into tokens.
plainto_tsquery
assumes the & boolean
operator between words and does not recognize weight classes.
querytree(query TSQUERY) returns TEXT
This returns the query used for searching an index. It can be used to test for an empty query. The SELECT below returns NULL, which corresponds to an empty query since GIN indexes do not support queries with negation (a full index scan is inefficient):
SELECT querytree(to_tsquery('!defined')); querytree -----------
text::TSQUERY returns TSQUERY
Directly casting text to a tsquery allows you to directly inject lexemes into a query using whatever positions and positional weight flags you choose to specify. The text should be formatted to match the way a vector is displayed by SELECT.
numnode(query TSQUERY) returns INTEGER
This returns the number of nodes in a query tree. This function can be used to determine if query is meaningful (returns > 0), or contains only stop words (returns 0):
SELECT numnode(plainto_tsquery('the any')); NOTICE: query contains only stopword(s) or does not contain lexeme(s), ignored numnode --------- 0 SELECT numnode(plainto_tsquery('the table')); numnode --------- 1 SELECT numnode(plainto_tsquery('long table')); numnode --------- 3
TSQUERY && TSQUERY returns TSQUERY
Returns AND-ed TSQUERY
TSQUERY || TSQUERY returns TSQUERY
Returns OR-ed TSQUERY
!! TSQUERY returns TSQUERY
negation of TSQUERY
TSQUERY < TSQUERY TSQUERY <= TSQUERY TSQUERY = TSQUERY TSQUERY >= TSQUERY TSQUERY > TSQUERY
All btree operations are defined for the tsquery type. tsqueries are compared to each other using lexicographical ordering.
Query rewriting is a set of functions and operators for the tsquery data type. It allows control at search query time without reindexing (the opposite of the thesaurus). For example, you can expand the search using synonyms (new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic.
The ts_rewrite()
function changes the original query by
replacing part of the query with some other string of type tsquery,
as defined by the rewrite rule. Arguments to ts_rewrite()
can be names of columns of type tsquery.
CREATE TABLE aliases (t TSQUERY PRIMARY KEY, s TSQUERY); INSERT INTO aliases VALUES('a', 'c');
ts_rewrite (query TSQUERY, target TSQUERY, sample TSQUERY) returns TSQUERY
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery); ts_rewrite ------------ 'b' & 'c'
ts_rewrite(ARRAY[query TSQUERY, target TSQUERY, sample TSQUERY]) returns TSQUERY
SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases; ts_rewrite ------------ 'b' & 'c'
ts_rewrite (query TSQUERY,'SELECT target ,sample FROM test'::text) returns TSQUERY
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases'); ts_rewrite ------------ 'b' & 'c'
What if there are several instances of rewriting? For example, query 'a & b' can be rewritten as 'b & c' and 'cc'.
SELECT * FROM aliases; t | s -----------+------ 'a' | 'c' 'x' | 'z' 'a' & 'b' | 'cc'
This ambiguity can be resolved by specifying a sort order:
SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t DESC'); ts_rewrite --------- 'cc' SELECT ts_rewrite('a & b', 'SELECT t, s FROM aliases ORDER BY t ASC'); ts_rewrite -------------- 'b' & 'c'
Let's consider a real-life astronomical example. We'll expand query supernovae using table-driven rewriting rules:
CREATE TABLE aliases (t tsquery primary key, s tsquery); INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')); SELECT ts_rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab'); ?column? ------------------------------- ( 'supernova' | 'sn' ) & 'crab'
Notice, that we can change the rewriting rule online:
UPDATE aliases SET s=to_tsquery('supernovae|sn & !nebulae') WHERE t=to_tsquery('supernovae'); SELECT ts_rewrite(to_tsquery('supernovae'), 'SELECT * FROM aliases') && to_tsquery('crab'); ?column? ----------------------------------------------- 'supernova' | 'sn' & !'nebula' ) & 'crab'
Rewriting can be slow for many rewriting rules since it checks every rule for a possible hit. To filter out obvious non-candidate rules there are containment operators for the tsquery type. In the example below, we select only those rules which might contain the original query:
SELECT ts_rewrite(ARRAY['a & b'::tsquery, t,s]) FROM aliases WHERE 'a & b' @> t; ts_rewrite ------------ 'b' & 'c'
Two operators are defined for tsquery:
TSQUERY @> TSQUERY
Returns true if the right argument might be contained in left argument.
TSQUERY <@ TSQUERY
Returns true if the left argument might be contained in right argument.
To speed up operators <@ and @> for tsquery one can use a GiST index with a tsquery_ops opclass:
CREATE INDEX t_idx ON aliases USING gist (t tsquery_ops);