PostgreSQL 8.3beta1 Documentation
Chapter 12. Full Text Search
Full Text Searching (or just text search) identifies documents that satisfy a query and, optionally, sorts them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search treats the query as a set of words and similarity as the frequency of query words in the document. Full text indexing can be done inside the database or outside. Doing indexing inside the database allows easy access to document metadata to assist in indexing and display.
Textual search operators have existed in databases for years. PostgreSQL has the ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:
There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., satisfies and satisfy. You might miss documents that contain satisfies, although you probably would like to find them when searching for satisfy. It is possible to use OR to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
They tend to be slow because they process all documents for every search and there is no index support.
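The first limitation can be seen in a minimal sketch, assuming the predefined english configuration is available. The sample sentence is made up for illustration.

```sql
-- Pattern matching compares characters only, so the derived form is missed.
SELECT 'This design satisfies the requirement' LIKE '%satisfy%' AS like_match;

-- Text search normalizes both sides first; with an English stemmer,
-- satisfies and satisfy are expected to reduce to the same lexeme.
SELECT to_tsvector('english', 'This design satisfies the requirement')
       @@ to_tsquery('english', 'satisfy') AS fts_match;
```

The first query returns false because the string contains satisfies, not satisfy. The second should return true, provided the configuration's stemmer treats the two forms alike.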
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
Parsing documents into lexemes. It is useful to identify various classes of lexemes, e.g. digits, words, complex words, email addresses, so that they can be processed differently. In principle lexeme classes depend on the specific application but for an ordinary search it is useful to have a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
Converting lexemes into normalized form. This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is desirable to store positional information to use for proximity ranking, so that a document which contains a more "dense" region of query words is assigned a higher rank than one with scattered query words.
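As a rough illustration of the three steps above, to_tsvector parses a string, normalizes the resulting lexemes, and returns them in sorted order with their positions. This sketch assumes the predefined english configuration; the exact lexemes depend on the dictionaries that configuration uses.

```sql
-- Parse, normalize, and store in one call. Stop words such as "a" and "on"
-- are expected to be dropped, while the remaining lexemes keep the word
-- positions they had in the original string.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');
```

A lexeme that occurs more than once, such as fat here, carries a list of positions, which is the information proximity ranking relies on.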
Dictionaries allow fine-grained control over how lexemes are normalized. With dictionaries you can:
Define stop words that should not be indexed.
Map synonyms to a single word using Ispell.
Map phrases to a single word using a thesaurus.
Map different variations of a word to a canonical form using an Ispell dictionary.
Map different variations of a word to a canonical form using Snowball stemmer rules.
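To see what an individual dictionary does with a word, ts_lexize can be called directly. The dictionary name english_stem below is an assumption; substitute whichever Snowball dictionary your installation provides (psql's \dFd lists them).

```sql
-- A Snowball stemmer reduces an inflected word to its stem.
SELECT ts_lexize('english_stem', 'stars');

-- A stop word is recognized by the dictionary but is expected to yield
-- an empty array, which is why it never reaches the index.
SELECT ts_lexize('english_stem', 'a');
```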
A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries (Section 8.12). Also, a full text search operator @@ is defined for these data types (Section 12.1.2). Full text searches can be accelerated using indexes (Section 12.5).
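A minimal sketch of how these pieces fit together follows. The table messages_demo and its columns are hypothetical, and the english configuration is assumed to exist.

```sql
-- Hypothetical table holding one document per row.
CREATE TABLE messages_demo (
    mid  integer PRIMARY KEY,
    body text
);

INSERT INTO messages_demo VALUES
    (1, 'a fat cat sat on a mat and ate a fat rat'),
    (2, 'the quick brown fox');

-- Expression index over the preprocessed form of each document.
-- The configuration name is given explicitly so the expression is immutable.
CREATE INDEX messages_demo_body_idx
    ON messages_demo
    USING gin (to_tsvector('english', body));

-- The query repeats the indexed expression so the planner can use the index.
SELECT mid
FROM messages_demo
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cat & rat');
```

Only the first row contains both query words, so it is the only row expected to match. On a table this small the planner may still prefer a sequential scan; the index pays off as the table grows.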
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents which contain query words.
For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE mid = did AND mid = 12;
Note: Actually, in the previous example queries, COALESCE should be used to prevent a single NULL attribute from causing a NULL result for the whole document.
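One way to apply that advice to the first query is sketched here, using the same messages table and columns. Each attribute is replaced by an empty string when it is NULL, so the concatenation as a whole never becomes NULL.

```sql
-- coalesce(x, '') substitutes an empty string for a NULL attribute,
-- so one missing field no longer nulls out the whole document.
SELECT coalesce(title, '')    || ' ' ||
       coalesce(author, '')   || ' ' ||
       coalesce(abstract, '') || ' ' ||
       coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
```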
Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL.
Full text searching in PostgreSQL is based on the operator @@, which tests whether a tsvector (document) matches a tsquery (query). Also, this operator supports text input, allowing explicit conversion of a text string to tsvector to be skipped. The variants available are:
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
The match operator @@ returns true if the tsvector matches the tsquery. It doesn't matter which data type is written first:
SELECT 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
The form text @@ tsquery is equivalent to to_tsvector(x) @@ y, where x is the left-hand (text) operand and y is the right-hand operand. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y). Section 9.13 contains a complete list of full text search functions and operators.
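These equivalences can be checked side by side. The sketch below uses a made-up sample string and relies on whatever default_text_search_config is in effect; the explicit ::text casts are there only to make the intended operator unambiguous.

```sql
-- text @@ tsquery: the left operand is converted with to_tsvector.
SELECT 'a fat cat sat on a mat'::text @@ to_tsquery('cat & mat')        AS implicit_form,
       to_tsvector('a fat cat sat on a mat') @@ to_tsquery('cat & mat') AS explicit_form;

-- text @@ text: the right operand is additionally converted with plainto_tsquery.
SELECT 'a fat cat sat on a mat'::text @@ 'cat mat'::text                     AS implicit_form,
       to_tsvector('a fat cat sat on a mat') @@ plainto_tsquery('cat mat')   AS explicit_form;
```

Within each query the two columns should agree, since both forms go through the same conversions.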
The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g. parse based on more than just white space. This functionality is controlled by configurations. Fortunately, PostgreSQL comes with predefined configurations for many languages. (psql's \dF shows all predefined configurations.)
During installation an appropriate configuration is selected and default_text_search_config is set accordingly in postgresql.conf. If you are using the same text search configuration for the entire cluster, you can use the value in postgresql.conf. If you use different configurations throughout the cluster but the same configuration within any one database, use ALTER DATABASE ... SET. Otherwise, you must set default_text_search_config in each session. Many functions also take an optional configuration name.
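The options described above might look like the following sketch. The database name mydb is hypothetical, and pg_catalog.english is assumed to be one of the predefined configurations in your installation.

```sql
-- Inspect the configuration currently in effect.
SHOW default_text_search_config;

-- Override it for the current session only.
SET default_text_search_config = 'pg_catalog.english';

-- Set a per-database default (mydb is a hypothetical database name).
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';

-- Or bypass the default by naming a configuration in the call itself.
SELECT to_tsvector('english', 'The rats were running');
```

Naming the configuration in each call is the most explicit choice and makes a query's result independent of session settings, at the cost of some repetition.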