12.1. Introduction

Full Text Searching (or just text search) identifies documents that satisfy a query and, optionally, sorts them by relevance to the query. The most common type of search finds all documents containing the given query terms and returns them in order of their similarity to the query. The notions of query and similarity are very flexible and depend on the specific application. The simplest search treats the query as a set of words and similarity as the frequency of query words in the document. Full text indexing can be done inside the database or outside. Doing indexing inside the database allows easy access to document metadata to assist in indexing and display.

Textual search operators have existed in databases for years. PostgreSQL has the ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:

Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:

Dictionaries allow fine-grained control over how lexemes are normalized. With dictionaries you can:

A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries (Section 8.12). Also, a full text search operator @@ is defined for these data types (Section 12.1.2). Full text searches can be accelerated using indexes (Section 12.5).
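As a rough illustration of how these pieces fit together, the sketch below builds an expression index and runs an indexed search. It assumes a hypothetical table messages with columns mid, title, and body, and the predefined english configuration; the index name is likewise invented for the example.

```sql
-- Assumption: a table messages(mid, title, body) exists; the index name is illustrative.
-- Build a GIN index over the tsvector derived from the body column.
CREATE INDEX messages_body_fts_idx
    ON messages
    USING gin (to_tsvector('english', body));

-- The WHERE clause repeats the indexed expression exactly,
-- including the configuration name, so the planner can consider the index.
SELECT mid, title
FROM messages
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cat & rat');
```

The configuration name is spelled out in both places because an expression index can only be matched by a query that uses the same expression.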

12.1.1. What Is a Document?

A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents which contain query words.

For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:

SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;

Note: In the example queries above, COALESCE should be used to prevent a single NULL attribute from causing a NULL result for the whole document.
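A minimal sketch of the first query with that precaution applied, using the same messages table, might look like this:

```sql
-- Replace each NULL attribute with an empty string so that
-- concatenation never yields NULL for the whole document.
SELECT coalesce(title, '')    || ' ' ||
       coalesce(author, '')   || ' ' ||
       coalesce(abstract, '') || ' ' ||
       coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
```

This works because the || operator returns NULL whenever either operand is NULL, so each attribute has to be guarded individually.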

Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL.

12.1.2. Performing Searches

Full text searching in PostgreSQL is based on the match operator @@, which tests whether a tsvector (document) matches a tsquery (query). The operator also accepts text input, so the explicit conversion of a text string to tsvector can be skipped. The available variants are:

tsvector @@ tsquery
tsquery  @@ tsvector
text @@ tsquery
text @@ text

The match operator @@ returns true if the tsvector matches the tsquery. It doesn't matter which data type is written first:

SELECT 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f

In the text forms, write x for the left operand and y for the right operand. The form text @@ tsquery is equivalent to to_tsvector(x) @@ y. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y). Section 9.13 contains a complete list of full text search functions and operators.
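The equivalences stated above can be checked side by side. The sketch below uses explicit ::text casts so that the intended operator variant is chosen; the sample sentence and search words are arbitrary, and the conversions use whatever default_text_search_config is in effect.

```sql
-- text @@ tsquery, followed by its explicit equivalent.
SELECT 'a fat cat sat on a mat'::text @@ to_tsquery('cat & mat');
SELECT to_tsvector('a fat cat sat on a mat') @@ to_tsquery('cat & mat');

-- text @@ text, followed by its explicit equivalent.
-- plainto_tsquery turns the plain words into an AND query.
SELECT 'a fat cat sat on a mat'::text @@ 'cat mat'::text;
SELECT to_tsvector('a fat cat sat on a mat') @@ plainto_tsquery('cat mat');
```

Within each pair, both statements should return the same result.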

12.1.3. Configurations

The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g. parse based on more than just white space. This functionality is controlled by configurations. Fortunately, PostgreSQL comes with predefined configurations for many languages. (psql's \dF shows all predefined configurations.)
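To see the effect of a configuration, the same sentence can be converted under two of the predefined configurations. This is only a sketch; the exact lexemes produced depend on the dictionaries installed with each configuration.

```sql
-- The english configuration discards stop words such as 'a', 'on', and 'and',
-- and reduces the remaining words to stems.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');

-- The simple configuration lower-cases words but keeps every one of them,
-- since it has no stop word list by default.
SELECT to_tsvector('simple', 'a fat cat sat on a mat and ate a fat rat');
```

Comparing the two results shows which words are treated as stop words and how word positions are still counted across the skipped words.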

During installation an appropriate configuration was selected and default_text_search_config was set accordingly in postgresql.conf. If you use the same text search configuration for the entire cluster, you can rely on the value in postgresql.conf. If you use different configurations within the cluster but the same configuration for any one database, use ALTER DATABASE ... SET. Otherwise, you must set default_text_search_config in each session. Many functions also take an optional configuration name.
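The three levels described above could be exercised roughly as follows. The database name mydb is a placeholder, and english stands in for whichever configuration suits your data.

```sql
-- Inspect the configuration currently in effect.
SHOW default_text_search_config;

-- Per-database default (mydb is a hypothetical database name).
-- Takes effect in sessions started after the change.
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';

-- Per-session setting.
SET default_text_search_config = 'pg_catalog.english';

-- Per-call override: pass the configuration name as the first argument.
SELECT to_tsvector('english', 'a fat cat sat on a mat');
```

Passing the configuration explicitly is the most predictable choice for expressions used in indexes, because their result then does not depend on session settings.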