How to find happiness with PostgreSQL

Introduction

Full-text search is an integral part of modern applications, especially those that work with large volumes of text, be it blogging platforms, content management systems, or news aggregators. Whatever application you're developing, adding full-text search can greatly improve the user experience. In this article, we will look at the main full-text search capabilities PostgreSQL offers, the benefits they provide, and example queries.

Before going into detail, let's look at a real problem. Imagine a document management application where users search for documents using not just keywords, but entire phrases or even complex queries that include synonyms. With traditional search methods based on LIKE or regular expressions, such tasks are performed slowly and not always efficiently. This is where PostgreSQL full-text search comes in.

Full text search in PostgreSQL: key features

Full text search tasks

Full-text search is designed to find and rank text data based on keywords or phrases found in text fields in a database. One of its key tasks is to work effectively with large volumes of text, where standard approaches such as the LIKE operator turn out to be insufficient. Using LIKE, we run into an obvious list of problems.

Ineffectiveness of simple search

Using the standard LIKE operator to search for keywords in large texts leads to poor performance. Each query requires a full scan of every text, which becomes unacceptable when working with big data.

SELECT * FROM documents WHERE body LIKE '%keyword%';

This query requires a full traversal of all rows in the table, which slows down execution for large amounts of data.

Morphology-based search

The search must take different word forms into account. For example, a query for the word "category" should return documents where forms such as "categories" occur. Operators like LIKE don't allow you to do this.

SELECT * FROM documents WHERE body LIKE '%category%' OR body LIKE '%categories%';

This can be solved by normalizing word forms.
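A minimal sketch of what normalization buys us, assuming the built-in english configuration: different word forms reduce to the same lexeme, so they match.

```sql
-- 'categories' and 'category' both stem to the lexeme 'categori'
SELECT to_tsvector('english', 'categories') @@ to_tsquery('english', 'category');
-- returns t
```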

Ranking results

An important aspect is not only finding documents, but also sorting them by relevance. Standard SQL queries do not have built-in support for ranking results based on how close the query words are to each other in the document or how often they occur.

Basic approaches to implementing full-text search in PostgreSQL

PostgreSQL offers a more powerful and flexible full-text search engine than traditional approaches, solving the problems mentioned above with built-in data types, functions and indexes.

The main tools PostgreSQL offers here are tsvector, tsquery, and the GIN index.

tsvector data type

This is a data type for storing indexed text. It automatically normalizes text by removing stop words and converting words to their base forms.

To start taking advantage of this data type, simply create a table with an attribute of the special type tsvector

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    tsvector_col TSVECTOR
);

And then fill it with normalized text using the to_tsvector function, which converts text to the tsvector format.

INSERT INTO documents (title, body, tsvector_col) VALUES 
('Artificial Intelligence', 'Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines.', to_tsvector('english', 'Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines.')),
('Machine Learning', 'Machine learning is a subset of AI that allows systems to learn from data.', to_tsvector('english', 'Machine learning is a subset of AI that allows systems to learn from data.')),
('Deep Learning', 'Deep learning is a subset of machine learning that uses neural networks to model complex patterns.', to_tsvector('english', 'Deep learning is a subset of machine learning that uses neural networks to model complex patterns.')),
('Data Science', 'Data science involves the use of statistical methods to extract insights from data.', to_tsvector('english', 'Data science involves the use of statistical methods to extract insights from data.'));

As a result we get

| id | title                   | body                                                                                                  | tsvector_col                                                                                              |
|----|-------------------------|-------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| 1  | Artificial Intelligence | Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. | 'ai':3 'aim':11 'artifici':1 'comput':8 'creat':13 'field':6 'intellig':2,14 'machin':15 'scienc':9       |
| 2  | Machine Learning        | Machine learning is a subset of AI that allows systems to learn from data.                            | 'ai':7 'allow':9 'data':14 'learn':2,12 'machin':1 'subset':5 'system':10                                 |
| 3  | Deep Learning           | Deep learning is a subset of machine learning that uses neural networks to model complex patterns.    | 'complex':15 'deep':1 'learn':2,8 'machin':7 'model':14 'network':12 'neural':11 'pattern':16 'subset':5 'use':10 |
| 4  | Data Science            | Data science involves the use of statistical methods to extract insights from data.                   | 'data':1,13 'extract':10 'insight':11 'involv':3 'method':8 'scienc':2 'statist':7 'use':5                |

tsquery data type

The tsquery data type is used to store and execute search queries. It supports logical operators such as AND, OR, and NOT, making it flexible and powerful for building complex queries.

A query using tsquery looks something like this

SELECT * FROM documents WHERE tsvector_col @@ to_tsquery('machine & learning');

This query finds all documents that contain the words "machine" and "learning" at the same time. The @@ operator matches a query against the indexed text (tsvector).

As a result, we get the following rows

 id |      title       |                                                body                                                |                                                   tsvector_col                                                    
----+------------------+----------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------
  2 | Machine Learning | Machine learning is a subset of AI that allows systems to learn from data.                         | 'ai':7 'allow':9 'data':14 'learn':2,12 'machin':1 'subset':5 'system':10
  3 | Deep Learning    | Deep learning is a subset of machine learning that uses neural networks to model complex patterns. | 'complex':15 'deep':1 'learn':2,8 'machin':7 'model':14 'network':12 'neural':11 'pattern':16 'subset':5 'use':10

Here the search is fast thanks to the normalized data stored in the tsvector format.

Advantages of tsquery

  • Logical Operators: You can use operators like & (AND), | (OR), and ! (NOT) for flexible query writing.

  • Prefix queries: a prefix search can be done using :*. For example, sci:* will find all words starting with "sci", such as "science".

    SELECT * FROM documents WHERE to_tsvector(body) @@ to_tsquery('sci:*');
    
    >>
     id |          title          |                                                 body                                                  |                                            tsvector_col                                             
    ----+-------------------------+-------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------
      4 | Artificial Intelligence | Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. | 'ai':3 'aim':11 'artifici':1 'comput':8 'creat':13 'field':6 'intellig':2,14 'machin':15 'scienc':9
      7 | Data Science            | Data science involves the use of statistical methods to extract insights from data.                   | 'data':1,13 'extract':10 'insight':11 'involv':3 'method':8 'scienc':2 'statist':7 'use':5
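The logical operators can also be combined in a single query. A sketch against the documents table above:

```sql
-- Documents about learning that mention machine or deep, but not neural networks
SELECT title
FROM documents
WHERE tsvector_col @@ to_tsquery('english', 'learning & (machine | deep) & !neural');
-- with the sample data above, this matches only the 'Machine Learning' row
```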
    

What should you pay attention to when working with tsvector and tsquery?

Primarily, on how you prepared the data for these types. The main thing to take into account here is the configuration you use in the to_tsvector and to_tsquery functions when converting text data to the appropriate type. Let's take a closer look at these functions.

The to_tsvector function converts text to the tsvector format, a structure suitable for full-text search. It breaks text into tokens (words), normalizes them, and then stores them with their positions for searching.

The to_tsquery function creates full-text search queries that operate on the tsvector data type. These queries allow you to find documents containing specific words or combinations of them using Boolean operators.

So, when using these functions in your search queries, make sure the first parameter, which selects the configuration, is used consistently. That is, avoid situations where, for example, to_tsvector uses the English configuration while to_tsquery uses the Russian one.

For example, consider the following expressions

SELECT to_tsvector('simple','Это простой государственно-правовой текст для теста. This is a simple text example for testing.');
                                                                                         to_tsvector                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'a':11 'example':14 'for':15 'is':10 'simple':12 'testing':16 'text':13 'this':9 'государственно':4 'государственно-правовой':3 'для':7 'правовой':5 'простой':2 'текст':6 'теста':8 'это':1
SELECT to_tsvector('english', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.');
                                                                      to_tsvector                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 'exampl':14 'simpl':12 'test':16 'text':13 'государственно':4 'государственно-правовой':3 'для':7 'правовой':5 'простой':2 'текст':6 'теста':8 'это':1
SELECT to_tsvector('russian', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.');
                                                             to_tsvector                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 'exampl':14 'simpl':12 'test':16 'text':13 'государствен':4 'государственно-правов':3 'правов':5 'прост':2 'текст':6 'тест':8 'эт':1

It is easy to see that the results of tokenization are strikingly different: the analysis of morphology, the list of stop words, and the support for complex words are different. And this, in turn, will greatly influence search results and indexing efficiency.

For example

SELECT to_tsvector('simple', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.')
@@ to_tsquery('simple', 'просто & для & examples & test');
>>
 ?column? 
----------
 f

SELECT to_tsvector('english', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.')
@@ to_tsquery('english', 'просто & для & examples & test');
>>
 ?column? 
----------
 f

SELECT to_tsvector('russian', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.')
@@ to_tsquery('russian', 'просто & для & examples & test');
>>
 ?column? 
----------
 t

As you can see, as soon as we enabled morphology support for both languages (the russian configuration also applies English stemming to ASCII words), the query matched using stemming for both Russian and English words.

Another useful thing to consider is creating your own configurations. Full-text search configurations consist of several customizable components, including a synonym dictionary, a thesaurus, stop words, and morphological normalization using Ispell.

A synonym dictionary allows you to map different words (synonyms) to one normalized form for a more flexible search. For example, if a query uses the word "авто" ("auto"), it will also find documents containing "машина" ("car") or "автомобиль" ("automobile").

To customize this dictionary, create a file with synonyms, e.g. synonyms_russian.syn in the tsearch_data directory, where each line maps a word to its canonical synonym

авто автомобиль
машина автомобиль
ПК компьютер
комп компьютер

Define a dictionary of synonyms in PostgreSQL

CREATE TEXT SEARCH DICTIONARY synonym_russian (
  TEMPLATE = synonym,
  SYNONYMS = synonyms_russian   -- the synonyms_russian.syn file in tsearch_data
);

Add this dictionary to your full text search configuration

ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR asciiword, word WITH synonym_russian, russian_stem;

Now a search for "авто" will also find documents containing its synonym "автомобиль".

Thesaurus is an extension of the synonym dictionary that supports more complex mappings between terms, including multi-level synonyms and collocations.

To set up a thesaurus, create a thesaurus file, e.g. thesaurus_russian.ths. Each line maps a word or phrase to the form it should be indexed as:

авто : автомобиль
машина : автомобиль
ПК : компьютер
комп : компьютер

Create a thesaurus dictionary in PostgreSQL

CREATE TEXT SEARCH DICTIONARY thesaurus_russian (
  TEMPLATE = thesaurus,
  DictFile = thesaurus_russian,   -- the thesaurus_russian.ths file in tsearch_data
  Dictionary = russian_stem       -- internal dictionary used for normalization
);

Apply this dictionary to the configuration

ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR asciiword, word WITH thesaurus_russian, russian_stem;

Now, using the thesaurus, a search for "авто" will also find "автомобиль", and a search for "комп" will find both "ПК" and "компьютер".

Ispell dictionaries normalize tokens morphologically: using affix rules, they reduce different grammatical forms of a word to a common lexeme. This is especially useful for languages with rich inflection.

To configure it, create files with Ispell rules for the Russian language. Typically these are two files: one with affixes (for example, russian.affix) and one with a dictionary (for example, russian.dict).

The affix file (.affix) defines rules for changing words, such as declension and conjugation. It describes the suffixes and prefixes that can be applied to words and how word forms change.

Example file russian.affix:

SET KOI8-R
TRY аеиоунстлвркмпдгбзчжцйхфшщэё

PFX A Y 1
PFX A   0   пре     .

SFX B Y 2
SFX B   0   а       [^аеёиоуыэюя]
SFX B   0   ов      [^аеёиоуыэюя]
SFX B   а   е       [^аеёиоуыэюя]
SFX B   ов   0      [^аеёиоуыэюя]

SFX C Y 1
SFX C   0   н       .

SFX D Y 2
SFX D   0   и       [^аеёиоуыэюя]
SFX D   и   о       [^аеёиоуыэюя]

In this example:

  • PFX and SFX denote prefixes and suffixes respectively.

  • A, B, C, D are groups of affixes.

  • Y indicates that the affix rule can combine with other affixes (cross-product).

  • 0 means that nothing is stripped from the word before the affix is applied.

  • For example, prefix group A says that the prefix "пре" can be added to a word, and suffix groups like B add endings for the declension of nouns.

The dictionary file (.dict) contains a list of word stems that can change according to the rules from the affix file.

Example file russian.dict:

машин/B
программ/A
интеллект/C
робот/D
алгоритм
данн/C
систем/AB

In this example:

  • машин/B means that the stem "машин" can change according to the rules of suffix group B from the affix file (for example, "машина", "машины", "машине", etc.).

  • программ/A means that the stem "программ" can use the prefix from group A.

  • систем/AB – this stem can combine the prefix from group A and the suffixes from group B.

Next, create an Ispell dictionary in PostgreSQL

CREATE TEXT SEARCH DICTIONARY ispell_russian (
  TEMPLATE = ispell,
  DictFile = russian,
  AffFile = russian,
  StopWords = russian
);

Add Ispell dictionary to full text search configuration

ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR word, asciiword WITH ispell_russian, russian_stem;

Now the system will normalize word forms in the text based on the Ispell rules.

Stop words are frequently occurring words that do not affect the meaning of a query and, as a rule, are excluded from indexing and searching (for example, prepositions, conjunctions, etc.).

To change this part of the configuration, create a file with a list of stop words, e.g. russian_stopwords.stop.

For example

и
в
на
с
для
к

Define a dictionary of stop words in PostgreSQL

CREATE TEXT SEARCH DICTIONARY stopwords_russian (
  TEMPLATE = simple,
  STOPWORDS = russian_stopwords   -- the russian_stopwords.stop file in tsearch_data
);

Add this dictionary to your full text search configuration

ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR asciiword, word WITH stopwords_russian, russian_stem;

Now all specified stop words will be excluded from the indexing process and will not participate in search queries.

And if we now put all this together, then to create your own full-fledged full-text search configuration you need to run something like the following script (see https://postgrespro.ru/docs/postgresql/17/textsearch-dictionaries)

CREATE TEXT SEARCH CONFIGURATION my_russian ( COPY = russian );

-- Synonym dictionary
CREATE TEXT SEARCH DICTIONARY synonym_russian (
  TEMPLATE = synonym,
  SYNONYMS = synonyms_russian
);

-- Thesaurus
CREATE TEXT SEARCH DICTIONARY thesaurus_russian (
  TEMPLATE = thesaurus,
  DictFile = thesaurus_russian,
  Dictionary = russian_stem
);

-- Ispell dictionary for morphological normalization
CREATE TEXT SEARCH DICTIONARY ispell_russian (
  TEMPLATE = ispell,
  DictFile = russian,
  AffFile = russian,
  StopWords = russian
);

-- Stop words
CREATE TEXT SEARCH DICTIONARY stopwords_russian (
  TEMPLATE = simple,
  STOPWORDS = russian_stopwords
);

-- Apply all dictionaries to the configuration
ALTER TEXT SEARCH CONFIGURATION my_russian
  ALTER MAPPING FOR word, asciiword WITH synonym_russian, thesaurus_russian, ispell_russian, stopwords_russian, russian_stem;

To apply this configuration, you must either make it the default configuration or specify it explicitly in the full-text search functions

SELECT to_tsvector('my_russian', 'Это простой государственно-правовой текст для теста. This is a simple text example for testing.')
@@ to_tsquery('my_russian', 'просто & для & examples & test');

GIN indexes

A GIN (Generalized Inverted Index) index is a type of index in PostgreSQL that efficiently handles multi-valued attributes such as arrays, JSON, and most importantly in our case, full-text search queries via tsvector. GIN indexes speed up the process of searching text data, especially when working with morphologically complex languages.

Let's create a GIN index on our tsvector_col field to speed up queries

CREATE INDEX idx_fulltext ON documents USING GIN(tsvector_col);

The GIN index works on the inverted index principle, where each word (token) corresponds to a list of documents containing that token. The basic structure of GIN consists of the following elements:

  1. Key (lexeme): Each token from the text represents a key in the index.

  2. Posting list: For each key, a list is created of all documents (or table rows) in which that token appears. In the case of PostgreSQL, for full-text search this list will include references to table rows containing the token.

  3. Tree structure: The GIN index is built as a tree, where each token is associated with documents. This allows you to quickly find the necessary documents by token.

When a GIN index is created on the tsvector_col field, the search process for text data is significantly accelerated due to the following features:

  1. Inverted index: Unlike a standard full table scan, a GIN index stores tokens and their positions as an inverted index. This means that a query can immediately access the desired token and get a list of documents in which it is present, bypassing irrelevant rows.

  2. Quick matching of tokens with documents: when a query uses the @@ operator, PostgreSQL accesses the GIN index and quickly retrieves all documents where the tokens specified in the query occur. For example, the query to_tsquery('fox & dog') will search for documents where both tokens ("fox" and "dog") appear.

  3. Multi-token search: if the query contains several keywords (tokens), for example fox & dog, the GIN index allows searching for multiple tokens and their intersections simultaneously, which significantly speeds up query execution compared to a sequential scan.

Let's look at the request and discuss how the GIN index will work.

SELECT * FROM documents WHERE tsvector_col @@ to_tsquery('fox & dog');

Step 1: Using the GIN Index:

  • PostgreSQL parses the query into the lexemes "fox" and "dog" and looks each of them up in the GIN index.

Step 2: Document Matching:

  • For each lexeme, the inverted GIN index already contains a list of documents in which it appears. For example:

    • “fox” – documents 1, 2, 5.

    • “dog” – documents 2, 3, 5.

  • PostgreSQL quickly finds the intersection of these lists (documents 2 and 5) and returns only those.

Step 3: Read Data:

  • Once PostgreSQL has found documents with the desired tokens, it extracts the complete rows of data and returns them as the result of the query.
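The intersection in step 2 can be mimicked in plain SQL, with arrays standing in for the posting lists (an illustration of the idea only; the real posting lists live inside the index):

```sql
-- Hypothetical posting lists: 'fox' -> {1,2,5}, 'dog' -> {2,3,5}
SELECT ARRAY(
  SELECT d
  FROM (
    SELECT unnest(ARRAY[1, 2, 5])   -- documents containing 'fox'
    INTERSECT
    SELECT unnest(ARRAY[2, 3, 5])   -- documents containing 'dog'
  ) AS t(d)
  ORDER BY d
) AS matching_docs;
-- matching_docs = {2,5}
```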

And how will this help us?

  • Fast execution: GIN index allows you to avoid a full table scan (sequential scan) when searching for text data, which is especially useful for large amounts of data. Queries that would take minutes without an index are completed in milliseconds with a GIN index.

  • Less resource usage: By using an index, the amount of data that needs to be read from disk is minimized, reducing the load on the disk system and processor.

  • Complex query support: GIN index supports searching multiple words (tokens) using Boolean operators (for example, AND, OR, NOT). This makes it indispensable for full-text search, where the intersection of multiple tokens is important.

What should you pay attention to?

  1. Slow updating: GIN indexes can be slow to create and update, especially for large tables, making them less suitable for dynamic data that is updated frequently.

  2. Index size: GIN indexes can take up significant space in the database. This can be a problem when resources are limited.

How can the GIN index help us?

For our table, consider a query as simple as an axe

SELECT * FROM documents WHERE body LIKE '%Artificial%' AND body LIKE '%learning%';

If you look at the query plan, we will see a plain sequential scan of all records

                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on public.documents  (cost=0.00..19.45 rows=1 width=100) (actual time=0.009..0.020 rows=2 loops=1)
   Output: id, title, body, tsvector_col
   Filter: ((documents.body ~~ '%Artificial%'::text) AND (documents.body ~~ '%learning%'::text))
   Rows Removed by Filter: 27
   Buffers: shared hit=2
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.075 ms
 Execution Time: 0.031 ms

A similar query using tsvector will give the following result.

SELECT * FROM documents WHERE tsvector_col @@ to_tsquery('Artificial & learning');
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Seq Scan on public.documents  (cost=0.00..175.38 rows=1 width=100) (actual time=0.029..0.128 rows=3 loops=1)
   Output: id, title, body, tsvector_col
   Filter: (documents.tsvector_col @@ to_tsquery('Artificial & learning'::text))
   Rows Removed by Filter: 26
   Buffers: shared hit=2
 Planning Time: 0.048 ms
 Execution Time: 0.135 ms

We can see that using tsvector by itself gained us nothing: the documents are still scanned sequentially, although the comparison within each document is much more efficient. On our small sample the second query even lost in time, but it won in search quality: it turned out more complete, finding three relevant documents (which is correct) instead of the two found by the first query.

Now let's add the GIN index and see what changes.

CREATE INDEX idx_fulltext ON documents USING GIN(tsvector_col);
SELECT * FROM documents WHERE tsvector_col @@ to_tsquery('Artificial & learning');
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.documents  (cost=20.25..24.51 rows=1 width=100) (actual time=0.044..0.045 rows=3 loops=1)
   Output: id, title, body, tsvector_col
   Recheck Cond: (documents.tsvector_col @@ to_tsquery('Artificial & learning'::text))
   Heap Blocks: exact=1
   Buffers: shared hit=6
   ->  Bitmap Index Scan on idx_fulltext  (cost=0.00..20.25 rows=1 width=0) (actual time=0.041..0.041 rows=3 loops=1)
         Index Cond: (documents.tsvector_col @@ to_tsquery('Artificial & learning'::text))
         Buffers: shared hit=5
 Planning:
   Buffers: shared hit=4 read=4
 Planning Time: 0.206 ms
 Execution Time: 0.062 ms
(12 rows)

As a result, we see that while the completeness of the search results is preserved (all three documents were found), the costs dropped sharply: cost=20.25..24.51 against cost=0.00..175.38 previously. And this is noticeable even on our small table. On large volumes of data, this search option will be significantly ahead of plain LIKE.

The Bitmap Index Scan you see in the query plans is not an exhaustive row search. Instead, PostgreSQL builds a bitmap of all the rows that match the query and then uses it to fetch the desired rows. This is faster than a sequential table scan for several reasons:

  • Minimize I/O operations: PostgreSQL does not directly read all rows at once, but creates a bitmap in memory that only stores information about which rows need to be retrieved.

  • Parallelism: BitMap Scan can be processed in parallel, which further speeds up the process.

The third plan (using the GIN index) is the best option in practice, despite slightly higher estimated costs and buffer usage. Here's why:

  1. Scalability: Although the first plan is faster on small data, it does not scale. The second plan uses a more complex search, but still relies on a full table scan, which makes it slow on large amounts of data.

  2. Using the index: The third plan uses the GIN index, which significantly speeds up searching, especially on large data sets. Full-text search with indexes is typically faster and more efficient than sequential table scans as the data size increases.

  3. More efficient resource management: Bitmap Heap Scan and Bitmap Index Scan optimize memory usage and allow the system to avoid a full table scan, which is critical for performance.

On small volumes of data, the difference is insignificant, but on large data, the third plan with a GIN index will prove to be much more efficient due to faster search and less system load.

How does PostgreSQL rank search results?

Ranking full-text search results is the process of sorting retrieved documents by their relevance to the query. Unlike a simple search, which merely returns documents containing the searched words, ranking lets you estimate how important and useful these documents are to the user, based on how often the query words occur in a document and how close together they are. This is especially important in cases where:

  1. The query includes several keywords: If a user is looking for a document with several words, it is not only their presence that is important, but also how closely they are related to each other.

  2. The data volume is large: When there are many documents and they all contain the searched keywords, the user may encounter many irrelevant results. Ranking helps highlight more useful documents.

  3. User Experience: Results ordered by relevance make data navigation much easier and improve the overall app experience.

Imagine a news aggregator where users search for articles about “artificial intelligence” and “machine learning.” The search should return articles in which these two keywords appear most often, preferably close to each other. Basic SQL queries don't have built-in capabilities for this. If ranking is not used, the result may be random articles containing these words, but located far from each other and without semantic connection.

PostgreSQL provides several built-in functions for ranking search results:

  • ts_rank(): Estimates the frequency of occurrence of keywords from a query in a document.

  • ts_rank_cd(): Evaluates the "cover density" of the found keywords, i.e. how close together they appear in the document.

Let's perform a full-text search for the keywords "AI" and "machine learning", using ts_rank() to rank the results:

SELECT body, ts_rank(tsvector_col, to_tsquery('AI & machine & learning')) AS rank
FROM documents
WHERE tsvector_col @@ to_tsquery('AI | machine | learning')
ORDER BY rank DESC;

As a result we get

                                                 body                                                  |    rank     
-------------------------------------------------------------------------------------------------------+-------------
 Machine learning is a subset of AI that allows systems to learn from data.                            |   0.3400137
 Deep learning is a subset of machine learning that uses neural networks to model complex patterns.    |  0.18152626
 Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. | 0.021730691

In this example, documents are sorted by relevance: articles where more of the query words appear, and appear more often, receive a higher rank and sit higher in the list.

Now let's refine our query with ts_rank_cd(), which takes into account not only the frequency of words but also their proximity to each other

SELECT body, ts_rank_cd(tsvector_col, to_tsquery('AI & machine & learning')) AS rank
FROM documents
WHERE tsvector_col @@ to_tsquery('AI | machine | learning')
ORDER BY rank DESC;

As a result we get

                                                 body                                                  | rank 
-------------------------------------------------------------------------------------------------------+------
 Machine learning is a subset of AI that allows systems to learn from data.                            | 0.02
 Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. |    0
 Deep learning is a subset of machine learning that uses neural networks to model complex patterns.    |    0

Here, documents where the keywords are located close to each other (for example, "Machine learning is a subset of AI …") receive a higher rank. This is especially useful when not only the presence of words matters, but also the context of their use in the text.
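Both ranking functions also accept an optional normalization bitmask that compensates for document length; for example, flag 1 divides the rank by 1 plus the logarithm of the document length. A sketch against the documents table above:

```sql
-- Rank with length normalization (flag 1 = divide by 1 + log(document length))
SELECT body,
       ts_rank_cd(tsvector_col, to_tsquery('machine & learning'), 1) AS rank
FROM documents
WHERE tsvector_col @@ to_tsquery('machine & learning')
ORDER BY rank DESC;
```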

When is ranking required?

  1. Search through large volumes of data: When there are thousands or millions of documents in a database, sorting by relevance becomes critical. The user wants to see the most significant documents at the top of the list.

  2. Topic search: If users are searching for information on complex topics, such as scientific research or technology, it is important that results where key tokens are closely related are displayed higher.

  3. Multi-level search: In systems where data is structured at multiple levels (eg title, abstract, body), ranking helps highlight documents where keywords are found in more important parts of the text (eg title).
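The multi-level case can be sketched with setweight(), which labels parts of a combined tsvector with weights 'A' through 'D'; ts_rank then favors matches in higher-weighted parts. A minimal sketch, assuming the documents table from the earlier examples with its title and body columns:

```sql
-- Title matches carry weight 'A', body matches weight 'D';
-- a hit in the title therefore ranks higher than the same hit in the body.
SELECT title,
       ts_rank(
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body),  'D'),
         to_tsquery('machine & learning')
       ) AS rank
FROM documents
ORDER BY rank DESC;
```

With this query, a row whose title contains “Machine Learning” outranks rows where the words appear only in the body, even if the body mentions them more often.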

ts_rewrite function

The ts_rewrite function is a powerful tool that lets you dynamically rewrite search queries to improve the relevance of results without changing the index structure. This is especially useful when you need to manage complex queries that include synonyms or qualifying words. The main benefit is the ability to rewrite queries on the fly, which makes search logic easier to manage and keeps it flexible and efficient for end users.

Before ts_rewrite appeared, working with synonyms or equivalents required manually adding all possible variations of lexemes to search queries. This made queries unwieldy and difficult to manage. For example, if you wanted to find documents mentioning “Artificial intelligence” or its synonym “AI”, you had to include all possible options in the query in advance:

For example

SELECT * FROM documents WHERE to_tsvector(body) @@ to_tsquery('(Artificial & intelligence) | AI | ML');
 id |          title          |                                                 body                                                  |                                            tsvector_col                                             
----+-------------------------+-------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------
  1 | Artificial Intelligence | Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. | 'ai':3 'aim':11 'artifici':1 'comput':8 'creat':13 'field':6 'intellig':2,14 'machin':15 'scienc':9
  2 | Machine Learning        | Machine learning is a subset of AI that allows systems to learn from data.                            | 'ai':7 'allow':9 'data':14 'learn':2,12 'machin':1 'subset':5 'system':10

Disadvantages of this approach:

  1. Queries become complex when there are many synonyms to consider.

  2. Each change to the set of synonyms would require a change in queries or in application logic, which increased the complexity of maintaining the system.

  3. It was necessary to maintain the logic associated with rewriting requests manually in the application.

With the advent of the ts_rewrite function, it became possible to define synonyms and automatically rewrite search queries based on rules, without changing the query structure or indexes. This greatly simplifies working with queries and improves the relevance of results.

How does this work:

  • Rewriting queries: the ts_rewrite function allows you to define rules for replacing parts of a query with more precise expressions or synonyms.

  • Simplifying the logic: instead of constantly changing queries, you can store synonyms in a separate table and rewrite queries dynamically as they run.

Usage example of ts_rewrite

  • Let's create a table of synonyms:

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);

  • Let's insert the data into the synonyms table. Here, for example, the synonym “Artificial intelligence” has been added for the word “AI”:

INSERT INTO aliases VALUES(to_tsquery('AI'), to_tsquery('Artificial & intelligence | AI'));
INSERT INTO aliases VALUES(to_tsquery('ML'), to_tsquery('Machine & learning | ML'));

  • Now, using the ts_rewrite function, we can dynamically rewrite the query:

SELECT ts_rewrite(to_tsquery('AI & ML'), 'SELECT * FROM aliases');
                             ts_rewrite                             
--------------------------------------------------------------------
 ( 'ml' | 'machin' & 'learn' ) & ( 'ai' | 'intellig' & 'artifici' )

Here the query is automatically rewritten: instead of simply searching for the word “AI”, it also searches for documents containing “Artificial intelligence”, without requiring the developer to specify this manually.

SELECT * FROM documents 
WHERE to_tsvector(body) @@ ts_rewrite(to_tsquery('AI | ML'), 'SELECT * FROM aliases');

As a result we get
 id |          title          |                                                 body                                                  |                                                   tsvector_col                                                    
----+-------------------------+-------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------
  1 | Artificial Intelligence | Artificial intelligence (AI) is a field of computer science that aims to create intelligent machines. | 'ai':3 'aim':11 'artifici':1 'comput':8 'creat':13 'field':6 'intellig':2,14 'machin':15 'scienc':9
  2 | Machine Learning        | Machine learning is a subset of AI that allows systems to learn from data.                            | 'ai':7 'allow':9 'data':14 'learn':2,12 'machin':1 'subset':5 'system':10
  3 | Deep Learning           | Deep learning is a subset of machine learning that uses neural networks to model complex patterns.    | 'complex':15 'deep':1 'learn':2,8 'machin':7 'model':14 'network':12 'neural':11 'pattern':16 'subset':5 'use':10

What has improved with ts_rewrite:

  1. Simplifying queries: Instead of having to explicitly include all synonyms and qualifying words, this is done automatically.

  2. Flexibility: ease of changing the set of synonyms without the need to reindex data or rewrite query logic.

  3. Performance Improvement: queries remain simple and more efficient to execute because complex application-level constructs are not required.
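Besides the table-driven form used above, ts_rewrite also has a simpler three-argument form, ts_rewrite(query, target, substitute), which performs a single substitution directly, without a synonyms table:

```sql
-- Replace every occurrence of 'AI' in the query with the synonym group,
-- leaving the rest of the query untouched.
SELECT ts_rewrite(to_tsquery('AI & ML'),
                  to_tsquery('AI'),
                  to_tsquery('Artificial & intelligence | AI'));
```

This form is handy for one-off rewrites; the table-driven form scales better once the set of synonyms grows.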

Phrase search operators: <-> and <N>

These operators are designed to create queries that look for tokens (keywords) in text that are adjacent to each other or within a specified distance of each other.

The <-> operator is used to search for documents where two words follow each other (adjacent tokens).

For example

SELECT to_tsvector('english', 'fat cat'), to_tsvector('english', 'fat cat') @@ to_tsquery('fat <-> cat');

Result

   to_tsvector   | ?column? 
-----------------+----------
 'cat':2 'fat':1 | t

This query will return true, because the words “fat” and “cat” follow each other.

The <N> operator is used to search for documents where two words appear in the given order at a distance of exactly N positions from each other.

For example


SELECT to_tsvector('english', 'fat rat and cat'), to_tsvector('english', 'fat rat and cat') @@ to_tsquery('fat <3> cat');

Result

       to_tsvector       | ?column? 
-------------------------+----------
 'cat':4 'fat':1 'rat':2 | t

The query will return true, since “fat” precedes “cat” at a distance of 3 ('cat':4 - 'fat':1).

The tsquery_phrase function

The tsquery_phrase() function provides a convenient syntax for phrase searches. It makes queries clearer and offers additional flexibility when specifying the distance between words.

tsquery_phrase(query1 tsquery, query2 tsquery) creates a phrase query equivalent to the <-> operator.

For example

SELECT to_tsvector('english', 'fat cat'), to_tsvector('english', 'fat cat') @@ tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'));

Result

   to_tsvector   | ?column? 
-----------------+----------
 'cat':2 'fat':1 | t

This query will return true, since “fat” and “cat” follow each other.

tsquery_phrase(query1 tsquery, query2 tsquery, distance integer) allows you to set the distance between the two words.

For example

SELECT to_tsvector('english', 'fat rat and cat'), to_tsvector('english', 'fat rat and cat') @@ tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 3);

Result:

       to_tsvector       | ?column? 
-------------------------+----------
 'cat':4 'fat':1 'rat':2 | t

The query will return true, since “fat” precedes “cat” at a distance of 3 ('cat':4 - 'fat':1).
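As an aside: when the phrase arrives as plain user text, the phraseto_tsquery() function builds the same kind of phrase query without writing <-> operators by hand, inserting the appropriate distance operators between the parsed lexemes:

```sql
SELECT phraseto_tsquery('english', 'fat cat');
-- 'fat' <-> 'cat'
```

Unlike to_tsquery(), it does not interpret operators in its input, which makes it safer for raw search strings typed by users.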

Instead of a conclusion

Full-text search is an important element of modern applications that work with large volumes of text. In this article, we've covered just a small portion of the full-text search capabilities that PostgreSQL offers to improve performance and user experience.

Simple methods like LIKE are ineffective for searching large texts, especially when users need to search for phrases or synonyms. PostgreSQL offers powerful tools such as tsvector, tsquery, and GIN indexes, which speed up searches, take morphology into account, and make it possible to rank results by relevance.
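To make all of the queries above fast in practice, the tsvector should be stored and indexed rather than recomputed for every row. A minimal sketch, assuming the tsvector_col column from the earlier examples (the index name is illustrative):

```sql
-- A GIN index lets @@ searches use the index instead of a sequential scan.
CREATE INDEX documents_tsv_idx ON documents USING GIN (tsvector_col);

-- This query can now be answered from the index:
SELECT title FROM documents
WHERE tsvector_col @@ to_tsquery('english', 'machine & learning');
```

On PostgreSQL 12 and later, the column itself can be declared as a stored generated column over body, so it stays in sync automatically without triggers.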
