Prefix Search or Secret PostgreSQL Operators

Preface

It all started when I was scrolling through the PostgreSQL ChangeLog and completely by accident I came across the following entry:

Add prefix-match operator text ^@ textwhich is supported by SP-GiST (Ildus Kurbangaliev)
This is similar to using var LIKE 'word%' with a btree index, but it is more efficient.

https://www.postgresql.org/docs/release/11.0/

It turned out that a special operator for searching by prefixes (at the beginning of a string) appeared, and it was a revelation for me. There was not a word about it in the documentation for version 11. And it was also accelerated by a special index. I contacted the developers a couple of times and now in the chapter “String Functions and Operators” you will find this operator, but nothing is said that its work can be accelerated by an index. And while looking for a link for this article, I came across another mention of this operator.

Allow the ^@ starts with the operator and the starts_with() function to use btree indexes if using the C collation (Tom Lane)
Previously these could only be used SP-GiST indexes.

https://www.postgresql.org/docs/release/15.0/

Since there is very little information about this in the official documentation, and it is scattered across different sections, I decided to write an article about it.

^@ or starts_with()

As you can see in the documentation, the ^@ operator and the starts_with() function are “the same thing”.

root=# select oprcode from pg_operator where oprname="^@";
   oprcode
-------------
 starts_with
(1 строка)

In other words, when PostgreSQL sees the ^@ operator in your code, it replaces it with a call to the starts_with() function. But what about the fact that only operators, not function calls, can be accelerated by indexes?

root=# create table test_spgist(t text not null);
CREATE TABLE
root=#  insert into test_spgist values ('aa'),('ab'),('ba'),('bb');
INSERT 0 4
root=# create index spgist on test_spgist using spgist (t text_ops);
CREATE INDEX
root=# set enable_seqscan=false; -- обратите на это внимание
SET
root=# explain select * from test_spgist where t ^@ 'a';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Only Scan using spgist on test_spgist  (cost=0.13..8.15 rows=1 width=32)
   Index Cond: (t ^@ 'a'::text)
(2 строки)

Disabled seqscan, because on such small tables (4 lines), the planner will consider it more profitable to directly read the table without using the index (seq scan). As you can see, the operator is used to search by index. Let's check the function.

root=# explain select * from test_spgist where starts_with(t,'a');
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Only Scan using spgist on test_spgist  (cost=0.13..8.15 rows=1 width=32)
   Index Cond: (t ^@ 'a'::text)
   Filter: starts_with(t, 'a'::text)
(3 строки)

It turns out that if PostgreSQL sees a function call, it replaces it with an operator call. Are you confused? Me too. But this means that this operator and the function are really equivalent. I just don't know if there is any real downside to the fact that with the function the plan looks one line longer. As far as I understand, this means that after selecting the necessary rows by index, there will be another run through the results to double-check with the function. And if there are a lot of results, then this will take a lot of time. And the most important thing is that this is not necessary. But is it?

I'm testing on version 16.3 and btree with COLLATE “C” works here too (Tom Lane added this functionality with version 15).

create table test_btree_c(t text not null);
CREATE TABLE
root=#  insert into test_btree_c values ('aa'),('ab'),('ba'),('bb');
INSERT 0 4
root=# create index btree_c on test_btree_c using btree (t collate "C");
CREATE INDEX
root=# explain select * from test_btree_c where t ^@ 'a';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using btree_c on test_btree_c  (cost=0.13..8.15 rows=1 width=32)
   Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
   Filter: (t ^@ 'a'::text)
(3 строки)
root=# explain select * from test_btree_c where starts_with(t,'a');
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using btree_c on test_btree_c  (cost=0.13..8.15 rows=1 width=32)
   Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
   Filter: starts_with(t, 'a'::text)
(3 строки)

Another way to speed up prefix search is described here.
https://www.postgresql.org/docs/17/indexes-opclass.html

It claims it will work with LIKE and regular expressions, but let's check.

create table test_pattern_ops(t text not null);
insert into test_pattern_ops values ('aa'),('ab'),('ba'),('bb');
create index btree_pattern_ops on test_pattern_ops using btree (t text_pattern_ops);
root=# explain (costs false) select * from test_pattern_ops  where t ^@ 'a';
                         QUERY PLAN
-------------------------------------------------------------
 Index Only Scan using btree_pattern_ops on test_pattern_ops
   Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
   Filter: (t ^@ 'a'::text)
(3 строки)

What's interesting is that completely different operators are used here than with a btree index with COLLATE “C”. The mysterious ~>=~ and ~<~. They are missing from the documentation in the section on text operators. I tried googling them, but there is no information about them at all. I only know that there are also operators ~<=~ and ~>~. And they all work with text in a mysterious way.

LIKE or ~~

If an inquisitive reader reads the PostgreSQL documentation in order, he will find the first method of searching by prefix already in the tutorial.
https://www.postgresql.org/docs/17/tutorial-agg.html

It suggests filtering records by prefix using LIKE, for example LIKE 'a%'. And… it's quite possible that if the reader gets so tired of reading that he doesn't read on, this will be the only way to search by prefix that he will know. Can this method use an index? Yes. Let's check all the tables that have already been created.

root=# explain (costs false) select * from test_spgist  where t LIKE 'a%';
                 QUERY PLAN
---------------------------------------------
 Index Only Scan using spgist on test_spgist
   Index Cond: (t ^@ 'a'::text)
   Filter: (t ~~ 'a%'::text)
(3 строки)
root=# explain (costs false) select * from test_btree_c  where t LIKE 'a%';
                      QUERY PLAN
------------------------------------------------------
 Index Only Scan using btree_c on test_btree_c
   Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
   Filter: (t ~~ 'a%'::text)
(3 строки)
root=# explain (costs false) select * from test_pattern_ops  where t LIKE 'a%';
                         QUERY PLAN
-------------------------------------------------------------
 Index Only Scan using btree_pattern_ops on test_pattern_ops
   Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
   Filter: (t ~~ 'a%'::text)
(3 строки)

As you can see, LIKE is also quite well optimized in version 16. There is a new operator ~~, but there is no secret here, it is the equivalent of LIKE, just like the operator ~~* is the equivalent of ILIKE (case-insensitive comparison). Let's check it out too.

root=# explain (costs false) select * from test_spgist  where t ILIKE 'a%';
                 QUERY PLAN
---------------------------------------------
 Index Only Scan using spgist on test_spgist
   Filter: (t ~~* 'a%'::text)
(2 строки)
root=# explain (costs false) select * from test_btree_c  where t ILIKE 'a%';
                  QUERY PLAN
-----------------------------------------------
 Index Only Scan using btree_c on test_btree_c
   Filter: (t ~~* 'a%'::text)
(2 строки)
root=# explain (costs false) select * from test_pattern_ops  where t ILIKE 'a%';
                         QUERY PLAN
-------------------------------------------------------------
 Index Only Scan using btree_pattern_ops on test_pattern_ops
   Filter: (t ~~* 'a%'::text)
(2 строки)

As you can see, ILIKE does not support indexed search with any index. The index is only used to read the list of values of the entire table.

SIMILAR TO

Another operator that can be used to search by prefix. The documentation describes it as a SQL standard regular expression, something between LIKE and regular (POSIX) regular expressions. Let's check it out.

root=# explain (costs false) select * from test_spgist  where t SIMILAR TO 'a%';
                 QUERY PLAN
---------------------------------------------
 Index Only Scan using spgist on test_spgist
   Index Cond: (t ^@ 'a'::text)
   Filter: (t ~ '^(?:a.*)$'::text)
(3 строки)
root=# explain (costs false) select * from test_btree_c  where t SIMILAR TO 'a%';
                      QUERY PLAN
------------------------------------------------------
 Index Only Scan using btree_c on test_btree_c
   Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
   Filter: (t ~ '^(?:a.*)$'::text)
(3 строки)
root=# explain (costs false) select * from test_pattern_ops  where t SIMILAR TO 'a%';
                         QUERY PLAN
-------------------------------------------------------------
 Index Only Scan using btree_pattern_ops on test_pattern_ops
   Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
   Filter: (t ~ '^(?:a.*)$'::text)
(3 строки)

Nothing unusual in principle, except that despite the fact that the call SIMILAR TO 'a%' is very similar to the call of the LIKE operator, PostgreSQL converts it into a normal regular expression. This is indicated by the use of the ~ operator, this is the operator for matching with normal (POSIX) regular expressions.

Regular (POSIX) regular expressions or ~

There are two operators for them, ~ is a case-sensitive match, ~* is case-insensitive. You can search by the prefix 'a' using the expression ~ '^a'. Case-insensitive indexes still don't work, so I'll only show case-sensitive search.

explain (costs false) select * from test_spgist  where t ~ '^a';
                 QUERY PLAN
---------------------------------------------
 Index Only Scan using spgist on test_spgist
   Index Cond: (t ^@ 'a'::text)
   Filter: (t ~ '^a'::text)
(3 строки)
root=# explain (costs false) select * from test_btree_c  where t ~ '^a';
                      QUERY PLAN
------------------------------------------------------
 Index Only Scan using btree_c on test_btree_c
   Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
   Filter: (t ~ '^a'::text)
(3 строки)
root=# explain (costs false) select * from test_pattern_ops  where t ~ '^a';
                         QUERY PLAN
-------------------------------------------------------------
 Index Only Scan using btree_pattern_ops on test_pattern_ops
   Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
   Filter: (t ~ '^a'::text)
(3 строки)

left

What else could it be? Maybe you know other options? Write in the comments. Another thing that comes to mind is left(t,length('a')) = 'a', where 'a', as everywhere above, is the sought prefix. But this, of course, cannot be indexed, because the width of the sought prefix can be different. But let's imagine that we are interested in searching by prefixes of a fixed length. For example, we always search only by the first letter, as in the examples above. This means that it will be possible to create a specialized index by expression.

create table test_btree(t text not null);
insert into test_btree values ('aa'),('ab'),('ba'),('bb');
create index on test_btree using btree (left(t,1));
explain (costs false) select * from test_btree where left(t,1) = 'a';
                     QUERY PLAN
----------------------------------------------------
 Index Scan using test_btree_left_idx on test_btree
   Index Cond: ("left"(t, 1) = 'a'::text)

The obvious advantage of this method is that you can use any btree index for it, without specifying COLLATE “C”. This means that if you create your own collate, where, for example, Ё is equated to E, and uppercase letters are equated to lowercase (a completely realistic example), then you will not be able to use it with all the above methods except this one. The obvious disadvantage is that the index only works for searching fixed-length prefixes.

Testing environment

The description of various methods is just a preamble. It is interesting to find out: which of them is the best? I am testing on a Windows machine (128GiB RAM), in a WSL virtual machine. PostgreSQL 16.3, default settings, I only changed:

shared_buffers=32GB
maintenance_work_mem=4GB

But, instead of describing my hardware in detail, it is much more useful to describe in detail the scripts I tested with, so that you can repeat everything on yours. Creating a test table.

DO $do$
BEGIN 
        CREATE TABLE test (t text NOT NULL);
        FOR l0 IN 0..9 LOOP
        FOR l1 IN 0..9 LOOP
        FOR l2 IN 0..9 LOOP
        FOR l3 IN 0..9 LOOP
        FOR l4 IN 0..9 LOOP
        FOR l5 IN 0..9 LOOP
        FOR l6 IN 0..9 LOOP
            INSERT INTO test SELECT format('%s%s%s%s%s%s%s',
                  l0,l1,l2,l3,l4,l5,l6);
        END LOOP; --l6
        END LOOP; --l5
        END LOOP; --l4
        END LOOP; --l3
        END LOOP; --l2
        END LOOP; --l1
        END LOOP; --l0
END;
$do$;

I know that it would be more elegant to do it through a recursive function call, but it will work faster this way. The result is a table consisting of unique rows of 7 digits. The table size is 10 million rows, 346MiB. The table for storing the results:

CREATE TABLE results (
        method text NOT NULL, -- название метода поиска по префиксу
        prefix_length smallint NOT NULL, -- длина искомого префикса
        start timestamptz NOT NULL, -- время начала теста
        finish timestamptz NOT NULL, -- время завершения
        planning_time real NOT NULL, -- время планирования запроса
        execution_time real NOT NULL -- время выполнения по данным EXPLAIN ANALYZE 
);

The tables are ready, all that remains is to fill them out.

DO $do$
DECLARE
   start_ timestamptz;
   finish_ timestamptz;
   planning_time_ real;
   execution_time_ real;
   prefix text;
   e text; -- вывод EXPLAIN
   methods constant text[] := ARRAY['^@', 'starts_with', 'LIKE', 'SIMILAR', 'regexp']; -- left отдельно
   method_where constant text[] := ARRAY[ -- %параметр от format(), где вставить искомый префикс
        't ^@ %L',
        'starts_with(t, %L)',
        't LIKE ''%s%%''',
        't SIMILAR TO ''%s%%''',
        't ~ ''^%s'''
   ];
   indexes constant text[] := ARRAY['sp_gist', 'collate_c', 'pattern_ops'];
   create_index_suffix constant text[] := ARRAY[
        'USING spgist (t text_ops)',
        'USING btree (t COLLATE "C")',
        'USING btree (t text_pattern_ops)'
   ];
BEGIN
   DROP INDEX IF EXISTS test_index; -- cleanup
   PERFORM count(*) FROM test; -- warm cache
   COMMIT AND CHAIN;
   FOR j in 1..30 LOOP
      FOR i IN 1..array_length(indexes, 1) LOOP
	EXECUTE 'CREATE INDEX test_index ON test ' || create_index_suffix[i];
	 COMMIT AND CHAIN;
	 FOR m IN 1..array_length(methods, 1) LOOP
	    FOR l IN 1..7 LOOP -- длина префикса
	       prefix := repeat('5',l);
	       start_ := clock_timestamp();
	       -- будет работать только в английской локализации lc_messages
	       FOR e IN EXECUTE 'EXPLAIN ANALYZE SELECT * FROM test WHERE ' || format(method_where[m], prefix) || ' LIMIT 1' LOOP
		  IF e ^@ 'Planning Time:'
		  THEN
		     planning_time_ := substring(e FROM 'Planning Time: ([.0-9]*) ms')::real;
		  ELSE IF e^@ 'Execution Time:'
		  THEN
		     execution_time_ := substring(e FROM 'Execution Time: ([.0-9]*) ms')::real;
		  END IF; END IF;
	       END LOOP;
	       finish_ := clock_timestamp();
	       INSERT INTO results (method, index, prefix_length, start, finish, planning_time, execution_time)
		  VALUES (methods[m], indexes[i], l, start_, finish_, planning_time_, execution_time_);
	       COMMIT AND CHAIN;
	    END LOOP;
	 END LOOP; --methods
	 DROP INDEX test_index;
	 COMMIT AND CHAIN;
      END LOOP; --indexes
      -- left method
      FOR l IN 1..7 LOOP -- длина префикса
	 prefix := repeat('5',l);
	 EXECUTE format('CREATE INDEX test_index ON test USING btree (left(t, %L))', l);
	 COMMIT AND CHAIN;
	 start_ := clock_timestamp();
	 -- будет работать только в английской локализации lc_messages
	 FOR e IN EXECUTE format('EXPLAIN ANALYZE SELECT * FROM test WHERE left(t, %L) = %L LIMIT 1', l, prefix) LOOP
	    IF e ^@ 'Planning Time:'
	    THEN
	       planning_time_ := substring(e FROM 'Planning Time: ([.0-9]*) ms')::real;
	    ELSE IF e^@ 'Execution Time:'
	    THEN
	       execution_time_ := substring(e FROM 'Execution Time: ([.0-9]*) ms')::real;
	    END IF; END IF;
	 END LOOP;
	 finish_ := clock_timestamp();
	 INSERT INTO results (method, index, prefix_length, start, finish, planning_time, execution_time)
	    VALUES ('left', 'btree', l, start_, finish_, planning_time_, execution_time_);
	 COMMIT AND CHAIN;
	 DROP INDEX test_index;
	 COMMIT AND CHAIN;
      END LOOP; -- prefix
   END LOOP;
END;
$do$;

I take the time from EXPLAIN ANALYZE, parsing will only work if the localization is specified as English, for example LC_MESSAGES = en_US.UTF-8. I perform 30 series of measurements. For each prefix length from 1 to 7 I make separate entries. The first thing that is interesting is, is there a difference between ^@ and starts_with?

It turns out there is. Starts_with is a confident leader when using all types of indexes. It turns out that there is no reason to use the ^@ operator, not only is it somehow slower (although it uses the same function), but starts_with also provides much better code readability. I threw ^@ out of further consideration, because there is a lot of confusion further on.

But despite the fact that it is very difficult to understand here, if you look closely, you can see the following. Firstly, all methods based on sp_gist (text_ops) clearly lose. Indexes based on btree give a noticeably better result and hold on quite tightly. Secondly, you can still see that the obvious leader is LIKE (index btree text_pattern_ops). Why the universal LIKE operator won over the highly specialized starts_with() function is a mystery to me. I expected it to be the other way around, because with narrow specialization there are great opportunities for optimization. And this discovery contradicts the comment from Ildus Kurbangaliev at the beginning of this article. I think there is still something to work on.