Search by arbitrary parameters

Sometimes (often) during the development of a website it becomes necessary to implement filtered search with results sorted by some fixed field: searching for goods in an online store, searching for tours at a travel agency, showing logs filtered by content, and so on. It often turns out that filtering must work on almost any field (and there are dozens of fields), while there are thousands or even millions of records. If there is a lot of data, or it has to be updated often, you cannot simply create an index for every field: the indexes would take up a lot of space or create too much disk load on writes, so you have to come up with something. Let's come up with something.
Trivial case
Let’s first consider the situation where there is little data, few read and write requests, and no special speed requirements either. In this case you can simply take a database (MySQL, Postgres, SQLite, whatever) and filter and sort with plain WHERE and ORDER BY. You may not even need indexes, or you can create an index only on the field used for sorting (from here on this field will be called timestamp for simplicity, because sorting is most often needed by time).
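A minimal sketch of this setup (the table and column names are illustrative, not taken from any particular project): one index on the sort field, and everything else is handled by a plain WHERE clause.

CREATE INDEX IF NOT EXISTS products_timestamp ON products (timestamp);

SELECT * FROM products
WHERE price BETWEEN 1 AND 200
ORDER BY timestamp DESC
LIMIT 100;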
Lots of reading, little writing, response time is not very important
The following situation is quite typical: we have a fairly large number of records (for example, goods in an online store), we need to filter by arbitrary fields, and therefore indexes on individual fields would take up a lot of space and are not guaranteed to be particularly effective anyway. Suppose we have a rather strange online store where we need to show not the cheapest products first but the newest ones; then we can create an index on timestamp and filter records with queries like the following:
SELECT * FROM products
WHERE
name LIKE '%vacuum cleaner%' AND
(price BETWEEN 1 AND 200) AND
brand NOT IN ('horns', 'hooves') -- no index will help here
ORDER BY timestamp DESC
LIMIT 100
In essence, the search will always walk the timestamp index from top to bottom and look at each row until 100 entries satisfying the search criteria are found. The problem with this naive approach is that the rows in the database are not physically ordered by timestamp, but most likely sit in insertion order, so in the worst case such a query performs a random read from the file system for every row, which is very inefficient (in my benchmarks, for example, SQLite can read only about 500,000 rows per second this way, per processor core). The second problem awaits you if the products table gets pushed out of the cache and every row has to be fetched with a random read from disk: even an SSD is unlikely to give you more than a couple of hundred thousand random reads per second, and most likely fewer, especially if your disk is network-attached, as is often the case in the cloud. I don’t even want to mention HDDs: there you can’t expect more than a couple of hundred random reads per second.
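By the way, you can check which index (if any) a query actually walks; in SQLite, for example, this is done with EXPLAIN QUERY PLAN (the query below is just a trimmed version of the example above):

EXPLAIN QUERY PLAN
SELECT * FROM products
WHERE price BETWEEN 1 AND 200
ORDER BY timestamp DESC
LIMIT 100;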
Clustered index
Many DBMSs support so-called clustered indexes, where the data is stored together with the primary key in the form of a B-tree. If you create a composite clustered index that contains enough fields to be unique, with timestamp as the first field, then the data will be clustered by timestamp, and reading from the table with queries like ORDER BY timestamp will be very efficient (20 million rows per second per core in the case of SQLite) and will go in ascending (or descending) order, which lets you stop the search as soon as the required number of records has been found. If the table is not in the cache, reads from disk go in relatively large blocks, which is much more efficient than reading one row at a time.
In MySQL, the primary key of the InnoDB engine is always a clustered index, while in SQLite, a clustered index can be created using the WITHOUT ROWID option. Your vaunted PostgreSQL, by the way, does not support clustered indexes out of the box.
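A sketch of what this could look like in SQLite (the schema is illustrative): WITHOUT ROWID makes the table itself a B-tree keyed by the primary key, so putting timestamp first and adding id keeps the key unique while clustering the rows by time.

CREATE TABLE products (
    timestamp INTEGER NOT NULL,
    id        INTEGER NOT NULL,
    name      TEXT,
    price     INTEGER,
    brand     TEXT,
    PRIMARY KEY (timestamp, id)
) WITHOUT ROWID;

SELECT * FROM products
ORDER BY timestamp DESC
LIMIT 100;  -- walks the clustered B-tree in order and can stop early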
Disadvantages of a Clustered Index
Besides the obvious disadvantage that the set of clustered-index fields must be unique (i.e. a bare AUTO_INCREMENT will not do), there is a less obvious one: insert speed. Because the data is always kept in a (semi-)sorted state inside a B-tree, inserting unsorted data causes a large number of blocks to be rewritten on disk, rather than only the few blocks that actually hold the new rows. A clustered index will not let you insert many rows per second unless the data arrives in ascending order of the primary key (and that cannot always be guaranteed).
Lots of reads, few writes, response time must be minimal
Suppose we have the same online store, where for some reason the goods are shown sorted by time, only now we simply have a lot of goods, there are also a lot of read requests, but the number of possible filter values is relatively limited. If you have ever used Amazon or the like, you may have noticed that filtering usually consists of a relatively small number of brands, ready-made value ranges (screen diagonal and so on), filters like “rating 4.0 and up”, etc. Amazon has millions of products, but there are relatively few filter values to choose from, within a couple of hundred distinct values (though each of the filters is optional).
For such a case the so-called bitmap index is a good fit. With a bitmap index you build an in-memory array that is already sorted by the field we need (in this case, timestamp desc) and consists of bit sets. Each bit corresponds to some possible filter value and is set to 1 if the product matches that filter and 0 if it does not.
Example:
0001110010101101101 <- first element of the array
0001001001100101111 <- second element of the array
...
   ^ the fourth bit is 1 if "screen diagonal from 15'' to 23''"
     ^ the sixth bit is 1 if "average product rating above 4.0"
      ^ the seventh bit is 1 if "average product rating above 4.5"
Searching through such an array in memory is done by building a search mask (for example, if we want to find products with a rating above 4.5 and a diagonal from 15'' to 23'', we build the mask 00010010, with the fourth and seventh bits set), and with the usual logical AND (the & operator) we skip the rows whose bits, ANDed with the mask, do not equal the mask:
uint64_t mask = 0x12; // 00010010: bits 4 and 7 (diagonal 15'' to 23'', rating above 4.5)
for (size_t i = 0; i < n; i++) {
    if ((bitmap[i] & mask) != mask) continue; // fails at least one filter
    ... // product i matches all selected filters; matches come out newest first
}
Bitmap indexes take up very little memory, and the code vectorizes well, so it is quite realistic to get a scan speed of several rows per processor cycle, or, in other words, several billion rows per second per processor core.
Disadvantages of a bitmap index
A bitmap index is only good when there are relatively few possible filter options. It is also fairly obvious that such an index will not work if you need to filter by an arbitrary string. And since the data in the array must already be sorted, the index has to be rebuilt periodically for the search results to include the most recent records.
Lots of reading, lots of writing, fast response time
You’ve been waiting for this moment, haven’t you :)? Use ClickHouse with the primary key on timestamp. In ClickHouse the primary key only defines the sort order, so there is no need to ensure uniqueness. ClickHouse is very fast at both reading and inserting, and on top of that it compresses data well. Create a (materialized) column for every field you need to search by, and you’ll be fine.
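A minimal sketch of such a table (names and types are illustrative): in the MergeTree engine the ORDER BY clause defines the primary key / sort order, and it does not have to be unique.

CREATE TABLE products
(
    timestamp DateTime,
    name      String,
    price     UInt32,
    brand     LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY timestamp;

SELECT * FROM products
WHERE price BETWEEN 1 AND 200
  AND brand NOT IN ('horns', 'hooves')
ORDER BY timestamp DESC
LIMIT 100;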