Understanding Vespa. Part 1

This article opens a three-part series on working with the Vespa data storage search system.

From this article you will learn:

  • How to run Vespa configuration server in Docker.

  • How to configure Vespa.

  • What the structure of a data schema looks like.

  • How to filter fields in search results.

  • How to disable schema and configuration file validation for local debugging.

In the following parts, we will discuss how searching, ranking and grouping are organized in Vespa, and also compare the speed of CRUD operations in ElasticSearch and Vespa.

Search engines

At the moment, there are many search engines used to store data and return relevant search results. According to the ranking maintained by the company solid IT, the leading position is held by ElasticSearch, which is significantly ahead of its closest competitors.

A full list of search engines and an explanation of how the ranking is calculated are available at the link below.

If you take a closer look at ElasticSearch, it becomes obvious why this system is so popular:

  • Open source.

  • Powerful API – ready-made clients are available in many popular programming languages, including Java and Python.

  • Good horizontal scalability.

  • A flexible data model that lets you work without a schema and process documents with different fields and structures, while the ability to define schemas (mappings) remains.

  • Data retrieval is performed using a widely used format – JSON.

  • Data replication increases the fault tolerance of the system.

But ElasticSearch has one major drawback: documents are immutable, so a partial update reindexes the entire document. This significantly slows down loading large volumes of data.

One of the alternatives to ElasticSearch is Vespa. Vespa is a system for high-load full-text search with filtering and ranking of the result set. Vespa uses a strict, structured data model, which is described in a schema.

Vespa addresses the main drawback of ElasticSearch by providing the ability to process large data sets with low latency and partial document updates.
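To illustrate, a partial update in Vespa changes only the listed fields without touching the rest of the document. A sketch of such a request to the Document API, in the same style as the other HTTP examples in this article (the product namespace, document id 1, and the updated field are assumptions for this example):

```http
PUT /document/v1/product/sneakers/docid/1 HTTP/1.1
Host: localhost:8080
Content-Type: application/json

{
    "fields": {
        "season": { "assign": "winter" }
    }
}
```

Only the season field is rewritten; the other fields of the document keep their indexed values.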

It is worth clarifying that Vespa is a compromise system and it has a number of disadvantages:

  • Unfortunately, at the time of writing, Vespa does not have a full-fledged search client for modern languages such as Java and Python. The Vespa developers recommend using any available HTTP client for searching.

  • Complex two-stage deployment.

  • Because the documentation is not complete enough and little practical knowledge about Vespa has accumulated, it is hard to understand the details of its operation (this article will partially correct this). For example, finding information about how the from-disk keyword works is a whole quest.

  • The data schema is stored in a special .sd (schema definition) format.

Test project

For the test project, we will take as a basis a database containing information about products in a shoe store. At the first stage, the database will contain two categories of shoes: sneakers and boots. There will also be an entity reflecting the availability of each category in each store.

The characteristics of the products were chosen to be as simple and understandable as possible for the readers of the article.

Let's describe our products in a data diagram:

Data schema

Vespa CLI

Before you start working with Vespa, you need to install the Vespa CLI. A link to the current release version is below:

To work on Windows, download the archive and add the folder with the executable file to the PATH environment variable. If everything is done correctly, the vespa [cmd] command will be available from the command line.
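Once the folder is on PATH, a minimal sanity check might look like this (a sketch; it assumes a self-hosted Vespa instance running locally, as set up in the Docker section below):

```shell
# Point the CLI at a local, self-hosted Vespa instance
vespa config set target local

# Wait until the config server responds (up to 300 seconds)
vespa status deploy --wait 300
```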

Demo project

You can clone the demo project repository from the link below:

Vespa in Docker

To deploy Vespa in Docker, simply download the image with the Vespa configuration server. An example of the setup can be found in the docker-compose file:

docker-compose.yml

version: "3.8"
name: vespa
services:
  vespa:
    container_name: vespa
    image: vespaengine/vespa
    ports:
      - "8080:8080"
      - "19071:19071"
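With this file in place, the container can be started and checked (a sketch; port 19071 serves the config server, port 8080 the application container):

```shell
# Start the Vespa container in the background
docker compose up -d

# The config server answers on 19071 once it is up
curl -s http://localhost:19071/state/v1/health
```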

Configuring Vespa

Vespa offers several ways to deploy settings, but in my opinion the most convenient is the Maven plugin. To use it, create an empty project, then add the build plugins and the dependency for working with the Vespa container:

vespa-config/pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>ru.sportmaster</groupId>
        <artifactId>vespa</artifactId>
        <version>1.0-SNAPSHOT</version>
    </parent>

    <artifactId>vespa-config</artifactId>
    <!-- Special packaging type for Vespa -->
    <packaging>container-plugin</packaging>

    <dependencies>
        <!-- Library for working with Vespa -->
        <dependency>
            <groupId>com.yahoo.vespa</groupId>
            <artifactId>container</artifactId>
            <version>${vespa.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- The plugin builds and packages Vespa components into a container bundle -->
            <plugin>
                <groupId>com.yahoo.vespa</groupId>
                <artifactId>bundle-plugin</artifactId>
                <version>${vespa.version}</version>
                <extensions>true</extensions>
                <configuration>
                    <!-- Any warnings will fail the build -->
                    <failOnWarnings>true</failOnWarnings>
                </configuration>
            </plugin>
            <!-- Packages the Vespa application components into an archive -->
            <plugin>
                <groupId>com.yahoo.vespa</groupId>
                <artifactId>vespa-application-maven-plugin</artifactId>
                <version>${vespa.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>packageApplication</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
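The two-stage deployment mentioned earlier then looks like this: Maven packages the application, and the Vespa CLI uploads the package to the config server. A sketch, assuming the commands are run from the module directory and the config server from the Docker setup is reachable:

```shell
# Stage 1: build the application package (produces target/application.zip)
mvn clean package

# Stage 2: deploy the package to the running config server
vespa deploy --wait 300 target/application.zip
```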

The Vespa server is configured using the services.xml file. An example of a simple configuration for user-defined documents:

src/main/application/services.xml

<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">

    <container version="1.0" id="default">
        <!-- Enables the search part of the container; search will not work without it -->
        <search/>
        <!-- Enables the API for working with documents -->
        <document-api/>
    </container>

    <!-- Creates a content cluster that stores and indexes documents -->
    <content id="product" version="1.0">
        <!-- Defines which document types are routed to this cluster -->
        <documents>
            <document mode="index" type="sneakers" />
            <document mode="index" type="boots" />
        </documents>
        <!-- Number of replicas of a document in the cluster -->
        <redundancy>2</redundancy>
        <!-- Defines the set of nodes in the cluster -->
        <nodes>
            <!-- distribution-key is the node identifier used by the data distribution algorithm -->
            <node hostalias="node-1" distribution-key="0"/>
            <node hostalias="node-2" distribution-key="1"/>
            <node hostalias="node-3" distribution-key="2"/>
        </nodes>
    </content>
  
</services>

This configuration creates 3 nodes within the cluster. It is important to note that the redundancy property determines the total number of copies of each document stored across the cluster, not per node. In this case, a document and its copy will be stored on 2 of the 3 available nodes.

Example of data cluster with redundancy = 1

The document.mode attribute specifies the data storage mode; three options are available:

  • index — indexing mode, the document will be available for searching.

  • store-only — normal storage mode, search and indexing are not available.

  • streaming — streaming mode searches over raw data and is suitable when queries cover a small amount of information. In such cases indexing is not very effective, since it requires extra resources to create and maintain indexes; streaming mode finds information quickly without them.

The document.type attribute must point to a document from the data schema (see below).
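For comparison, the other two modes would be declared like this in services.xml (a sketch; the inventory and audit_log document types are invented for illustration):

```xml
<documents>
    <!-- Searchable over raw data, no index maintenance -->
    <document mode="streaming" type="inventory" />
    <!-- Stored, but neither searchable nor indexed -->
    <document mode="store-only" type="audit_log" />
</documents>
```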

Data schema

As I mentioned earlier, the data schema is stored in a special .sd format.

As an example, let's take the data schema of an abstract document that describes the general properties of shoes:

src/main/application/schemas/shoes.sd

# Shoe schema
schema shoes inherits product {
    # Document describing fields common to all shoes
    document shoes inherits product {
        # Season
        field season type string {
            indexing: summary | index
        }
        # Material
        field material type map<string, int> {
            indexing: summary
            struct-field key {
                indexing: attribute
            }
            struct-field value {
                indexing: attribute
            }
        }
        # Gender
        field gender type string {
            indexing: summary | index
        }
    }
}

It is worth noting that we cannot save this abstract document in Vespa, since it is not listed in the content.documents section of the server configuration (services.xml).

Using the inherits keyword, we inherit the base fields of the product document.

The fields in the schema support the following data types:

  • bool, byte, double, float, int, long — simple types. Do not support indexing.

  • array — an array of simple types or data structures. Each element of this type is indexed separately.

  • map — associative array. Does not support indexing.

  • position — coordinates by latitude and longitude. Does not support indexing.

  • predicate — a field with a set of logical constraints. Indexed in binary format.

  • raw — binary data. Does not support indexing.

  • reference — a field with a link to a global document. Prevents indexing at the application deployment level.

  • annotationreference — a field with a link to the annotation.

  • string — string. Indexed.

  • struct — the field type can be any data structure. Does not support indexing.

  • tensor — a tensor type field. Indexed.

  • uri — a field for URLs. Supports indexing, with the address parsed into components.

  • weightedset — a field where each value is given a weight. Indexed.

The indexing property, which specifies the indexing type, deserves a separate discussion. The following options are available:

  • index — for unstructured text. It creates a text index and stores the parsed string — tokens — in it. This allows searching by tokens. By default, the index name matches the field name.

  • attribute — for structured data. Makes the field available for sorting, grouping, and ranking. Allows searching by exact match.

  • summary — adds a field to the document summary (see below).

  • set_language — the ability to set the language for the string analyzer. By default, OpenNLP is used for tokenization, which supports several languages: English, German, French, Spanish and Italian. However, it is possible to use Lucene Linguistics. We will discuss this in more detail in the second article, dedicated to search.

Interestingly, these types can be combined using the | symbol. If you specify everything at once (summary | index | attribute), then the index type will be used.
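To tie the type and indexing lists together, here is a hypothetical schema fragment (the field names are invented for illustration) combining several types and indexing options:

```text
# Full-text index over unstructured text
field description type string {
    indexing: summary | index
}
# Exact-match field, available for sorting and grouping
field sizes type array<int> {
    indexing: summary | attribute
}
# Each value carries a weight, e.g. for popularity boosting
field tags type weightedset<string> {
    indexing: summary | attribute
}
```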

Document Summary

Let's talk a little about the document summary. Essentially, it is a description of which document fields, and in what form, should be presented in the search result. By default, the default summary is available, which can be selected using the HTTP request parameter presentation.summary. It displays all fields that have summary among their indexing types. You can also describe your own summaries in the data schema:

src/main/application/schemas/sneakers.sd

# Document summary; to use it, add "presentation.summary": "demo-summary" to the query
document-summary demo-summary {
    # Renames the pavement field to pavement_demo_rename
    summary pavement_demo_rename {
        source: pavement
    }
    from-disk
}

Thus, you can, for example, change the resulting information for a sneaker search:

POST /search/ HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Content-Length: 97

{
    "yql": "select * from sneakers where true",
    "presentation.summary": "demo-summary"
}

In the response, only the field with the alias pavement_demo_rename will be displayed; the rest of the fields will be hidden:

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 1
        },
        "coverage": {
            "coverage": 100,
            "documents": 1,
            "full": true,
            "nodes": 3,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "index:product/2/c4ca42387a14dc6e295d3d9d",
                "relevance": 0.0,
                "source": "product",
                "fields": {
                    "sddocname": "sneakers",
                    "pavement_demo_rename": "ASPHALT"
                }
            }
        ]
    }
}

Validating settings during deployment

Suppose we have described the server configuration and the product data schemas, added some products, and then need to change a field type in one of the schemas:

src/main/application/schemas/boots.sd

Before:
  
  field moisture type bool {
    indexing: summary | attribute
  }

After:
  
  field moisture type string {
    indexing: summary | index
  }

In this case, Vespa cannot apply the settings correctly, because we have changed not only the field type but also the indexing type. Vespa does not know how to process the already saved and indexed elements, which could affect data integrity, so by default Vespa returns an error:

Error: invalid application package (400 Bad Request)
Invalid application: indexing-change:
Document type 'boots': Field 'moisture' changed:
add index aspect, matching: 'word' -> 'text', stemming: 'none' -> 'best',
normalizing: 'LOWERCASE' -> 'ACCENT', summary field 'moisture' transform: 'attribute' -> 'none'

However, for local work this may not matter much, since we do not always need data integrity while debugging. We can therefore disable some rules by creating a special file that overrides the validation:

src/main/application/validation-overrides.xml

<validation-overrides>
    <!-- Changing the indexing type of fields in the data schema -->
    <allow until="2024-03-07" comment="For local work">indexing-change</allow>
    <!-- Changing the indexing mode (services.xml) -->
    <allow until="2024-03-07" comment="For local work">indexing-mode-change</allow>
    <!-- Changing a field's data type in the data schema -->
    <allow until="2024-03-07" comment="For local work">field-type-change</allow>
    <!-- Changing a tensor type -->
    <allow until="2024-03-07" comment="For local work">tensor-type-change</allow>
    <!-- Significant (>50%) reduction of node resources -->
    <allow until="2024-03-07" comment="For local work">resources-reduction</allow>
    <!-- Removing a content cluster or changing its id (services.xml) -->
    <allow until="2024-03-07" comment="For local work">content-cluster-removal</allow>
    <!-- Changing a global attribute in a content cluster -->
    <allow until="2024-03-07" comment="For local work">global-document-change</allow>
    <!-- Changing a global endpoint -->
    <allow until="2024-03-07" comment="For local work">global-endpoint-change</allow>
    <!-- Increasing data redundancy -->
    <allow until="2024-03-07" comment="For local work">redundancy-increase</allow>
    <!-- Redundancy equal to one is not allowed -->
    <allow until="2024-03-07" comment="For local work">redundancy-one</allow>
    <!-- Removing a certificate -->
    <allow until="2024-03-07" comment="For local work">certificate-removal</allow>
</validation-overrides>

The until attribute of the allow element indicates the last day on which the rule is valid. The maximum validity period of a rule is 30 days.

Conclusion

This completes the setup and assembly of a simple Vespa server. With the current settings, we can use Vespa to store and read documents.

In the next article we will discuss:

  • What is the difference between DocumentProcessor and QueryProcessor and what tasks do they perform?

  • How does a text tokenizer work?

  • How to rank, group and search by specified conditions.

  • How search speed over attribute fields compares with and without fast-search.

  • And much more.
