Vector DBMS and other tools for developing ML models

With the rise of generative and large language models, vector databases are gaining momentum. Last time on the beeline cloud blog we discussed how sustainable this trend is and suggested several books for those who want to dive into the topic. Today we have put together a compact selection of open-source DBMSs and search engines that can help in the development of AI systems: Lantern, LanceDB, CozoDB, ArcadeDB, Dart Vector DB, Marqo, and Orama.

Image - Joshua Sortino - Unsplash.com


Lantern

Let's go in order. Lantern is a vector extension for PostgreSQL tailored for building AI systems. It can be used to generate embeddings and perform vector searches. Lantern is built on a high-performance search engine (usearch) that implements an approach called Hierarchical Navigable Small World (HNSW). Its structure speeds up approximate similarity search algorithms. Each layer is itself a graph: the zero layer contains all objects, and subsequent layers contain progressively smaller samples. When searching, the algorithm picks a vertex in the top-layer graph, quickly finds candidates close to the query, and then continues the search from those candidates on the layers below.
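The layered search can be sketched in a few lines of Python. This toy version omits the neighbor-graph links that real HNSW maintains (each step here scans a whole layer), but it shows how upper layers seed the search on the layers below:

```python
import random

def build_layers(points, num_layers=3):
    """Assign points to layers: layer 0 holds everything, higher
    layers hold progressively smaller random samples."""
    layers = [list(range(len(points)))]  # layer 0: all point indices
    for _ in range(1, num_layers):
        prev = layers[-1]
        layers.append(random.sample(prev, max(1, len(prev) // 4)))
    return layers  # layers[-1] is the top (sparsest) layer

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(points, layers, query):
    """Descend from the top layer: the best candidate found on each
    layer seeds the search on the denser layer below it."""
    entry = layers[-1][0]
    for layer in reversed(layers):
        entry = min(layer + [entry], key=lambda i: dist(points[i], query))
    return entry

random.seed(0)
points = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 2), (8, 8)]
layers = build_layers(points)
best = greedy_search(points, layers, (1.2, 1.2))  # nearest is (1, 1)
```

In real HNSW the per-layer step only visits a vertex's linked neighbors, which is what makes the search sublinear; this sketch keeps just the layering idea.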

Interestingly, Lantern allows building indexes outside the main database and then importing them as files. This way, the database is not loaded during indexing, which improves overall performance. The extension also introduces a new index type for vector columns, lantern_hnsw, which speeds up ORDER BY … LIMIT queries.

Lantern supports multiple distance functions and has two modes:

  • lantern.pgvector_compat=TRUE. This is the default; the developer chooses the distance operator explicitly: Euclidean distance (<->), cosine distance (<=>), or Hamming distance (<+>);

  • lantern.pgvector_compat=FALSE. In this case, the extension selects the distance function automatically.
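A rough sketch of the corresponding SQL, held here in Python strings; the table and column names are illustrative assumptions, and the exact operator-class names should be checked against Lantern's documentation:

```python
# Hypothetical index creation on a vector column; "books" and
# "book_embedding" are made-up names for illustration.
create_index = """
CREATE INDEX book_index ON books
USING lantern_hnsw (book_embedding dist_l2sq_ops);
"""

# In pgvector-compatible mode, the operator in ORDER BY picks the
# distance function: <-> Euclidean, <=> cosine, <+> Hamming.
knn_query = """
SELECT id, title FROM books
ORDER BY book_embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 5;
"""
```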

In general, the developers position Lantern as an alternative to pgvector, an extension that adds vector search to Postgres, and claim that their product builds indexes faster and more accurately. For example, indexing the SIFT dataset took Lantern eight minutes, while pgvector needed almost an hour.

It is worth noting that some Hacker News commenters expressed doubts about the comparison results. In their opinion, the high performance is first of all a merit of the underlying search engine, not of the Lantern authors. On the other hand, an independent assessment confirmed that Lantern does build indexes at least twice as fast (although pgvector still shows lower response times).

In any case, the project continues to develop. The developers plan to introduce hardware-accelerated query processing on the CPU, automatic index selection, version control support, and the ability to run A/B tests on embeddings.

LanceDB

A vector DBMS built on Lance, a columnar memory format, and designed for developing high-performance AI systems. The source code is written in Rust and distributed under the open Apache 2.0 license.

LanceDB is compatible with the LangChain and LlamaIndex ecosystems, as well as the popular PyTorch and TensorFlow frameworks. The DBMS supports vector and full-text search, as well as zero-copy operations, in which the processor does not copy data from one memory area to another but works with a cache or direct memory access instead.
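The zero-copy idea can be illustrated with Python's built-in memoryview, which exposes an existing buffer instead of duplicating it (an analogy for the concept, not LanceDB code):

```python
# A memoryview slice shares the underlying buffer: no bytes are copied.
data = bytearray(b"abcdefgh")
view = memoryview(data)[2:6]   # zero-copy window onto data[2:6]

data[2] = ord("X")             # mutate the underlying buffer...
assert bytes(view) == b"Xdef"  # ...and the view sees the change

copy = bytes(data[2:6])        # a regular slice copies the bytes
data[3] = ord("Y")
assert copy == b"Xdef"         # the copy does not see later changes
```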

A distinctive feature of LanceDB is its support for GPU acceleration. The developers ran a series of benchmarks and claim that the technology can speed up indexing by 20–26 times compared to processing on a CPU.

Hacker News users have shown interest in LanceDB, although some expressed doubt about the need for a specialized DBMS for high-performance AI systems; in their view, standard tools with vector search capabilities are sufficient for these purposes. On the other hand, users note that LanceDB is convenient for developers, since the product supports the most common LLM tooling out of the box.

If you want to get acquainted with this DBMS, the official documentation is a good place to start. The authors also offer various tutorials and “recipes” for building AI applications. For example, they show in practice how to launch your own Q&A bot that answers questions based on text transcripts of YouTube videos.

CozoDB

CozoDB is a transactional relational DBMS that offers vector functionality. In recent versions, users can create HNSW indexes. Interestingly, CozoDB has a built-in “time travel” feature: the DBMS stores the entire history of changes, including previous values, and access to them remains even after they have been overwritten.
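The idea behind time travel can be sketched as an append-only store of timestamped facts (a toy Python model for the concept, not CozoDB's actual implementation, which exposes this through its own query language):

```python
import itertools

class TimeTravelStore:
    """Writes never overwrite: they append (key, timestamp, value)
    facts, so any past state can be reconstructed on demand."""
    def __init__(self):
        self._facts = []                  # (key, ts, value), append-only
        self._clock = itertools.count(1)  # monotonically increasing time

    def put(self, key, value):
        ts = next(self._clock)
        self._facts.append((key, ts, value))
        return ts

    def get(self, key, at=None):
        """Latest value for key as of timestamp `at` (default: now)."""
        hits = [(ts, v) for k, ts, v in self._facts
                if k == key and (at is None or ts <= at)]
        return max(hits)[1] if hits else None

store = TimeTravelStore()
t1 = store.put("status", "draft")
t2 = store.put("status", "published")
assert store.get("status") == "published"
assert store.get("status", at=t1) == "draft"  # old value still accessible
```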

CozoDB can run on various operating systems, including iOS and Android. The developers achieved this flexibility through swappable storage engines: SQLite, RocksDB, Sled, and TiKV. Overall, the project received a large number of positive comments at release; Hacker News users noted the well-designed Python interface as well as the choice of license (MPL 2.0), which allows the DBMS to be used in commercial projects.

ArcadeDB

A multi-model DBMS that supports SQL, Cypher, Gremlin, HTTP/JSON, MongoDB, and Redis. It was created as a fork of the popular OrientDB, with the authors focusing on high performance. ArcadeDB is written in LLJ (Low-Level Java): during operation, the DBMS allocates almost no objects on the heap, so frequent garbage collection is unnecessary. In addition, it uses an optimized JVM, and the core is designed to run efficiently on multiple processors.

The vector model uses the HNSW algorithm based on the Java library Hnswlib. ArcadeDB can be deployed in the cloud using Docker or Kubernetes, as well as on-premises. The DBMS is distributed under the open Apache 2.0 license.

Marqo

Marqo is a search engine for text and images, developed under the open Apache 2.0 license. The engine operates in a “documents in, documents out” format, as the authors themselves call it: the indexing and search APIs accept and return data as documents. Marqo itself is responsible for the full vector workflow: embedding generation, metadata storage, and inference.
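A hypothetical sketch of the documents-in, documents-out flow with the Marqo Python client (the index name and document fields are made up; the client calls are wrapped in main() because they require a running Marqo server):

```python
# Illustrative documents: plain dicts go in, scored documents come out.
documents = [
    {"Title": "Vector search 101", "Description": "How ANN indexes work"},
    {"Title": "Intro to RAG", "Description": "Grounding LLMs in documents"},
]

def main():
    import marqo  # pip install marqo; needs a Marqo server running
    mq = marqo.Client(url="http://localhost:8882")
    mq.create_index("articles")
    # Documents in: Marqo embeds the tensor fields itself.
    mq.index("articles").add_documents(documents,
                                       tensor_fields=["Description"])
    # Documents out: hits come back as documents with relevance scores.
    return mq.index("articles").search(q="how does retrieval work?")
```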

The DBMS supports popular machine learning and AI development frameworks. For example, integration with Haystack allows you to use Marqo as a document store for RAG, question answering, or document search. Marqo also simplifies data flow management for data scientists working with frameworks like Hamilton. There is built-in support for ONNX (Open Neural Network Exchange), a tool for migrating machine learning models from one framework to another (for example, from PyTorch to TFLite).

Marqo is already used in real projects. One engineer built a tool for searching the multilingual legal framework of the European Union. The project's main feature is the ability to find similar laws written in other languages. With a traditional SQL setup, you would have to manually add a system for automatically translating document texts into English, which would inevitably introduce inconsistencies. Marqo removes this limitation thanks to vector search and allows you to find similar legislation across languages.

Image - Rene Böhmer - Unsplash.com


The authors of Marqo also demonstrate usage scenarios for their tool, for example, using it in conjunction with ChatGPT to build Q&A applications.

Orama

Another engine for vector and hybrid search. It is written in JavaScript and has no dependencies, meaning it can be deployed on almost any OS. Orama uses the BM25 algorithm, similar to TF-IDF, to score and rank relevant documents. In addition, the engine allows exact-match search (the exact property). Orama also has a geosearch feature, which lets you filter results by distance from a selected location or by bounding area. By default it is based on the haversine formula, which is fast but assumes the Earth is a perfect sphere, so its accuracy decreases over large distances. For higher precision, Orama can use Vincenty's formula instead.
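The default calculation is easy to reproduce; here is a minimal Python version of the haversine formula (Orama itself is JavaScript, so this is just the underlying math):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius: this is the sphere assumption

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Paris to Berlin: roughly 878 km by great circle
d = haversine_km(48.8566, 2.3522, 52.5200, 13.4050)
```

Vincenty's formula models the Earth as an ellipsoid instead, trading the extra iteration cost for millimeter-level accuracy.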

In recent versions, the developers added a system of official and user plugins. By creating his own plugin, one podcast host implemented search over the transcript database on his website. Another engineer used Orama to develop a tool for searching JavaScript code elements. According to the author, finding the right function or variable, especially in large projects, can be extremely difficult; Orama's full-text search with filters makes the task easier. The source code of the project is available here.

Dart Vector DB

A vector DBMS for Android and iOS applications, written in the Dart language. The project is developed by a team of three enthusiast engineers. The authors were not satisfied with the existing solutions for working with vectors on either platform, so they built their own, Dart Vector DB, for applications based on Flutter.

The authors claim that users' confidential information never leaves the device: all processing takes place locally, and no server is involved. You can use OpenAI embeddings (added with a few lines of code) or generate your own. The tool appeared relatively recently and has not yet built up a community, but it is under the open Apache 2.0 license, so anyone can contribute.

beeline cloud is a secure cloud provider. We develop cloud solutions so that you can provide your customers with the best services.
