What is smart contract indexing in Web3 development? (No prior knowledge required)

Hello everyone. I decided to translate my English-language article, in which I compiled what I learned during a year of work at a Web3 infrastructure provider about data on EVM blockchains and the developer tools for accessing it.

It’s hard to say that the culture of data engineering is deeply ingrained in the Web3 developer community. And not every developer can easily define what indexing means in a Web3 context. I would like to clarify some details on this topic and talk about a tool called The Graph, which has become the de facto industry standard for accessing data on the blockchain for the creators of DApps (decentralized applications).

A picture in which I tried to depict the abyss between the two worlds of developers, but no one understood my idea :)

Let’s start with indexing

Indexing in databases is the process of creating a data structure that sorts and organizes the data in a database so that search queries can run efficiently. With an index on a database table, the database server can find and retrieve the rows that match the query criteria faster. This improves database performance and reduces the time needed to retrieve information.
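As a small illustration of the database side, here is a sketch using Python's built-in sqlite3 module. The table and column names (transfers, sender, amount) are invented for the example. It asks SQLite how it plans to run the same query before and after an index is created.

```python
import sqlite3

# In-memory database with a hypothetical "transfers" table (illustrative names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfers (id INTEGER PRIMARY KEY, sender TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transfers (sender, amount) VALUES (?, ?)",
    [(f"wallet_{i % 100}", i) for i in range(10_000)],
)

QUERY = "SELECT amount FROM transfers WHERE sender = ?"


def query_plan(sql: str) -> str:
    """Return SQLite's description of how it intends to execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("wallet_7",)).fetchall()
    # The human-readable detail is the fourth column of every plan row.
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(QUERY)  # without an index: a scan over the whole table
conn.execute("CREATE INDEX idx_transfers_sender ON transfers (sender)")
plan_after = query_plan(QUERY)   # with the index: a search through the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the second plan names the index, which is the whole point: the same query no longer has to touch every row.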

But what about indexing on blockchains? The most popular blockchain architecture is the EVM (Ethereum Virtual Machine).

The Ethereum Virtual Machine (EVM) is the runtime environment that executes smart contracts on the Ethereum blockchain. It is a program that runs on every node of the Ethereum network. It is responsible for executing smart contract code and also provides safety features such as sandboxing and gas metering. The EVM ensures that all participants in the Ethereum network execute smart contracts consistently and securely.

As you may know, blockchain data is stored as blocks with transactions inside. Also, you may know that there are two types of accounts:

Externally owned account (EOA) – identified by any regular wallet address
Contract account – identified by the address of a deployed smart contract

If you send some ether from your account to another externally owned account, nothing happens beyond the transfer itself.

But if you send some ether to a smart contract address with a payload, you are actually calling a method of the smart contract, and that call can in turn produce so-called “internal” transactions.

But if any transaction can be found on the blockchain, why not transform all the data into a large constantly updated database that can be queried in a SQL-like format?

The problem is that you can only read smart contract data if you have the “key” to decode it. Without this “key”, the smart contract data on the blockchain is just opaque bytes. This key is called the ABI (Application Binary Interface).

The ABI (Application Binary Interface) is a standard that defines how a smart contract interacts with the outside world, including other smart contracts and user interfaces. It defines the data structures, function signatures, and argument types of a smart contract, so that the contract and its users can communicate correctly and efficiently.

Every smart contract on the blockchain has an ABI. The problem is that you may not have the ABI for the smart contract you are interested in. Sometimes you can find an ABI file (which is actually a JSON file listing the smart contract's functions and events, similar to an interface for communicating with it):

  • on Etherscan (if the smart contract has been verified, that is, the deployed bytecode has been matched against the bytecode compiled from the published sources)

  • on GitHub (if the developers have open-sourced the project)

  • or if the smart contract is of any standard type such as ERC-20 (fungible tokens like USDT), ERC-721 (non-fungible tokens or NFT), etc.

Of course, if you are a smart contract developer, you have the ABI, because this file is generated at compile time.
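To give a feel for what such a file contains, here is a hand-written fragment in the standard ABI JSON shape, covering the ERC-20 Transfer event and the balanceOf function (parameter names follow the common OpenZeppelin style). The small helper that builds a text signature is only an illustration, not part of any library.

```python
import json

# A hand-written ABI fragment. A real ABI file has one such entry
# for every public function and event of the contract.
ABI_JSON = """
[
  {"type": "event", "name": "Transfer", "anonymous": false,
   "inputs": [
     {"name": "from", "type": "address", "indexed": true},
     {"name": "to", "type": "address", "indexed": true},
     {"name": "value", "type": "uint256", "indexed": false}
   ]},
  {"type": "function", "name": "balanceOf", "stateMutability": "view",
   "inputs": [{"name": "account", "type": "address"}],
   "outputs": [{"name": "", "type": "uint256"}]}
]
"""


def signature(entry: dict) -> str:
    """Build the canonical text signature, e.g. Transfer(address,address,uint256)."""
    arg_types = ",".join(arg["type"] for arg in entry["inputs"])
    return f"{entry['name']}({arg_types})"


abi = json.loads(ABI_JSON)
for entry in abi:
    print(entry["type"], signature(entry))
```

It is the hash of such a signature that appears in raw transactions and logs, which is why the raw data is unreadable without the ABI.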

What it looks like from a developer’s point of view

But let’s not dwell on looking at data through ABI. What if we consider this topic from the point of view of the smart contract developer himself? What is a smart contract? The answer is much simpler than you might think. Here is a simple explanation for those who are familiar with object-oriented programming:

In the developer's code, a smart contract is a class with some fields and methods (smart contracts for EVM-compatible chains are usually written in the Solidity programming language). A deployed smart contract then becomes, in effect, an object of this class. It lives its own life, letting users call its methods and change its internal fields.

Please note that any method call that changes the state of a smart contract is a transaction, and it is usually accompanied by an event that the developer “emits” directly from the code. Let’s illustrate this with a function from an ERC-721 smart contract (a common standard for collections of non-fungible tokens like BoredApeYachtClub) that emits an event when NFT ownership changes.

/**
     * @dev Transfers `tokenId` from `from` to `to`.
     * As opposed to {transferFrom}, this imposes no restrictions on msg.sender.
     *
     * Requirements:
     *
     * - `to` cannot be the zero address.
     * - `tokenId` token must be owned by `from`.
     *
     * Emits a {Transfer} event.
     */

    function _transfer(address from, address to, uint256 tokenId) internal virtual {
        address owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }
        if (to == address(0)) {
            revert ERC721InvalidReceiver(address(0));
        }

        _beforeTokenTransfer(from, to, tokenId, 1);

        // Check that tokenId was not transferred by the _beforeTokenTransfer hook
        owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }

        // Clear approvals from the previous owner
        delete _tokenApprovals[tokenId];

        // Decrease the balance with checked arithmetic, because an ownerOf override may
        // invalidate the assumption that _balances[from] >= 1.
        _balances[from] -= 1;

        unchecked {
            // _balances[to] could overflow in the conditions described in _mint. That would require
            // all 2**256 tokens to be minted, which in practice is impossible.
            _balances[to] += 1;
        }

        _owners[tokenId] = to;

        emit Transfer(from, to, tokenId);

        _afterTokenTransfer(from, to, tokenId, 1);
    }

So what can we see here? When an NFT is transferred from your address to another address, the contract executes the internal _transfer function (it is reached through public methods such as transferFrom), passing in the two addresses and the ID of that NFT. In the code, you can see that some checks are performed and then the user balances change. But the important thing is that near the end of the function there is a line:

emit Transfer(from, to, tokenId);

This means that these three values will be broadcast to the outside world and can be found in the blockchain logs. This is a much cheaper way to keep the necessary historical data, because keeping it in the contract's own storage is too expensive.
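Here is a sketch of how those three values can be read back from a raw log entry. The log has the shape a node returns from the eth_getLogs JSON-RPC method. The addresses, block number, and transaction hash in the sample are made up. In ERC-721 all three Transfer parameters are indexed, so they arrive as topics.

```python
# keccak256("Transfer(address,address,uint256)"): the first topic of every Transfer log.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def decode_erc721_transfer(log: dict) -> dict:
    """Decode an ERC-721 Transfer log whose three parameters are all indexed."""
    topics = log["topics"]
    if len(topics) != 4 or topics[0].lower() != TRANSFER_TOPIC:
        raise ValueError("not an ERC-721 Transfer log")
    return {
        # Addresses are left-padded to 32 bytes; keep the last 20 bytes (40 hex chars).
        "from": "0x" + topics[1][-40:],
        "to": "0x" + topics[2][-40:],
        "tokenId": int(topics[3], 16),
        "blockNumber": int(log["blockNumber"], 16),
        "transactionHash": log["transactionHash"],
    }


# A made-up log entry: token 42 moves from 0x1111... to 0x2222...
sample_log = {
    "address": "0x" + "ab" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
        "0x" + format(42, "064x"),
    ],
    "data": "0x",
    "blockNumber": hex(17_000_000),
    "transactionHash": "0x" + "cd" * 32,
}

decoded = decode_erc721_transfer(sample_log)
print(decoded)
```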

We have now defined all the necessary concepts to explain what indexing is.

Any smart contract (being an object of some class) lives its own life: it is constantly called by users (and other smart contracts) and changes state, emitting “events” along the way. Indexing can therefore be defined as the process of collecting smart contract data (any internal variables of the contract, not only those that are explicitly emitted in events) throughout its life cycle, and storing this data together with transaction hashes and block numbers so that any detail can be found later.

This is very important to note, because with the normal blockchain node interface it is simply impossible to get, for example, the first transaction of wallet “A” with token “B” or the largest transaction in smart contract “C” (or any other such information) if the smart contract does not explicitly store this data (and, as we know, that is very expensive).

That’s why we need indexing. Simple things that we can do in an SQL database become impossible on a blockchain without indexing.

In other words, “indexing” here is synonymous with smart contract data collection, because no indexing means no access to the data in Web3.
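To make the point concrete, here is a sketch of what becomes possible once decoded events sit in an ordinary database. All rows, wallet names, and token names below are invented. The table simply stands in for whatever an indexer has collected.

```python
import sqlite3

# Decoded events, as an indexer might have stored them (all values invented).
events = [
    # (block_number, tx_hash, token, sender, recipient, amount)
    (100, "0xaa01", "TOKEN_B", "wallet_A", "wallet_C", 50),
    (105, "0xaa02", "TOKEN_B", "wallet_D", "wallet_A", 900),
    (101, "0xaa03", "TOKEN_X", "wallet_A", "wallet_D", 7),
    (120, "0xaa04", "TOKEN_B", "wallet_A", "wallet_D", 10),
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transfer (block_number INTEGER, tx_hash TEXT, token TEXT,"
    " sender TEXT, recipient TEXT, amount INTEGER)"
)
db.executemany("INSERT INTO transfer VALUES (?, ?, ?, ?, ?, ?)", events)

# The first transaction of wallet "A" with token "B".
first_tx = db.execute(
    "SELECT tx_hash FROM transfer"
    " WHERE token = ? AND (sender = ? OR recipient = ?)"
    " ORDER BY block_number LIMIT 1",
    ("TOKEN_B", "wallet_A", "wallet_A"),
).fetchone()[0]

# The largest transfer of token "B".
largest_tx = db.execute(
    "SELECT tx_hash FROM transfer WHERE token = ? ORDER BY amount DESC LIMIT 1",
    ("TOKEN_B",),
).fetchone()[0]

print(first_tx, largest_tx)
```

Neither question can be put to a node directly. Both become one-line queries once the events have been collected together with their block numbers and transaction hashes.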

How did developers index data in the past? They did it from scratch, like this:

– They write high-performance code in fast programming languages like Go, Rust, etc.

– They set up a database to store the data.

– They set up an API to access data from the application.

– They run an archive node of the blockchain.

– In the first step, they scan the entire blockchain, finding all transactions associated with a particular smart contract.

– They process these transactions by storing new objects and updating existing objects in the database.

– When they reach the current block (chain head), they need to switch to a more complex mode to process new transactions, because each new block (or even a chain of blocks) can be rejected due to a chain reorganization (which would lead to incorrect data in the database).

– If the chain was reorganized, they need to go back to the block where the chain forked and recalculate everything up to the new chain head.
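The last two steps, following the chain head and recovering from a reorganization, can be sketched as follows. This is a toy model under strong assumptions: FakeNode is an in-memory stand-in for a real node, blocks are identified by made-up string hashes, and the node's view does not change while sync() is running. A production indexer has to handle much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    number: int
    hash: str
    events: tuple  # decoded events of the tracked contract in this block


class FakeNode:
    """In-memory stand-in for a blockchain node; holds the canonical chain."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def head(self) -> int:
        return self.blocks[-1].number

    def get_block(self, number: int) -> Block:
        return self.blocks[number]


class Indexer:
    def __init__(self, node: FakeNode):
        self.node = node
        self.processed = []  # blocks applied so far, oldest first

    def _is_canonical(self, block: Block) -> bool:
        if block.number > self.node.head():
            return False
        return self.node.get_block(block.number).hash == block.hash

    def sync(self) -> None:
        # Roll back every stored block that is no longer on the canonical chain.
        # Real blocks are linked by parent hashes, so once the tip matches,
        # all of its ancestors match as well.
        while self.processed and not self._is_canonical(self.processed[-1]):
            self.processed.pop()
        # Re-apply canonical blocks from the fork point up to the chain head.
        for number in range(len(self.processed), self.node.head() + 1):
            self.processed.append(self.node.get_block(number))

    def events(self) -> list:
        return [event for block in self.processed for event in block.events]


def make_chain(labels):
    """Build a toy chain; each label doubles as the block's hash."""
    return [Block(n, label, (f"event@{label}",)) for n, label in enumerate(labels)]


node = FakeNode(make_chain(["b0", "b1", "b2", "b3"]))
indexer = Indexer(node)
indexer.sync()
before = indexer.events()

# Reorganization: blocks 2 and 3 are replaced by a different, longer branch.
node.blocks = make_chain(["b0", "b1", "b2x", "b3x", "b4x"])
indexer.sync()
after = indexer.events()
print(before, after)
```

The data derived from the abandoned blocks disappears and is recomputed from the fork point, which is exactly the "go back and recalculate" step described above.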

As you can see, this is not easy to develop or to maintain in real time, because every failure at the node level may require additional steps to bring the data back up to date. That is exactly why The Graph appeared. It is built on a simple idea: developers and end users need access to smart contract data without these kinds of problems.

The Graph project has defined a paradigm called “subgraph”, according to which, in order to extract data from a smart contract, you need to describe 3 things:

1. General parameters, such as the blockchain used, the address of the smart contract for indexing, the “events” processed and the initial block. These variables are defined in a so-called “manifest” file.

2. How to store data. What tables should be created in the database to store smart contract data? The answer will be found in the schema file.

3. How to collect data. What variables should be stored from “events”, what related data (e.g. transaction hash, block number, result of other method calls, etc.) should also be collected, and how it should be placed into the entities defined in the schema. This is described in the third file, the mapping.

These three things are neatly defined in the following three files:

  • subgraph.yaml – manifest file

  • schema.graphql – schema description

  • mapping.ts – mapping file, written in AssemblyScript
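To show how the three concerns fit together, here is a conceptual sketch in plain Python. It is deliberately not The Graph's actual file format or API: real subgraphs use YAML, GraphQL, and AssemblyScript, and every name below (MANIFEST, Token, handle_transfer, dispatch) is invented for the illustration.

```python
from dataclasses import dataclass

# 1. "Manifest": what to index. All values here are placeholders.
MANIFEST = {
    "network": "mainnet",
    "contract_address": "0x" + "00" * 20,
    "start_block": 100,
    "event_handlers": {"Transfer(address,address,uint256)": "handle_transfer"},
}


# 2. "Schema": the shape of the stored entities.
@dataclass
class Token:
    id: str
    owner: str
    updated_at_block: int
    last_tx_hash: str


STORE = {}  # entity id -> Token; stands in for the database


# 3. "Mapping": how one decoded event becomes (or updates) an entity.
def handle_transfer(event: dict) -> None:
    token_id = str(event["tokenId"])
    STORE[token_id] = Token(
        id=token_id,
        owner=event["to"],
        updated_at_block=event["blockNumber"],
        last_tx_hash=event["transactionHash"],
    )


HANDLERS = {"handle_transfer": handle_transfer}


def dispatch(signature: str, event: dict) -> None:
    """Route an event to its handler, as configured in the manifest."""
    if event["blockNumber"] < MANIFEST["start_block"]:
        return  # events before the start block are ignored
    handler_name = MANIFEST["event_handlers"].get(signature)
    if handler_name is not None:
        HANDLERS[handler_name](event)


SIG = "Transfer(address,address,uint256)"
dispatch(SIG, {"tokenId": 1, "to": "0xaaa", "blockNumber": 99, "transactionHash": "0x01"})
dispatch(SIG, {"tokenId": 1, "to": "0xbbb", "blockNumber": 150, "transactionHash": "0x02"})
dispatch(SIG, {"tokenId": 1, "to": "0xccc", "blockNumber": 200, "transactionHash": "0x03"})
print(STORE["1"])
```

The separation is the useful part: the manifest says which events matter, the schema says what is kept, and the handler is the only place where custom logic lives.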

Thanks to this standard, it is extremely easy to describe the entire indexing process by following any of these tutorials (they are in English for now; I plan to translate them if there is noticeable interest in this article, so leave an upvote or a comment below so that I know whether it is worth continuing):

More tutorials on the topic can be found here.

And here’s what it looks like:

As you can see here, The Graph is entirely responsible for indexing. However, you still need to run graph-node (open source software from The Graph), which actually turns out to be a more difficult task than just running a blockchain node (which itself tends to fall behind all the time). And here comes another paradigm shift.

In the past, developers ran their own blockchain nodes. They have been phasing that out and handing that care over to Web3 infrastructure providers. The Graph proposed another architectural simplification: a subgraph hosting platform, which works for the developer (“user” here) like this:

In this case, the user (or developer) does not have to run their own indexer or graph-node, but still controls all the indexing logic and avoids vendor lock-in, because different providers use the same The Graph description format (Chainstack, in particular, is fully compatible with The Graph, but it is worth checking this claim for your own Web3 infrastructure provider, if you have one). That makes a big difference, because it helps developers speed up development and lower maintenance costs.

But what’s also great about this paradigm is that any time a developer wants to make their application truly decentralized, they can seamlessly migrate to The Graph decentralized network using the same subgraphs.

What did I leave out of the story above?

  • As you may have noticed, The Graph uses GraphQL instead of a REST API. This gives users the flexibility to query any of the tables they create, and to join and filter them easily. There is a good video about how to master it. Also, ChatGPT can help write GraphQL queries, as I showed in this tutorial.

  • The Graph has its own hosted service for subgraphs, with a large number of ready-to-use subgraphs. It is free but, unfortunately, does not meet production requirements (reliability, SLA, support), because it is really a sandbox for the decentralized network. Synchronization is also slower than with paid solutions, but it can still be used for development. A tutorial on how to use ready-to-use subgraphs with Python can be found here.

  • If you’re going to be using subgraphs in production, I would recommend using an experienced Web3 infrastructure provider such as Chainstack to achieve cost effectiveness along with reliability and speed.

  • If you feel unsure about developing subgraphs but still want to master them, feel free to ask questions in this Telegram chat. I managed to gather several developers there who have expertise in this topic, and they answer questions from beginners.
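Returning to the GraphQL point from the list above, here is a sketch of how a subgraph query can be prepared from Python using only the standard library. The endpoint URL, the tokens entity, and its fields are hypothetical: they depend entirely on your own schema.graphql and on where the subgraph is deployed. The request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the query URL of your own subgraph.
SUBGRAPH_URL = "https://example.com/subgraphs/name/my-nft-subgraph"

# Hypothetical entity and fields; they must match your schema.graphql.
QUERY = """
query TokensByOwner($owner: String!, $limit: Int!) {
  tokens(
    first: $limit
    where: { owner: $owner }
    orderBy: updatedAtBlock
    orderDirection: desc
  ) {
    id
    owner
    updatedAtBlock
  }
}
"""


def build_request(owner: str, limit: int = 5) -> urllib.request.Request:
    """Prepare (but do not send) a GraphQL POST request."""
    payload = {"query": QUERY, "variables": {"owner": owner, "limit": limit}}
    return urllib.request.Request(
        SUBGRAPH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("0x" + "11" * 20)

# To actually run the query against a real endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["data"]["tokens"])
```

Filtering, ordering, and pagination are all expressed in the query itself, which is what makes GraphQL convenient compared with a fixed set of REST endpoints.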

    In general, if you have read up to here, and you would like me to translate other materials on data access on the blockchain or the development of smart contracts, please upvote and / or write in the comments what else would be interesting to read about.
