Apache Ignite transaction architecture

6 min


In this article, we will look at how transactions work in Apache Ignite… We will not dwell on the concept of Key-Value storage, but go straight to how this is implemented in Ignite. Let’s start with an overview of the architecture, and then illustrate the key points of transaction logic using tracing. With simple examples, you will see how transactions work (and for what reasons they may not work).

Aside Needed: Apache Ignite Cluster

The cluster in Ignite is a multitude of server and client nodes, where server nodes are combined into a logical structure in the form of a ring, and client nodes are connected to the corresponding server nodes. The main difference between client nodes and server nodes is that the former do not store data.


Data, from a logical point of view, belongs to partitions, which, in accordance with some affinity function, are distributed across nodes (more about data distribution in Ignite). The main (primary) partitions can be copies (backups).

How transactions work in Apache Ignite

The cluster architecture in Apache Ignite imposes a certain requirement on the transaction mechanism: data consistency in a distributed environment. This means that the data located on different nodes must change integrally from the point of view ACID principles. There are a number of protocols available to do what you want. Apache Ignite uses an algorithm based on two-phase commit, which consists of two stages:

  • prepare;
  • commit;

Note that, depending on transaction isolation level, the mechanism of taking locks and a number of other parameters, the details in the phases may change.

Let’s see how both phases take place using the following transaction as an example:

Transaction tx = client.transactions().txStart(PESSIMISTIC, READ_COMMITTED);
client.cache(DEFAULT_CACHE_NAME).put(1, 1);
tx.commit();

Prepare phase

  1. The node – the coordinator of transactions (near node in terms of Apache Ignite) – sends a prepare-message to the nodes containing primary-partitions for all the keys participating in this transaction.
  2. Nodes with primary partitions send a Prepare message to the corresponding nodes with backup partitions, if any, and grab the necessary locks. In our example, there are two backup partitions.
  3. Nodes with backup partitions send Acknowledge messages to nodes with primary patricians, which, in turn, send similar messages to the node coordinating the transaction.

Commit phase

After receiving confirmation messages from all nodes containing primary partitions, the transaction coordinator node sends a Commit message, as shown in the figure below.

A transaction is considered complete the moment the transaction coordinator has received all Acknowledgment messages.

From theory to practice

To consider the logic of a transaction, let’s turn to tracing.

To enable tracing in Apache Ignite, follow these steps:

  • Let’s enable the ignite-opencensus module and set OpenCensusTracingSpi as tracingSpi via cluster configuration:
    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="tracingSpi">
            <bean class="org.apache.ignite.spi.tracing.opencensus.OpenCensusTracingSpi"/>
        </property>
    </bean>
    

    or

    IgniteConfiguration cfg = new IgniteConfiguration();
    
    cfg.setTracingSpi(
        new org.apache.ignite.spi.tracing.opencensus.OpenCensusTracingSpi());
    

  • Let’s set some non-zero level of sampling transactions:
    JVM_OPTS="-DIGNITE_ENABLE_EXPERIMENTAL_COMMAND=true" ./control.sh --tracing-configuration set --scope TX --sampling-rate 1
    

    or

    ignite.tracingConfiguration().set(
                new TracingConfigurationCoordinates.Builder(Scope.TX).build(),
                new TracingConfigurationParameters.Builder().
                        withSamplingRate(SAMPLING_RATE_ALWAYS).build());
    

    Let’s dwell on a few points in more detail:

    • The tracing configuration belongs to the class of experimental API and therefore requires the flag
      JVM_OPTS="-DIGNITE_ENABLE_EXPERIMENTAL_COMMAND=true"
    • We set the sampling-rate to one, so all transactions will be sampled. This is justified for illustrative purposes of the material in question, but is not recommended for use in production.
    • Changing the tracing parameters, with the exception of setting SPI, has a dynamic nature and does not require restarting the cluster nodes. Below, in the corresponding section, the available settings will be discussed in more detail.

  • Let’s run a PESSIMISTIC, SERIALIZABLE transaction from a client node on a three node cluster.
    Transaction tx = client.transactions().txStart(PESSIMISTIC, SERIALIZABLE);
    client.cache(DEFAULT_CACHE_NAME).put(1, 1);
    tx.commit();

Let’s turn to the GridGain Control Center (a detailed overview of the tool) and take a look at the resulting spawn tree:

In the illustration, we can see that the transaction root span created at the beginning of the transactions (). TxStart call spawns two conditional span groups:

  1. The lock capture machine initiated by the put () operation:
    1. transactions.near.enlist.write
    2. transactions.colocated.lock.map with sub-stages
  2. transactions.commit, created when tx.commit () is called, which, as previously mentioned, consists of two phases – prepare and finish in Apache Ignite terms (the finish phase is identical to the commit phase in the classic two-phase commit terminology).

Let’s now take a closer look at the prepare-phase of a transaction, which, starting at the transaction coordinator node (near-node in Apache Ignite terms), produces transactions.near.prepare span.

Once on the primary partition, the prepare-request triggers the creation of transactions.dht.prepare span, within which prepare-requests are sent to the tx.process.prepare.req backups, where they are processed by tx.dht.process.prepare.response and sent back to the primary partition, which sends a confirmation message to the transaction coordinator, along the way creating a span tx.near.process.prepare.response. The Finish phase in this example will be similar to the prepare phase, which saves us from the need for detailed analysis.

By clicking on any of the spans, we will see the corresponding meta information:

So, for example, for the root transaction span, we see that it was created on the client node 0eefd.

We can also increase the granularity of transaction tracing by enabling tracing of the communication protocol.

Setting up tracing parameters

JVM_OPTS="-DIGNITE_ENABLE_EXPERIMENTAL_COMMAND=true" ./control.sh --tracing-configuration set --scope TX --included-scopes Communication --sampling-rate 1 --included-scopes COMMUNICATION

or

       ignite.tracingConfiguration().set(
           new TracingConfigurationCoordinates.Builder(Scope.TX).build(),
           new TracingConfigurationParameters.Builder().
               withIncludedScopes(Collections.singleton(Scope.COMMUNICATION)).
               withSamplingRate(SAMPLING_RATE_ALWAYS).build())

Now we have access to information about the transmission of messages over the network between cluster nodes, which, for example, will help answer the question of whether a potential problem was caused by nuances of network communication. We will not dwell on the details, we only note that the set of socket.write and socket.read spans are responsible for writing to the socket and reading one or another message, respectively.

Exception Handling and Crash Recovery

Thus, we see that the implementation of the distributed transaction protocol in Apache Ignite is close to the canonical one and allows you to obtain the proper degree of data consistency, depending on the selected transaction isolation level. Obviously, the devil is in the details and a large layer of logic remained outside the scope of the material analyzed above. So, for example, we have not considered the mechanisms of operation and recovery of transactions in the event of a fall of the nodes participating in it. We will fix this now.

We said above that in the context of transactions in Apache Ignite, three types of nodes can be distinguished:

  • Transaction coordinator (near node);
  • Primary node for the corresponding key (primary node);
  • Nodes with backup key partitions (backup nodes);

and two phases of the transaction itself:

  • Prepare;
  • Finish;

Through simple calculations, we will get the need to process six options for node crashes – from a backup fall during the prepare-phase to a fall of the transaction coordinator during the finish-phase. Let’s consider these options in more detail.

Falling backup both on prepare and on finish-phases

This situation does not require any additional action. The data will be transferred to the new backup nodes independently as part of the rebalance from the primary node.

Falling primary-node in the prepare-phase

If there is no risk of receiving inconsistent data, the transaction coordinator throws an exception. This is a signal to transfer control to make a decision to restart the transaction or another way to resolve the problem to the client application.

Fall of the primary node in the finish phase

In this case, the transaction coordinator waits for additional NodeFailureDetection messages, after receiving which it can decide on the successful completion of the transaction, if the data was written on backup-partitions.

Fall of the transaction coordinator

The most interesting case is loss of transaction context. In such a situation, the primary and backup nodes directly exchange the local transactional context with each other, thereby restoring the global context, which allows a decision to be made to verify the commit. If, for example, one of the nodes reports that it did not receive a Finish message, the transaction will be rolled back.

Summary

In the above examples, we examined the flow of transactions, illustrating it using tracing, which shows the internal logic in detail. As you can see, the implementation of transactions in Apache Ignite is close to the classic concept of two-phase commit, with some tweaks in the area of ‚Äč‚Äčtransaction performance related to the mechanism of taking locks, features of recovery after failures and transaction timeout logic.


0 Comments

Leave a Reply