[API Patterns] Asynchrony and Time Management

This is chapter 19 of my API book. v2 will contain three new sections: API Patterns, HTTP APIs and REST, SDKs and UI Libraries. If this work was useful to you, please rate the book on GitHub, Amazon, or GoodReads. The English version is available on Substack.

Let’s continue with the previous example. Suppose the application retrieves some state of the system at startup, and that state may not be the most up-to-date. What else determines the probability of collisions, and how can we reduce it?

Recall that this probability is equal to the ratio of the period of time required to obtain the current state to the typical period of time during which the user restarts the application and repeats the order. We practically cannot influence the denominator of this fraction (unless we deliberately introduce a delay in the initialization of the API, which we still consider an extreme measure). Let’s turn now to the numerator.

Our use case looks like this:

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  const order = await api
    .createOrder(…);
}
// The application crashes here,
// and the same operations are
// executed again
const pendingOrders = await api.
  getOngoingOrders(); // → []
if (pendingOrders.length == 0) {
  const order = await api
    .createOrder(…);
}

Thus, we aim to minimize the following time interval: the network delay of transmitting the createOrder command + the execution time of createOrder + the time to propagate the changes to the replicas. The first is again beyond our control (though, fortunately, we may hope that network delays within a session are more or less constant, and thus the subsequent getOngoingOrders call will be delayed by about the same amount). The third will most likely be handled by our backend infrastructure. Let’s now talk about the second component.

We see that if creating the order itself takes a very long time (where “very long” means “comparable to the application launch time”), then all our previous efforts are practically useless. The user may get tired of waiting for the createOrder call to complete, reload the application, and send a second (or third) createOrder. It is in our interest to prevent this from happening.

But how can we actually improve this time? After all, creating an order can genuinely be lengthy: we need to perform numerous checks and wait for the payment gateway to respond, for the coffee shop to confirm accepting the order, etc.

This is where asynchronous calls come to the rescue. If our goal is to reduce the number of collisions, there is no need to wait until the order is actually created; our goal is to spread the knowledge that the order has been accepted for creation across the replicas as quickly as possible. We can do the following: create not an order but a job to create an order, and return its identifier.

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  // Instead of creating an order,
  // we put a task to create one
  const task = await api
    .putOrderCreationTask(…);
}
// The application crashes here,
// and the same operations are
// executed again
const pendingOrders = await api.
  getOngoingOrders(); 
  // → { tasks: [task] }

Here we assume that creating a task requires minimal checks and does not wait for any lengthy operations to complete, and is therefore much faster. Moreover, we may delegate this operation (creating an asynchronous task) to a separate abstract task service within the backend. Meanwhile, having the functionality of creating tasks and listing current tasks, we significantly narrow the “gray zones” of uncertainty in which the client cannot know the current server state for sure.

Thus, we naturally arrive at the pattern of organizing an asynchronous API through job queues. We use the term “asynchrony” here in the logical sense, implying the absence of mutual logical locks: the sending side receives a response to its request immediately, without waiting for the requested functionality to complete, and can continue interacting with the API while the operation is in progress. At the same time, technically, in modern systems blocking of the client (or server) almost never occurs when accessing synchronous endpoints either; however, logically continuing to work with the API without waiting for a response to a synchronous request can lead to collisions like those described above.
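As a sketch of this pattern from the client’s side, the flow might look as follows. The `putOrderCreationTask`/`getTask` names are the ones used in this chapter’s examples, while the in-memory `api` stub and the polling interval are our assumptions for illustration:

```javascript
// A toy in-memory stub standing in for the backend;
// in reality these would be network calls.
const api = (() => {
  const tasks = new Map();
  let nextId = 1;
  return {
    // Creating a task is cheap: minimal checks, no waiting
    // for payment gateways or the coffee shop.
    async putOrderCreationTask(params) {
      const id = String(nextId++);
      tasks.set(id, { id, status: "pending", params });
      // The heavy work finishes later, out of band.
      setTimeout(() => {
        tasks.get(id).status = "done";
      }, 10);
      return { id };
    },
    async getTask(id) {
      return tasks.get(id);
    },
  };
})();

// The client polls the task until it reaches a final state.
async function waitForTask(id, intervalMs = 5) {
  for (;;) {
    const task = await api.getTask(id);
    if (task.status === "done" || task.status === "failed") {
      return task;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The key property is that `putOrderCreationTask` returns immediately, so the “gray zone” is reduced to the short interval needed to register the task.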

The asynchronous approach can be used not only to eliminate collisions and uncertainty, but also to solve other applied problems:

  • organizing links to the results of the operation and caching them (if the client needs to re-read the result of the operation or share it with another agent, it may use the task identifier to do so);

  • ensuring the idempotency of operations (for this we need to introduce task confirmation, effectively arriving at the operation-draft scheme described in the “Describing Final Interfaces” chapter);

  • natively providing resilience to temporary surges in the load on the service: new tasks are queued (possibly with prioritization), effectively implementing a “token bucket”;

  • organizing interaction in cases when the operation execution time exceeds reasonable values (in the case of network APIs, typical network timeouts, i.e., tens of seconds) or is unpredictable.
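To illustrate the load-smoothing point above, here is a minimal token bucket sketch. The class name, parameters, and the injectable clock are our assumptions, not part of any API discussed in this chapter:

```javascript
// Minimal token bucket: at most `capacity` tasks may be
// admitted in a burst; tokens refill at `refillPerSecond`.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.now = now; // injectable clock, useful for testing
    this.lastRefill = now();
  }

  refill() {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond
    );
    this.lastRefill = t;
  }

  // Returns true if the task may be processed right away;
  // otherwise the caller keeps it in the queue for later.
  tryAdmit() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Tasks rejected by `tryAdmit` are not lost: they simply stay queued, which is exactly how the job queue absorbs a load surge.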

In addition, asynchronous interaction is more convenient from the standpoint of future API development: the design of the system processing such requests may evolve towards a more complex and longer task execution pipeline, while synchronous functions would have to fit within reasonable time frames to remain synchronous, which, of course, limits the ability to refactor the internal mechanics.

NB: sometimes you can find a solution in which the endpoint has a dual interface and can return either the result or a link to the task execution. Though it may look logical to you as an API developer (if the request could be executed “quickly”, e.g., the result was retrieved from a cache, the response is returned directly; if not, a link to the task is returned), this solution is extremely inconvenient for API users, since it forces them to maintain two code branches at the same time. There is also a paradigm of providing the developer with two sets of endpoints, synchronous and asynchronous, to choose from, but in fact this simply shifts the responsibility to the partner.
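To see why such a dual interface burdens the consumer, here is roughly what the partner’s code would have to look like. The response shape (`order` vs. `task_id`) and the `getTask` endpoint are hypothetical names chosen for illustration:

```javascript
// With a dual-interface endpoint, the partner must handle
// both response shapes on every single call.
async function createOrderDual(api, params) {
  const response = await api.createOrder(params);
  if (response.order !== undefined) {
    // Fast path: the result came back synchronously.
    return response.order;
  }
  // Slow path: only a task reference came back, so a second,
  // polling branch of logic is required.
  return await pollUntilDone(api, response.task_id);
}

// The polling branch the partner is forced to maintain.
async function pollUntilDone(api, taskId, intervalMs = 5) {
  for (;;) {
    const task = await api.getTask(taskId);
    if (task.status === "done") {
      return task.result;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Both branches must be written, tested, and kept working, even though from the partner’s point of view only one thing is happening: an order is being created.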

The popularity of this pattern is also due to the fact that many modern microservice architectures interact asynchronously “under the hood” as well, either through event streams or through asynchronous job submission. Implementing similar asynchrony in the external API is the easiest way to work around the arising problems (namely, the same unpredictable and possibly very large delays in executing operations). It goes as far as some APIs making absolutely all operations asynchronous (including reading data), even when there is no need for it.

However, we cannot fail to note that, despite its attractiveness, ubiquitous asynchrony entails a number of rather unpleasant problems.

  1. If a single queue service is used for all endpoints, it becomes a single point of failure: if events fail to be published and/or processed in time, all endpoints suffer execution delays. If, conversely, a separate queue service is set up for each functional domain, the internal architecture becomes much more complex, and the cost of monitoring and fixing problems increases.

  2. Writing code for a partner becomes much more difficult. It is not even a matter of the physical amount of code (after all, creating a shared component for interacting with the job queue is not that hard), but that now, for every call, the developer must ask themselves: what happens if its processing takes a long time? If, in the case of synchronous endpoints, we assume by default that they complete within some reasonable time, less than the typical request timeout (so in client applications we can simply show the user a spinner), then in the case of asynchronous endpoints we simply have no such guarantee, and none can be given.

  3. Using job queues can introduce problems of its own, unrelated to the actual processing of the request:

    • the task may be “lost”, i.e. never be processed;

    • status change events may arrive in the wrong order and/or be repeated, which may affect public interfaces;

    • incorrect data (corresponding to another task) may be placed under the task ID by mistake, or the data may be corrupted.

    These situations can be completely unexpected for developers and lead to extremely difficult-to-reproduce bugs in applications.

  4. As a consequence of the above, the question of the meaningfulness of the SLA for such a service arises. Through asynchronous tasks you can easily raise the API uptime to 100%; it is just that some requests will be completed in a couple of weeks, when the support team finally finds the reason for the delay. But such guarantees are, of course, entirely useless to your API consumers: their users typically want the task completed now, or at least within a reasonable time, not in two weeks.

Therefore, for all the attractiveness of the idea, we still tend to recommend limiting asynchronous interfaces to cases where they are really critical (as in the example above, where they reduce the probability of collisions), and having separate queues for each such case. The ideal solution with queues is one that is embedded in the business logic and does not look like a queue at all. For example, nothing prevents us from declaring the “order creation task accepted and awaiting execution” state a separate order status, and making its identifier the identifier of the future order:

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  // We don't call this a «task»,
  // we just create an order
  const order = await api
    .createOrder(…);
}
// The application crashes here,
// and the same operations are
// executed again
const pendingOrders = await api.
  getOngoingOrders(); 
  /* → { orders: [{
    order_id: <task identifier>,
    status: "new"
  }]} */

NB: note also that with the asynchronous interaction format you can communicate not only the binary status (task completed or not) but also, if possible, the operation progress as a percentage.
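A status payload carrying progress, and a client making use of it, might look like this (the field names and the rendering helper are our assumptions for illustration):

```javascript
// A task status payload exposing progress alongside the state.
const taskStatus = {
  task_id: "some-task-id", // placeholder identifier
  status: "in_progress",
  progress: 42, // share of work done, from 0 to 100
};

// Knowing the progress lets the client render a determinate
// progress bar instead of an indeterminate spinner.
function renderProgress({ status, progress }) {
  return status === "in_progress"
    ? `Creating your order: ${progress}%`
    : status;
}
```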
