How We Created an API That Has Been Developing for 10 Years Without Dropping Backward Compatibility

Hi! My name is Vadim Kleba, and I lead the backend development team at Telemost. For the last nine years, I have been developing high-load distributed systems. Before that, I built a search-as-a-service solution with efficient, relevance-aware full-text search.

In this article, I will tell you how Yandex 360 has been building an API for ten years without breaking backward compatibility – an API that now withstands hundreds of thousands of RPS. You will also learn which approaches we laid down at the start so that our API could live this long.

What we will talk about in the article

In 2014, Yandex Mail, Yandex Disk, Yandex Messenger existed as separate services. In 2021, we combined them into a single digital solution for teams – Yandex 360. But when we started designing our API Gateway in 2014, it was intended only for Yandex Disk.

API Gateway is a pattern in microservice architecture that provides a single entry point to an API. An API Gateway accepts, processes, and routes requests, and it is a convenient place to build traffic-control systems such as rate limiters, caches, and so on.

In this article, I will show how it has evolved over ten years of active use without losing its relevance. Note that this is a story about our path, not a hardcore tutorial.

API Gateway: Capabilities and Limitations

Our API Gateway is called CloudAPI. All incoming requests from our users and third-party systems go through it.

API Gateway encapsulates knowledge about the backend system. It is similar to the facade design pattern, but applied at a different scale.

Since all requests pass through it, it is a convenient place to monitor them. It is also a natural place to restrict access to handles. By the way, at Yandex we call endpoints (handlers) "handles".

API Gateway is not a silver bullet. Like any other service, it has its downsides:

  • Increased response time. API Gateway is one more system that has to accept and process every request, so your per-request latency increases.

  • Single point of failure. Everything behind API Gateway becomes unavailable if it goes down.

  • Additional costs. It is one more system that needs to be deployed, supported, and maintained.

Chronicles of API Gateway development in Yandex 360

2014: Created a new API based on the domain model

In 2014, our team had a task: to create a new public Disk API. At that time, REST was the buzzword. No one quite knew how to do it right, but everyone really wanted to learn.

At that time, we already had a public API in the form of the WebDAV API, but we were outgrowing it. WebDAV is a protocol for managing a remote file system, and Yandex Disk is a cloud service that goes beyond the boundaries of ordinary storage. You can read more about why we did not stick with the WebDAV API in the article «New REST API of Yandex.Disk and Polygon. And also why Disk needs another API and how we made it».

Richardson maturity model

We started designing our API inspired by Martin Fowler's article “The Richardson Maturity Model”, in which he describes a model from the 2010 book “REST in Practice”. It divides Web APIs into four levels of maturity:

Level #0. There is one HTTP method and one URL, and the request content defines what we want from the server. HTTP status codes are not really processed: code 200 is enough, and error descriptions arrive in the response body itself. This is roughly how RPC works.

Level #1. At this level, resources appear: the function or entity name moves from the request body into the URL itself, but we still use a single HTTP method.

Level #2. At this level, HTTP methods carry the actions on resources: GET, POST, PUT. We also begin to process HTTP status codes.

Level #3. HATEOAS. In addition to the useful payload returned in the response, we also return links for hypermedia transitions. This is the level we chose.

We took the HAL standard and implemented support for it in our framework. You can still see handles with HAL support in the Disk REST API. However, it quickly became clear that users did not really need HAL, while supporting it required extra effort. So we quickly abandoned it and rolled back to level #2.

We didn't just design the API. Even then we knew that we would have other services besides Disk, so we built a service that we called CloudAPI. We didn't know yet that it would become our future API Gateway. We also immediately added OpenAPI spec generation, because we wanted our API to be machine-readable so that we could generate things from the spec.

Back then it was the Swagger 1.2 standard. We even worked a little in the working group that was shaping the standard.

So what does our API look like now? Since we work with the user's Disk, we have files and folders. Files and folders are our domain models; we do not separate them in any way – for us they are simply resources. This became the first abstraction we use to design our API. We also have methods such as copy and move. These operations are long, so we perform them asynchronously: as soon as you start a copy, we immediately return an operation identifier, and you can track the operation's status and progress by polling a separate handle. The operation became the second abstraction we used.
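
The operation abstraction can be sketched in a few lines. This is an illustrative in-memory model, not our actual implementation: `start_copy`, `operation_status`, and the in-process registry are hypothetical names.

```python
import threading
import time
import uuid

# Illustrative in-memory registry of long-running operations.
OPERATIONS = {}

def start_copy(src, dst):
    """Kick off an asynchronous copy and immediately return an operation id."""
    op_id = str(uuid.uuid4())
    OPERATIONS[op_id] = {"status": "in-progress"}

    def worker():
        time.sleep(0.05)  # stands in for the actual copy work
        OPERATIONS[op_id]["status"] = "success"

    threading.Thread(target=worker).start()
    return op_id  # the client gets the id right away

def operation_status(op_id):
    """The separate 'operation' handle the client polls for progress."""
    return OPERATIONS[op_id]["status"]

op = start_copy("disk:/a.txt", "disk:/b.txt")
print(operation_status(op))  # "in-progress" right after the call returns
time.sleep(0.5)
print(operation_status(op))  # "success" once the background work is done
```

The key point is that the handle never blocks on the long operation itself: the client always gets a quick answer and decides how often to poll.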

However, our handles became verbose and heavy. Our clients and front-end developers no longer needed the whole response, only specific parts of it.

How can this problem be solved? We could create new handles for each specific client and task that return exactly the required response. But we did not want to grow yet another API.

So we made a global parameter on our gateway, which we called fields. It controls the content of the output: in the request, you pass a query parameter listing the desired fields separated by commas, and the gateway, while forming the response, filters out all fields that were not requested.


We get a big response:

GET /disk/resources?path=%2F

{
  "_embedded": {
    "sort": "",
    "items": [
      {
        "name": "CodeFest",
        "exif": {},
        "created": "2024-01-18T10:54:10+00:00",
        "resource_id": "1584921471:29e4d973fa7a12a0deba0c3033ee2302f9b948207cf92abceead57e081cc569b",
        "path": "disk:/CodeFest",
        "comment_ids": {
          "private_resource": "1584921471:29e4d973fa7a12a0deba0c3033ee2302f9b948207cf92abceead57e081cc569b",
          "public_resource": "1584921471:29e4d973fa7a12a0deba0c3033ee2302f9b948207cf92abceead57e081cc569b"
        },
        "type": "dir",
        "revision": 1660820050819410
      }
    ],
    "limit": 20,
    "offset": 0,
    "path": "disk:/",
    "total": 33
  },
  "name": "disk",
  "exif": {},
  "resource_id": "1584921471:7f8fc90be644c00fed5285798e97e874d295aaa0e8e0fe99ecf7ca412e5f3fe1",
  "created": "2012-04-04T20:00:00+00:00",
  "modified": "2012-04-04T20:00:00+00:00",
  "path": "disk:/",
  "comment_ids": {},
  "type": "dir",
  "revision": 1647409007892806
}

What we need:

GET /disk/resources?path=%2F&fields=_embedded.items.name

{
  "_embedded": {
    "items": [
      {
        "name": "CodeFest"
      }
    ]
  }
}
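
A minimal sketch of how such a fields filter can work on the gateway side. The names and the merge strategy here are illustrative, not the real implementation:

```python
def filter_fields(payload, fields):
    """Keep only the comma-separated dotted paths in `fields`.

    Lists are traversed element-wise, so "_embedded.items.name" keeps
    the `name` field of every element of `items`. Overlapping paths are
    merged naively (last branch wins) — a simplification of the real thing.
    """
    paths = [f.split(".") for f in fields.split(",")]

    def pick(node, path):
        if not path:
            return node
        head, rest = path[0], path[1:]
        if isinstance(node, list):
            return [pick(item, path) for item in node]
        if isinstance(node, dict) and head in node:
            return {head: pick(node[head], rest)}
        return {}

    result = {}
    for path in paths:
        result.update(pick(payload, path))
    return result

response = {
    "_embedded": {"items": [{"name": "CodeFest", "type": "dir"}], "total": 33},
    "name": "disk",
}
print(filter_fields(response, "_embedded.items.name"))
# {'_embedded': {'items': [{'name': 'CodeFest'}]}}
```

Because the filtering happens once at the gateway, every backend behind it gets this capability for free.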

2015: Learned to execute several sub-requests in one request. Created the Batch API

We started making a new feature – notifications on the web: when you upload a file, a separate window appears with the upload status.

What does the feature look like?

When you upload a photo to a folder, we render two thumbnails: a large preview shown for the file in the folder, and a small cropped thumbnail shown in the separate notification window. Thumbnails can be requested via a meta-information request.

How can we implement this?

Method No. 1. We could make a custom handle, but we won't: we need the logic to be the same across our different platforms.

Method No. 2. We could modify the existing handle to return multiple previews, passing the preview settings separated by commas or passing multiple values separated by ampersands. This won't work: we would not be able to match each response to what we asked for, and not all web servers handle multiple assignments of the same parameter separated by ampersands.

Method No. 3. Make one request in which the client specifies which sub-requests to execute. The client lists the sub-requests in the request body; at the gateway, we dispatch these requests, collect them into a single response, and send it back to the client. Profit!

That's how we came up with the Batch API. Now our clients can do all sorts of combinations to get something in one request.

// Request

{
  "items": [
    {
      "method": <HTTP method>,
      "relative_url": <relative URL>,
      "headers": {
        <header name>: <value>
      }  // optional
    },
    ...
  ]
}

// Response

{
  "items": [
    {
      "code": <HTTP response code>,
      "body": <response body as a string>,
      "headers": <response headers>
    },
    ...
  ]
}

How does it work under the hood? We have batch processors. When a request arrives at the Batch API, we select the required batch processor. If there is no custom batch processor, we use the default one, which executes the sub-requests in parallel, collects the responses, and sends them back.
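
The default batch processor can be sketched like this. The `execute` stub and all names are illustrative; a real processor would proxy each sub-request to the backend:

```python
from concurrent.futures import ThreadPoolExecutor

def execute(item):
    # Stub dispatcher: a real processor would proxy item["method"] and
    # item["relative_url"] to the backend and return its actual response.
    return {"code": 200, "body": f"echo {item['relative_url']}", "headers": {}}

def process_batch(batch):
    """Run all sub-requests in parallel, preserving their original order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map returns results in input order even though the
        # calls themselves run concurrently.
        return {"items": list(pool.map(execute, batch["items"]))}

batch = {"items": [
    {"method": "GET", "relative_url": "/v1/disk/resources?path=%2F"},
    {"method": "GET", "relative_url": "/v1/disk/resources?path=%2Fphotos"},
]}
print(len(process_batch(batch)["items"]))  # 2
```

Order preservation is what lets the client match each response to the sub-request it sent, which was exactly the problem with the ampersand approach.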

We needed this functionality later.


We have a technology called DataSync. It is a synchronization mechanism that can resume syncing from any position in the data array. For example, you save a bookmark in Yandex Browser, open your phone, and this bookmark appears there.

In addition to the Browser, the Yandex home page used this functionality. Back in the days when widgets were drawn on the home page, it took widget information from there; so that these widgets were displayed the same way on all devices, the home page saved their settings in DataSync. It also took your work and home addresses to show how long it would take to get home or to work.

For the main page to open quickly, every request it makes must fit within 100 milliseconds. The existing request did not fit within this budget.

So what did we do? We wrote a custom batch processor. We made an optimized handle in DataSync that accepts multiple sources to fetch data from, runs optimal queries, and pulls the data from the database. When such a batch request arrives, we recognize that we need the custom batch processor, query our optimized handle, and assemble the response as if there had been two separate queries. That is how we started meeting the time budget. Our users did not notice that anything had changed – they still think we run two queries asynchronously.

2016: The Era of Smart Tapes and the Emergence of Chaining Processors

In 2016, applications began to focus on engagement, and many started implementing smart feeds. Yandex Disk was no exception: we implemented a feed that shows information about files you recently uploaded and your recent photos.

But here we encountered a problem: on a bad internet connection, this feed did not load, because it consisted of blocks. One of these blocks was a selection of recent photos. These selections contained file identifiers, and we needed the pictures themselves, not the identifiers – ideally in one request. How can this problem be solved?

You could make a custom handle… but we need the same logic on all platforms. So we took our experience building the Batch API, modified it a little, and got the Chaining API.

Example request:

{
  "method": "GET",
  "relative_url": "/v1/disk/resources?path=%2F",
  "subrequests": [
    {
      "method": "GET",
      "relative_url": "/v1/disk/resources?path={body.items.0.name}"
    },
    {
      "method": "GET",
      "relative_url": "/v1/disk/resources?path={body.items.1.name}&fake={headers.Content-Length}"
    }
  ]
}

In one request, we can build a chain of calls and use the content of the previous response to execute the next request. Under the hood, the Chaining API works the same way as the Batch API – via chaining processors. We have a default chaining processor that executes requests sequentially, assembles a single response, and sends it to the client. In ten years there has not yet been a case where we needed a custom chaining processor, but we believe its time will come.

So we got a universal mechanism for building a chain of calls within a single request. It reduces round trips for mobile clients, which now work better on a poor internet connection.
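
The placeholder substitution at the heart of chaining can be sketched as follows. The `execute` stub and the function names are illustrative (and this sketch skips details such as URL-encoding of substituted values):

```python
import re

# Matches {body.path.to.field} placeholders in a sub-request URL.
PLACEHOLDER = re.compile(r"\{body\.([^}]+)\}")

def execute(method, relative_url):
    # Stub dispatcher: the real gateway would proxy this to a backend.
    return {"body": {"items": [{"name": "CodeFest"}, {"name": "photos"}]}}

def resolve(url, parent_body):
    """Substitute placeholders with values from the parent response body."""
    def lookup(match):
        node = parent_body
        for part in match.group(1).split("."):
            # numeric segments index into lists, others into dicts
            node = node[int(part)] if part.isdigit() else node[part]
        return str(node)
    return PLACEHOLDER.sub(lookup, url)

def process_chain(request):
    """Default chaining processor: run the root, then each sub-request in order."""
    parent = execute(request["method"], request["relative_url"])
    responses = [parent]
    for sub in request.get("subrequests", []):
        url = resolve(sub["relative_url"], parent["body"])
        responses.append(execute(sub["method"], url))
    return responses

chain = {
    "method": "GET",
    "relative_url": "/v1/disk/resources?path=%2F",
    "subrequests": [
        {"method": "GET",
         "relative_url": "/v1/disk/resources?path={body.items.0.name}"},
    ],
}
print(resolve(chain["subrequests"][0]["relative_url"],
              execute("GET", chain["relative_url"])["body"]))
# /v1/disk/resources?path=CodeFest
```

Sequential execution is essential here: each sub-request may depend on the body of the one before it, so the parallelism of the Batch API does not apply.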

2017–2019: How We Made a Universal Mechanism for Delivering Large Content

The pattern of application usage has been changing since 2014. New phones with better cameras come out, there is not enough space on the phone, and people use cloud storage more actively, saving their files there and synchronizing this data with the cloud.

Because of this, a full download of a snapshot (a copy of the database state at the time of synchronization with the cloud) began to take significant time. It is a heavy operation for the client, which has to download a large amount of data; if the connection drops partway, all of it has to be downloaded again. It is also heavy for the backend: besides the user's own files, some files and folders may have been shared with them, and we also have to fetch that data from other users. How can this problem be solved?

For the client to receive all the information without interruption, it has to be delivered in pieces. A page/offset/limit solution immediately comes to mind. It would work, but we have to remember that we have mobile clients, with a tail of old versions that lasts about two years: if we change anything in the export, we break the old clients. We could make a new handle… but that is not our way. We want to manage this synchronization from the backend.

So we made an iteration key. This is a key that we generate on the backend and return with each response, and the client sends it back to get the next batch of data. The key itself encodes the current export progress: the backend receives the key, decrypts it, figures out which batch of data to serve next, generates that batch along with a new key, and returns them to the client. Yes, the key is encrypted. If you put anything human-readable into such a key, know that the frontend will parse it, use it in ways you never intended, and your synchronization will break.

Request:

GET /v1/disk/snapshot

Response:

{
  items: …,
  iteration_key: "SOME_VALUE"
}

Request:

GET /v1/disk/snapshot?iteration_key=SOME_VALUE
What does this approach give us? The client iteratively receives the key and sends it back to get the next portion of data, while on the backend we get fully managed synchronization: if we need to change how data is synchronized, we simply start issuing a new kind of key and do not break our old clients.
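
An opaque cursor like this can be sketched with the standard library. Note the simplification: our real gateway encrypts the key, while this sketch only signs it with HMAC so clients cannot forge or safely parse it; the secret and field names are illustrative.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"gateway-secret"  # illustrative; the real key lives server-side only

def make_key(state):
    """Serialize progress state, sign it, and return an opaque base64 token."""
    payload = json.dumps(state).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()

def read_key(key):
    """Verify the signature and recover the progress state."""
    raw = base64.urlsafe_b64decode(key)
    sig, payload = raw[:32], raw[32:]  # sha256 digest is 32 bytes
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered iteration key")
    return json.loads(payload)

key = make_key({"offset": 40, "shard": 3})
print(read_key(key))  # {'offset': 40, 'shard': 3}
```

Because the state format lives entirely on the backend, its contents can change at any time: old clients keep echoing whatever token they last received, and the backend decides what it means.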

So we got a universal mechanism for delivering large content, which we called iteration_key.

What's the bottom line?

After 10 years, the Yandex Disk API has grown — what you see in the public domain is just the tip of the iceberg. The number of handles has grown, and these are not ordinary CRUD handles, but handles with complex logic. Our CloudAPI has acquired functionality that can be used not only by the Yandex Disk service, but also by other Yandex 360 products — Telemost, Billing, Calendar. And all our new services in Yandex 360 immediately begin to use our API Gateway.

To give a sense of the scale: we started with ten thousand RPS, and now we handle hundreds of thousands of RPS daily.

We are actively continuing to develop our API Gateway – we keep adding new functionality there, and the approaches we formed at the start have helped our API survive for many years without a rewrite.

So what has allowed us to survive so long?

An API is a domain model. When you design an API, you should model the real world without regard to platforms, clients, frameworks, and other factors, because reality changes evolutionarily, not revolutionarily. Versions of frameworks, clients, and languages should not affect your design in any way.

The Yandex Disk API has outlived three web redesigns, the desktop software, and two generations of mobile clients. In its lifetime, Windows Phones appeared and then died. This proves that clients are changeable, but the subject area is not.

An API should fit into a universal, long-lived, widely used standard. For us, that is HTTP, because it works the same on mobile, desktop, TV, and other devices. Packaging an API within HTTP gives it the same universality as HTTP itself.

That's it. Build your API to be antifragile!
