The story of building a cloud service, with a cyberpunk flavor

As you gain experience in IT, you begin to notice that systems have characters of their own. They can be compliant, silent, eccentric or severe. They can be attractive or repulsive. One way or another, you have to “negotiate” with them, steer around the pitfalls and build chains of interaction between them.

We had the honor of building a cloud platform, and to do that we had to “persuade” a couple of subsystems to work with us. Fortunately, we have the “language of APIs”, capable hands and plenty of enthusiasm.

This article does not go into hardcore technical detail, but it does describe the problems we ran into while building the cloud. I decided to tell our story as a light technical fantasy about how we looked for a common language with these systems and what came of it.

Welcome below the cut.

The beginning of the journey

Some time ago, our team was tasked with launching a cloud platform for our customers. We had management support, resources, a hardware stack, and the freedom to choose the technologies for the software part of the service.

There were also a number of requirements:

  • the service needs a convenient personal account;
  • the platform must be integrated into the existing billing system;
  • hardware and software platform: OpenStack + Tungsten Fabric (OpenContrail), which our engineers have learned to handle quite well.

How the team came together, how the personal account interface was built and how the design decisions were made is a story for another time, if the Habr community is interested.

The tools we decided to use:

  • Python + Flask + Swagger + SQLAlchemy: a completely standard Python stack;
  • Vue.js for the frontend;
  • Celery on top of AMQP for interaction between components and services.

Anticipating questions about why Python, I’ll explain. The language has found its place in our company, and a small but real culture has grown up around it, so we decided to build the service in it. Besides, in tasks like this, speed of development is often the deciding factor.

So, let’s get acquainted.

Silent Bill – Billing

We have known this guy for a long time. He always sat nearby, silently counting something. Sometimes he forwarded user requests to us, issued invoices to clients and managed services. An ordinary hard-working guy. True, there were difficulties: he is silent, sometimes lost in thought, and often keeps his own counsel.

Billing is the first system we tried to make friends with. The first difficulty we ran into was in how it processes services.

For example, when a service is created or deleted, a task lands in the internal billing queue. This is how asynchronous work with services is implemented. To process our own types of services, we needed to “put” our tasks into this queue. And here we hit a problem: lack of documentation.

Judging by the API description, this problem can still be solved, but we had no time for reverse engineering, so we moved the logic out of billing and set up our own task queue on top of RabbitMQ. An operation on a service is initiated by the client from the personal account, becomes a Celery “task” on the backend, and is carried out on the billing and OpenStack side. Celery makes it convenient to manage tasks, set up retries and monitor status. More information about Celery can be found, for example, here.
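The idea of a task that is retried and whose status can be monitored can be sketched in plain Python. This is only an illustration of the pattern, not Celery's API and not the actual implementation; `TaskStatus`, `TaskResult`, `run_with_retries` and `make_flaky_step` are hypothetical names introduced for the example.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class TaskStatus(Enum):
    PENDING = "pending"
    RETRYING = "retrying"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class TaskResult:
    status: TaskStatus = TaskStatus.PENDING
    attempts: int = 0
    history: List[TaskStatus] = field(default_factory=list)


def run_with_retries(step: Callable[[], None], max_retries: int = 3,
                     delay: float = 0.0) -> TaskResult:
    """Run `step`; on an exception retry up to `max_retries` more times."""
    result = TaskResult()
    for attempt in range(1, max_retries + 2):
        result.attempts = attempt
        try:
            step()
        except Exception:
            last_attempt = attempt == max_retries + 1
            result.status = (TaskStatus.FAILURE if last_attempt
                             else TaskStatus.RETRYING)
            result.history.append(result.status)
            if last_attempt:
                break
            time.sleep(delay)  # wait before the next attempt
        else:
            result.status = TaskStatus.SUCCESS
            result.history.append(result.status)
            break
    return result


def make_flaky_step(failures: int) -> Callable[[], None]:
    """Hypothetical backend call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def step() -> None:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("backend is not ready yet")

    return step


if __name__ == "__main__":
    result = run_with_retries(make_flaky_step(2), max_retries=3)
    print(result.status, result.attempts)
```

In a real deployment the queue, the retry policy and the status storage would be provided by Celery and the broker; the sketch only shows the state transitions a client of such a task would observe.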

Billing also did not stop a project whose money had run out. Talking to the developers, we found out that with statistics-based charging (and that is exactly the logic we need) the stopping rules are interrelated in complex ways. These models do not fit our realities well, so we implemented this through Celery tasks as well, moving the service management logic to the backend.

Both of these problems left the code somewhat bloated, and in the future we will have to refactor it to move the task-handling logic into a separate service. To support this logic we also have to store some of the information about users and their services in our own tables.

Another problem is silence.

Billy silently answers “OK” to some API requests. That is what happened, for example, when we credited promised payments for the duration of the test (more on that later). The requests were reported as executed correctly and we saw no errors.

We had to study the logs while working with the system through its UI. It turned out that billing itself performs such requests by switching the scope to a specific user, for example admin, which it passes in the su parameter.
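A minimal sketch of adding such a user scope to a request. The only detail taken from the text is the parameter name `su`; the query-string transport, the helper `build_billing_query` and the other parameter names are assumptions made for illustration and do not describe the real billing API.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def build_billing_query(params: Mapping[str, str],
                        su: Optional[str] = None) -> str:
    """Encode request parameters, optionally scoped to a user via `su`."""
    query = dict(params)
    if su is not None:
        # Run the request in the scope of the given user.
        query["su"] = su
    return urlencode(query)


if __name__ == "__main__":
    # "action" and "amount" are hypothetical parameter names.
    print(build_billing_query({"action": "promised_payment", "amount": "100"},
                              su="admin"))
    # prints: action=promised_payment&amount=100&su=admin
```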

In general, despite the documentation gaps and minor API flaws, everything went pretty well. The logs can be read even under heavy load, if you understand how they are organized and what to look for. The structure of the database is ornate, but quite logical and in some ways even attractive.

To sum up, the main problems we had at the integration stage stem from the implementation specifics of this particular system:

  • undocumented “features” that affected us in one way or another;
  • closed source code (billing is written in C++) and, as a result, no way to solve the first problem other than by trial and error.

Fortunately, the product has a rather broad API and we integrated the following subsystems into our personal account:

  • technical support module – requests from the personal account are “proxied” to billing transparently for the customers of the service;
  • financial module – lets us invoice current customers, debit accounts and generate payment documents;
  • service management module – for this one we had to implement our own handler. The extensibility of the system played into our hands, and we “taught” Billy a new type of service.

It took some tinkering, but all in all, I think we get along with Billy.

Walks Through the Tungsten Fields – Tungsten Fabric

Tungsten fields strung with hundreds of wires that carry thousands of bits of information. The information is gathered into “packets” and taken apart again, tracing complex routes as if by magic.

This is the domain of the second system we had to make friends with: Tungsten Fabric (TF), formerly OpenContrail. Its job is to manage network equipment while giving us, as users, a software abstraction. TF is an SDN that encapsulates the complex logic of working with network equipment. There is a good article about the technology itself, for example, here.

The system is integrated with OpenStack (which will be discussed below) through a Neutron plugin.

OpenStack Services Interaction.

The guys from the operations department introduced us to this system. We use its API to manage the network stack of our services. So far it has not caused us serious problems or inconveniences (I can’t speak for the guys from operations), but there were some oddities in the interaction.

The first looked like this: over SSH, commands that print a large amount of data to the instance console simply “hung” the connection, while over VNC everything worked correctly.

For those who are not familiar with the problem, it looks rather funny: ls /root works correctly, while top, for example, “hangs” completely. Fortunately, we had already encountered similar problems. It was solved by tuning the MTU on the route from the compute nodes to the routers. By the way, this is not a TF problem.
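A likely explanation of the symptom, sketched under stated assumptions: short output travels in small packets, while a full screen of output fills packets up to the interface MTU. If some overhead is added along the way and the result no longer fits into the smallest MTU on the route, the large packets are lost. The sketch assumes oversized packets are dropped rather than fragmented; the overhead and MTU values are illustrative numbers, not measurements from the cluster.

```python
from typing import Iterable


def path_mtu(link_mtus: Iterable[int]) -> int:
    """The usable MTU of a route is the smallest MTU among its links."""
    return min(link_mtus)


def packet_passes(packet_size: int, link_mtus: Iterable[int],
                  overhead: int = 0) -> bool:
    """True if the packet plus any added headers fits into the path MTU."""
    return packet_size + overhead <= path_mtu(link_mtus)


if __name__ == "__main__":
    route = [1500, 1500, 1500]   # illustrative link MTUs
    overhead = 50                # illustrative extra headers on the path

    print(packet_passes(120, route, overhead))    # short command output
    print(packet_passes(1500, route, overhead))   # full-size packet
    # After raising the MTU along the route the full-size packet fits.
    print(packet_passes(1500, [1600, 1600, 1600], overhead))
```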

The next problem was waiting around the corner. At one “fine” moment the routing magic disappeared, just like that. TF stopped managing routing on the equipment.

We worked with OpenStack at the admin level and then switched to the scope of the required user. The SDN seems to “pick up” the scope of the user who performs the action. The thing is that the same admin account is used for communication between TF and OpenStack. At the step of switching to the user, the “magic” disappeared. We decided to create a separate account for working with the system, which let us work without breaking the integration.

Silicon Lifeforms – OpenStack

A bizarre silicon creature lives near the tungsten fields. More than anything it resembles an overgrown child that could crush us with one swing, yet it shows no apparent aggression. It is not frightening in itself, but its size inspires apprehension, as does the complexity of what goes on around it.

OpenStack is the core of our platform.

OpenStack has several subsystems, of which we use Nova, Glance and Cinder most actively. Each of them has its own API. Nova is responsible for compute resources and creating instances, Cinder manages volumes and their snapshots, and Glance is an image service that manages OS templates and their metadata.

Each service runs in a container, and the “white rabbit”, RabbitMQ, acts as the message broker.

This system has given us the most unexpected troubles.

The first problem was not long in coming: it appeared when we tried to attach an additional volume to a server. The Cinder API flatly refused to do the job. More precisely, if you believe OpenStack itself, the attachment is established, yet there is no disk device inside the virtual server.

We decided to take a detour and requested the same action from the Nova API. The result: the device attaches correctly and is accessible inside the server. It seems that the problem occurs when block storage does not respond to Cinder.
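For reference, attaching a volume through the Nova (Compute) API is a POST to the server's `os-volume_attachments` resource. The sketch below only builds the request; the helper name `nova_attach_volume_request` and the IDs are placeholders, the path is relative to the compute endpoint, and authentication headers are left out.

```python
import json
from typing import Optional, Tuple


def nova_attach_volume_request(server_id: str, volume_id: str,
                               device: Optional[str] = None
                               ) -> Tuple[str, str, str]:
    """Return (method, path, JSON body) for a Nova volume attachment."""
    attachment = {"volumeId": volume_id}
    if device is not None:
        # Optional device name hint, e.g. "/dev/vdb".
        attachment["device"] = device
    path = f"/servers/{server_id}/os-volume_attachments"
    return "POST", path, json.dumps({"volumeAttachment": attachment})


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    method, path, body = nova_attach_volume_request("SERVER_ID", "VOLUME_ID")
    print(method, path)
    print(body)
```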

Another difficulty awaited us when working with disks. The system volume could not be detached from the server.

Again, OpenStack itself “swears” that it has destroyed the attachment and that the volume can now be worked with separately. But the API categorically refused to perform operations on the disk.

Here we decided not to fight too hard and instead changed our view of the service logic. If there is an instance, there must be a system volume. So, for now, the user cannot remove or detach the system “disk” without deleting the “server”.
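On the backend, a rule like this comes down to a check performed before the detach or delete request is sent anywhere. A minimal sketch, assuming the service knows which volume is the system one; `Volume`, `is_system`, `SystemVolumeError` and `ensure_detachable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    volume_id: str
    is_system: bool  # True for the volume the server boots from


class SystemVolumeError(Exception):
    """Raised when an operation would separate a server from its system disk."""


def ensure_detachable(volume: Volume) -> None:
    """Reject detaching or deleting the system volume of an existing server."""
    if volume.is_system:
        raise SystemVolumeError(
            f"volume {volume.volume_id} is a system disk; delete the server instead"
        )


if __name__ == "__main__":
    ensure_detachable(Volume("data-1", is_system=False))  # passes silently
    try:
        ensure_detachable(Volume("root-1", is_system=True))
    except SystemVolumeError as exc:
        print(exc)
```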

OpenStack is a fairly complex set of systems with its own interaction logic and an ornate API. We are helped out by rather detailed documentation and, of course, by trial and error (where would we be without it).

Test run

We carried out a test launch in December last year. The main goal was to check our project under real conditions, both technically and in terms of UX. The audience was invited selectively and the test was closed. However, we also left an option to request access to the test on our website.

The test itself, of course, was not without curious moments, because this is where our adventures were only beginning.

First, we somewhat misjudged the interest in the project and had to quickly add compute nodes right during the test. A routine case for a cluster, but there were nuances. The documentation for a specific version of TF states the specific kernel version on which work with vRouter was tested. We decided to bring up nodes with more recent kernels. As a result, TF did not receive routes from those nodes. We had to urgently roll back the kernel.
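A pre-flight check of this kind is easy to automate before a node joins the cluster. A minimal sketch: the release strings below are placeholders, not the versions from the TF documentation, and `kernel_is_tested` is a hypothetical helper.

```python
import platform
from typing import Iterable, Optional

# Placeholder values; the real list comes from the documentation
# of the specific TF version in use.
TESTED_KERNEL_RELEASES = {"x.y.z-tested-1", "x.y.z-tested-2"}


def kernel_is_tested(tested_releases: Iterable[str],
                     release: Optional[str] = None) -> bool:
    """Check the running (or given) kernel release against the tested list."""
    current = platform.release() if release is None else release
    return current in set(tested_releases)


if __name__ == "__main__":
    if not kernel_is_tested(TESTED_KERNEL_RELEASES):
        print(f"kernel {platform.release()} was not tested with vRouter")
```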

Another funny thing is related to the “change password” button in the personal account.

We decided to use JWT to organize access to the personal account so as not to work with sessions. Since the systems are diverse and widely scattered, we manage our own token, in which we “wrap” the billing session and the OpenStack token. When the password is changed, the token naturally “goes bad”: the user data is no longer valid and the token has to be reissued.

We overlooked this point, and we simply did not have the resources to add this piece quickly. We had to cut the feature just before the test began. Currently, we log the user out when the password is changed.
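One common way to make a token “go bad” after a password change is to put a password version into its claims and compare it on every request. The sketch below is an assumption about how this could be done, not a description of the actual service: it implements an HS256-signed JWT with the standard library only, and the claim names (`billing_session`, `openstack_token`, `pwd_ver`) are hypothetical. Expiry checks are omitted for brevity.

```python
import base64
import hashlib
import hmac
import json
from typing import Any, Dict


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: Dict[str, Any], secret: bytes) -> str:
    """Build a compact JWT signed with HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes,
                 current_password_version: int) -> Dict[str, Any]:
    """Check the signature and reject tokens issued before a password change."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    signing_input = f"{parts[0]}.{parts[1]}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(parts[2])):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(parts[1]))
    if claims.get("pwd_ver") != current_password_version:
        raise ValueError("password changed, log in again")
    return claims


if __name__ == "__main__":
    secret = b"demo-secret"  # placeholder; never hard-code a real secret
    # The payload is signed, not encrypted: anyone holding the token can read it.
    token = issue_token({"sub": "user-1", "billing_session": "b-123",
                         "openstack_token": "os-456", "pwd_ver": 1}, secret)
    print(verify_token(token, secret, current_password_version=1)["sub"])
```

After a password change the stored version is incremented, so every previously issued token fails verification and the user has to log in again.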

Despite these nuances, testing went well. Over a couple of weeks, about 300 people came to us. We managed to look at the product through the eyes of users, test it in action and collect high-quality feedback.

To be continued

For many of us, this is the first project of this scale. We have learned a number of valuable lessons about how to work as a team, how to make architectural and design decisions, and how to integrate complex systems with limited resources and roll them out into production.

Of course, there is still work to do, both in the code and at the integration points between systems. The project is quite young, but we are full of ambition to grow it into a reliable and convenient service.

We have already managed to persuade the systems. Bill obediently handles counting, invoicing and user requests in his closet. The “magic” of the tungsten fields provides us with a stable connection. And only OpenStack sometimes acts up, shouting something like “WSREP has not yet prepared node for application use”. But that is a completely different story …

Most recently, we launched the service.
You can find out all the details on our website.

CLO Development Team
