“Sometimes it hurts.” How we built a cloud at Rosbank on ManageIQ

Never before has our cloud team spent so much time on GitHub. We deployed a private cloud based on ManageIQ at Rosbank and ran into a lot of, uh... difficulties that go by the name “Open Source is being introduced into Enterprise”. We will tell you which rakes we stepped on, and even take the liberty of formulating “what not to do”: it will be useful both to us and to those who follow in our footsteps.

Before the start of the cloud

Rosbank already had a private cloud on good old VMware. It was good at allocating virtual machines, and up to a certain point this was enough. At some point, though, the bank’s advanced IT specialists outgrew this framework. They wanted to bring other services into the cloud and hand them to developers on the fly. That’s one. The bank also needed granular billing and metering of consumed resources, to determine clearly which departments use which services and to what extent, so that they can pay for them fairly. That’s two. Monitoring and visualization were also required to keep track of the entire cloud “economy”. That’s three. Another significant factor was the bank’s gradual transition to Open Source technologies. That’s four. Together, these reasons kicked off the modernization of the private cloud. We had to not only build a steam engine with a hammer, screws and metal parts, but also stuff it with smart electronics.

Choosing what to ride

Together with Rosbank, we chose a management platform from three options: Morpheus, Terraform and ManageIQ. Morpheus is good, but not Open Source. ManageIQ stood out for the completeness of its functionality. Compared to Terraform, for example, it has a portal and other extras: service provisioning scenarios, life cycle management of the infrastructure objects it creates, the ability to automate any IT service on the “everything as a service” (XaaS) principle, and so on. Along the way it also turned out that the bank’s specialists had already worked with ManageIQ, and this factor should not be written off.

We also evaluated ManageIQ from a development perspective. The product has been actively developed as an open project since 2014, when Red Hat, which had acquired the company behind it in 2012, opened the code. The platform became the foundation of CloudForms. Then Red Hat became part of IBM, and since then ManageIQ has supported more IBM services. But this “IBM-ization” has affected neither the project’s open source prospects nor the openness of its code. As a result, ManageIQ is fairly popular among companies in various fields. This product became the core of our “cyber steam” engine.

The process and the rakes we stepped on

When we started implementing the system at the bank, we had to reckon with a number of factors that became spokes in the wheels of our project. COVID was enemy #1. Because of it, we could not work at the customer’s site, even though a huge part of the work had to be done in Rosbank’s “native” environment.

We spent a lot of effort on customizing and refining both ManageIQ itself and the related information systems. And even this was not the hardest part; it only added spice. Now that the main difficulties are behind us, we would list the most painful points of the project as follows:

  1. Cost allocation. It can change. In our project, the cost allocation model was radically revised. At first, each IT service ordered in the bank was tied to a business group. A business group is a set of computing resources belonging to an entire line of business, so it could be large. But as the infrastructure developed, it was decided to tie services to specific information systems. That changed everything, from the principles of budgeting and billing to the role model and the approach to multi-tenancy.

  2. Integrations. We had to integrate the cloud platform with 13 systems. Some integrations were easy, such as Active Directory, GitLab and vSphere. But for many systems the integration had to be developed from scratch: NSX, the backup system, the data storage system, the CMDB, the ITSM system and so on. Linking the cloud with the CMDB and ITSM was especially painful, because the APIs of these systems were not prepared to work with ManageIQ. We had to modify the CMDB system. Internal approvals were carried out through ITSM. If an approval process needed to be changed, we changed it in the cloud rather than in the ITSM system; otherwise it would have taken too long.

  3. Security. Any corporate environment, and especially a bank’s infrastructure, is built within the strict framework of information security requirements. Automation scenarios in the cloud can potentially open security holes. So the security officers examined the cloud “under a magnifying glass”, ran penetration tests, and studied the technical solutions in detail. Information security systems originally designed for integration with proprietary software could be integrated with Open Source only after modifications.

  4. Legacy. In any ecosystem there is something very legacy that requires dancing with tambourines. In our case it was Active Directory (AD). The historical layers in such a system resemble the growth rings of an age-old tree, and there is always a corner in it that tells of the glorious past of the IT infrastructure. Automating the interaction with such a “heritage” took twice as long.

  5. Visualization. Rosbank already used Grafana, but the bank’s IT department wanted to make it more visual and complete. Here Rosbank did a great job: it formulated the requirements and took great care over ergonomics and the convenience of analytics for the people who would use it. Together we created detailed dashboards for different categories of users. Cloud users can see how much their ordered IT services are consuming. Management has access to reports on the workload and health of the infrastructure as a whole. Administrators get a breakdown of the workload of individual infrastructure components, along with information on how many resources are left and how much can still be allocated.

A very specific kind of billing

Now for a slightly bitter part. Automating IT infrastructure does not yet make a cloud. There must be billing, and we built that too. You may say: “Stop! ManageIQ already has billing!” That is true, and it does an excellent job of charging for VM allocation. But Rosbank went further and wanted to account for the consumption of other services as well: Kubernetes namespaces, storage resources (block and file), and ELK cluster resources (centralized logging).

Billing algorithms in orchestrators are usually “linear”. They are poorly suited to domestic corporate IT, where many parameters go into the cost. In a public cloud, you order a virtual machine and it does not matter what hardware it physically lands on: the provider sells it at a fixed price. In a private cloud, everything is more complicated. The price for an old blade and for a new server will differ. Tariffs sometimes differ between the departments that use the cloud. The cost of a service is sometimes affected by the licensing policy of system and application software, and then a more expensive product suddenly turns out to be cheaper thanks to special agreements with the vendor. Part of the equipment under the cloud may belong to a particular business unit. On top of that, companies often want to be able to change the calculation logic easily. As a result, the algorithm for calculating the cost of resources can turn out to be tricky and non-linear.
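To make the “non-linear” part concrete, here is a minimal sketch of such a tariff in Python. It is not the algorithm used in the project: the rates, department names and the `VmUsage` structure are all invented for illustration, and it covers only three of the factors mentioned above (hardware generation, per-department tariffs and licensing, plus hardware owned by a business unit).

```python
from dataclasses import dataclass

# All names and rates below are hypothetical, chosen only for illustration.
HARDWARE_RATE = {"old_blade": 0.8, "new_server": 1.5}  # price per vCPU-hour
DEPARTMENT_FACTOR = {"retail": 1.0, "treasury": 1.2}   # per-department multiplier


@dataclass
class VmUsage:
    department: str
    hardware: str
    vcpu: int
    hours: float
    license_fee: float = 0.0     # flat fee for system/application software
    owns_hardware: bool = False  # the business unit already owns the hosts


def vm_cost(usage: VmUsage) -> float:
    """Cost of one VM for the period, depending on more than just its size."""
    if usage.owns_hardware:
        # No compute charge when the equipment belongs to the department itself.
        compute = 0.0
    else:
        compute = HARDWARE_RATE[usage.hardware] * usage.vcpu * usage.hours
    return round(compute * DEPARTMENT_FACTOR[usage.department] + usage.license_fee, 2)


if __name__ == "__main__":
    print(vm_cost(VmUsage("retail", "old_blade", vcpu=2, hours=100)))
    print(vm_cost(VmUsage("treasury", "new_server", vcpu=2, hours=100, license_fee=50)))
```

Two identical VMs get different prices here depending on where they run and who ordered them, which is exactly what a fixed per-VM rate cannot express.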

In general, Rosbank needed a customized billing system that would take all of its needs into account. We implemented it in two parts. The first component is, surprise, an Excel calculator. All the magic of IT budgeting lives in it. To create it, we collected about 10,000 infrastructure parameters: the hardware inside the servers, their location, the parameters of each storage component, how many people maintain each piece of equipment, and so on. Such a detailed study yields the exact cost of each service. The output of Excel is a ready-made price list in which each IT service item has its own price. The second billing ingredient is a Python application we wrote, which calculates the cost of the ordered services according to this price list and generates a report. So, to change the calculation logic, you only need to adjust the Excel file, and that does not take a programmer.
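The split between a price list and a small calculating application can be sketched roughly as follows. This is an illustration, not the project’s code: it assumes the price list has been exported from Excel to CSV, and the service names, prices and orders are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical price list, as it might look after exporting the Excel sheet to CSV.
PRICE_LIST_CSV = """service,unit,price
vm_vcpu,vcpu-month,500
vm_ram,gb-month,120
k8s_namespace,namespace-month,3000
block_storage,gb-month,8
"""

# Hypothetical orders: (information system, service, quantity).
ORDERS = [
    ("crm", "vm_vcpu", 4),
    ("crm", "vm_ram", 16),
    ("crm", "block_storage", 200),
    ("scoring", "k8s_namespace", 1),
]


def load_price_list(text):
    """Map each service name to its unit price."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["service"]: float(row["price"]) for row in reader}


def build_report(prices, orders):
    """Sum the cost of the ordered services per information system."""
    totals = defaultdict(float)
    for system, service, quantity in orders:
        totals[system] += prices[service] * quantity
    return dict(totals)


prices = load_price_list(PRICE_LIST_CSV)
report = build_report(prices, ORDERS)
for system, total in sorted(report.items()):
    print(f"{system}: {total:.2f}")
```

The point of the design is that all pricing knowledge sits in the price list: changing a tariff means editing a cell, while the code that multiplies and sums stays untouched.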

If we recall the image of the “fashionable” steam engine we are creating, we can say that with the advent of billing, it will be able to issue a check for each turn of the gear.

DevOps and CI/CD

It is good when program code is completely independent of the environment. In practice, this does not always work out. ManageIQ itself, for example, was born long before the DevOps approach became mainstream, and not all of the product’s configuration is stored as code. So moving a new release of services to the production environment sometimes becomes a task for superheroes. Much is still done by hand, but sooner or later we will automate everything.

At Rosbank, the test and production infrastructures were fundamentally different. This is typical of any long-established corporate environment. The infrastructures differed in AD structure, access policies, network topology and so on. We chose the standard approach: move all environment differences out of the code and into configuration parameters. The problem is that at the development stage it is not always possible to predict what will need to be parameterized.
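A minimal sketch of that approach in Python is shown below. The environment names come from the text; every setting, value and function name is an assumption made for illustration. The check for missing keys addresses the problem just described: when a new parameter turns out to be needed, an environment that lacks it fails loudly instead of silently reusing a test value.

```python
# Hypothetical per-environment settings; in practice they would live in
# a configuration file or inventory rather than in the code itself.
ENVIRONMENTS = {
    "test": {
        "ad_base_dn": "OU=Cloud,DC=test,DC=example",
        "vm_network": "test-net",
        "approval_required": False,
    },
    "prod": {
        "ad_base_dn": "OU=Cloud,OU=Servers,DC=corp,DC=example",
        "vm_network": "prod-dmz",
        "approval_required": True,
    },
}

REQUIRED_KEYS = {"ad_base_dn", "vm_network", "approval_required"}


def load_settings(env):
    """Return the settings for an environment, failing if any key is missing."""
    settings = ENVIRONMENTS[env]
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise KeyError(f"{env}: missing settings {sorted(missing)}")
    return settings


def provision_request(env, vm_name):
    """Build a provisioning request without any environment-specific literals."""
    settings = load_settings(env)
    return {
        "name": vm_name,
        "network": settings["vm_network"],
        "ad_ou": settings["ad_base_dn"],
        "needs_approval": settings["approval_required"],
    }


if __name__ == "__main__":
    print(provision_request("test", "app01"))
    print(provision_request("prod", "app01"))
```

The same `provision_request` code runs in both environments; only the parameters differ.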

Code developed in the test environment did not work in production the first time. So, while prod had not yet become a real prod, we polished each service on it, then moved the code back to the test environment, finished it there and carried it to prod again. Anticipating this path, we developed autotests for all services. As a result, we achieved a more or less seamless code transfer. Phew!

To avoid getting lost in this tricky scenario, we introduced Ansible for configuration management. It helps us catch bugs in the test environment, start developing a patch, and then roll the patch out to every environment: test, pre-prod and prod.

What added extra drive to the project was that we were dealing with pure Open Source. With a vendor, software updates are predictable: you download a patch, apply it according to the instructions, and get the expected result. With Open Source, moving from version to version can be difficult. We ported several bug fixes from more recent ManageIQ releases to our version, which was already running stably in the banking environment, and automated the porting process with Ansible. The resulting software is like a patchwork quilt, but it works better than a plain upgrade to the newer version would.

Pitfalls of migrating to ManageIQ

If the customer’s requirements for the cloud go beyond simply deploying a VM from a template, you will have to work hard on ManageIQ. The platform’s basic out-of-the-box capabilities will need to be extended. You can take ready-made repositories (for example, rhc-miq-quickstart) or develop something of your own. We came out of this project with a minimal list of must-haves for building a cloud on ManageIQ:

  • at least one specialist who codes in Ruby;

  • a well-developed concept and a clear understanding of the target solution; after all, when customizing Open Source you can end up with completely different results;

  • willingness to change the architecture during the project, if it is required by common sense;

  • a focus on the functionality of the current software versions, since with non-commercial releases you should not rely too much on exact release dates;

  • availability of resources for integrating new software with existing systems, especially outdated and exotic ones;

  • the ability to assemble a working system from already tested versions plus pieces of code from newer ones, as well as to file requests (issues) on GitHub.

That said, while working with ManageIQ we noticed that the project really is developing dynamically. Bugs are sometimes fixed much faster than vendors of similar solutions respond. Of 10 major patches, we had to develop only three ourselves; the rest came from updates on GitHub.

All in all, our steam engine with modern firmware is already up and running. Developers have received fast allocation of IT infrastructure that was previously unattainable, management looks at the analytics and understands the overall state of the infrastructure, settlements between departments are in place, and we continue to improve the mechanism.

Authors:

Evgeniy Annenkov, head of complex projects department, Jet Infosystems

Anastasia Meshcheryakova, Jet Infosystems Project Manager

Vyacheslav Medvedev, Head of Jet Infosystems
