Moving to Kubernetes and cutting the infrastructure bill in half: the Adapty story

We recently helped our client Adapty migrate its infrastructure away from AWS managed services. It is now hosted in a Kubernetes cluster on regular instances of another cloud provider, and it can easily be moved to another data center if necessary. This business case is telling in many ways: Adapty was able to minimize vendor dependency, cut infrastructure costs by 50%, and remove some of the technical constraints on scaling and optimizing its applications.

We will explain why Adapty decided to leave AWS, what difficulties arose during the move, what role Flant played in the project, and what benefits the service gained from moving to Kubernetes.

Notes on the article

1. This is not a technical how-to guide, but a business case. The technical details mentioned here are there to help you see the big picture from a business perspective.

2. Although we give some figures for this case, the NDA does not allow us to name specific budget numbers for Adapty. Therefore, when it comes to economics, relative values are given.

About the service

Adapty is a service that helps app developers increase revenue from in-app subscriptions. The service includes a set of different tools: paywall A/B tests for better conversion of users into paid subscribers, a dashboard with the necessary metrics, an SDK for integrating in-app purchases, and more.

Adapty's clients are mobile applications. Some apps have millions of subscribers; in total, Adapty serves 120 million application users. As the customer base grows, traffic grows proportionally, and traffic on AWS is not free. So Adapty's bill grew as steadily as the service's audience. This became the main motive for moving to Google Cloud Platform (GCP).

Reasons for leaving AWS

Rising traffic costs. Adapty initially decided to host its entire infrastructure on AWS. The team made a preliminary estimate of the monthly bill in the Amazon calculator. However, this calculation only took into account fixed payments for virtual machines (VMs) and managed services. Non-fixed charges, including traffic, were not counted by Adapty.

The main problem was outbound traffic: the data that Adapty's servers transfer to clients' mobile applications (Data Transfer Out From Amazon EC2 To Internet). Due to the growth of the client base, the total traffic volume had grown to hundreds of TB per month. As a result, by the time of the migration, more than 60% of the AWS bill was made up of traffic charges alone.

High IOPS cost. The second major expense was IOPS. The Adapty database requires high disk throughput, and additional IOPS come at a price. On Amazon, in most cases, this option is significantly more expensive than, for example, on Google.

Vendor lock-in. The entire Adapty infrastructure consisted of Amazon managed services: Elastic Container Service (ECS) as the container orchestrator, Relational Database Service (RDS) for the PostgreSQL databases, Application Load Balancer for load balancing, CloudWatch for logging, ElastiCache for Redis, and the Kinesis-Lambda-S3 bundle for analytics. All these services work only in the Amazon cloud; they cannot be migrated or replicated anywhere else. In effect, Adapty had become dependent on a single platform: if something went wrong, it would not be possible to quickly migrate to another infrastructure.

Functional limitations of managed services. Many AWS managed services are user-friendly in terms of how easy they are to set up and maintain. The downside of that convenience is limited functionality. This applies, for example, to the ECS container orchestrator: its capabilities do not really compare with those of Kubernetes. The same goes for the Relational Database Service (RDS), a simplified version of cloud-hosted PostgreSQL that, for example, does not offer binary (physical) replication to external servers.

Adapty chose GCP for several reasons, the main one being the lower cost of traffic and IOPS. However, another provider could just as well have taken Google's place.

Migration and difficulties

Our SRE/DevOps team helped Adapty migrate the infrastructure and then took over its ongoing support. During the migration, not everything went according to plan, nor within the time frame that was originally expected.

The main work was completed in about three weeks. But taking into account the preparation and the stabilization of some services after the move, the whole project took about a month and a half.

Facebook post by Vitaly Davydov, CEO and co-founder of Adapty

What we did to prepare for the migration:

  • Built a Kubernetes cluster on GCP instances.

  • Planned how and when to migrate the database.

  • Worked out the transfer of the load balancer and traffic switching.

  • Set up CI/CD (build and deploy) for the applications.

  • Planned the migration of the applications.

Now, about the main difficulties that arose during the move.

Database

One of the critical limitations of Amazon RDS is that you cannot physically replicate a database to another cloud or to your own VM. In effect, the database had to be rebuilt from scratch.

The database itself is quite “heavy”: on AWS it took up about 1.3 TB. It is also updated frequently and intensively: up to 1 TB of data can change in a single hour. Moving from PostgreSQL 12 to 13 at the same time did not make the task any easier.

As a result, the database transfer took place in several stages and became the most time-consuming part of the project. The work took over two weeks.

Traffic balancing

On AWS, the Application Load Balancer (ALB) was used to block unwanted requests and return HTTP code 429. Google's balancer, Cloud Load Balancing (CLB), has a different structure and different functionality. After moving to CLB, we set up HTTPS termination and the same filtering as on AWS, but spent a lot of time configuring CLB because it is much more complex than ALB. At some point we even had to involve Google engineers.

Logging

As an alternative to CloudWatch, we used a system based on a proven Open Source stack: Elasticsearch, Logstash, Kibana (ELK). To implement in ELK the same functionality that Adapty had in CloudWatch, the system had to be heavily customized. Fine-tuning continued for some time after the move.

Results of the move

After the move, only the analytics service (Kinesis + Lambda + S3) remained on AWS. The rest of the services that run in production are now deployed on instances in the GCP cloud, mostly as Open Source solutions.

It is important to emphasize here: Adapty has moved to regular virtual machines hosted by Google. On top of the virtual machines runs Kubernetes that is not tied to any specific provider services: our Open Source platform Deckhouse, a CNCF-certified K8s distribution that works on any infrastructure. Instead of Google VMs, you could just as well use instances from another provider, your own OpenStack installation, or bare metal servers.

The only remaining vendor dependency is Google's managed Cloud Load Balancing service, which accepts client traffic and routes it to the Ingress in Kubernetes. This does not seem to be a big problem now, and in the future it, too, can be decoupled from the vendor: nginx deployed on a VM or on a dedicated server can handle the balancing.

You still have to pay for virtual machines, IOPS and traffic, but it comes out significantly cheaper than on AWS:

Adapty Infrastructure Before and After Migration

What we managed to save the most on was traffic and IOPS. As a result, the total price tag for cloud infrastructure dropped by 50%.

Why was this migration entrusted to us as contractors? Kirill Potekhin, CTO at Adapty, does not consider server administration a core expertise of the company and does not see it as a business advantage. For Adapty, it is more profitable to hand DevOps questions over to those who are good at them and to focus on the product itself.

“If we had tried to move to Kubernetes ourselves, we would have spent significantly more time figuring out how it works and how to work with it correctly. We also do not have the expertise to set up monitoring on our own, while Flant had a ready-made monitoring system to replace CloudWatch. With its help, we quickly found several inefficient queries that were heavily loading the database and rewrote them, reducing the load. There were metrics we did not even think we needed to look at, but, as it turned out, they are important in the context of Kubernetes.”

Kirill Potekhin

CTO Adapty

Technical bonuses from moving to K8s

By moving to GCP, Adapty got rid of the vendor lock-in of its core services. Combined with the migration to self-managed Kubernetes, this gave it independence from the IaaS provider: the cluster can be moved to another cloud, to dedicated servers, or to its own hardware. The new infrastructure brought other benefits as well.

Autoscaling

In Adapty's case, the dynamic autoscaling implemented in Deckhouse turned out to be very relevant. The client sees sudden traffic spikes: for example, RPS can quickly rise from an average of 2,000 to 30,000 and drop just as quickly. To handle such situations on AWS, dozens of copies of the applications were kept running on virtual machines “in reserve”, and these VMs were idle half the time. Kubernetes is now configured to dynamically autoscale nodes and Pods. This works thanks to the CCM (cloud-controller-manager) module, which orders nodes in the cloud on its own (AWS, GCP, Yandex.Cloud and others are supported). As soon as the need arises, additional Pods are deployed on these nodes. To speed up Pod startup, extra pre-ordered nodes can be kept on standby.
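To make this more concrete, here is a minimal sketch of how node autoscaling bounds might be described in Deckhouse via its NodeGroup resource. The field names are quoted from memory and should be checked against the Deckhouse documentation; the instance class name is hypothetical.

```yaml
# A hedged sketch of a Deckhouse NodeGroup with autoscaling bounds.
# CCM orders cloud instances within these limits as Pods demand capacity.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: worker
spec:
  nodeType: CloudEphemeral        # nodes are ordered in the cloud on demand
  cloudInstances:
    classReference:
      kind: GCPInstanceClass      # hypothetical instance class defined elsewhere
      name: worker
    minPerZone: 1                 # baseline nodes kept ready for faster Pod startup
    maxPerZone: 10                # upper bound for scaling out under load
```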

Our deployment tool werf also helped optimize autoscaling. In Adapty's case, HPA (Horizontal Pod Autoscaler) is used. For example, the number of replicas in a Deployment could be limited to 25 Pods while a lot of traffic comes to the service; HPA then increases the number of Pods to, say, 100. If the developers rolled out an update at that moment, the number of Pods was reset back to 25 and the service began to slow down. This was solved by using the replicas-on-creation annotation when rolling out the application via werf and removing the replicas: N parameter from the Deployment, as in the sketch below.
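Here is a minimal sketch of that combination: an HPA that owns the replica count, and a Deployment whose initial replica count is set only at creation time. The resource names and image are hypothetical, and the annotation key should be verified against the werf documentation for the version in use.

```yaml
# A hedged sketch: Deployment without spec.replicas, so HPA keeps control of scaling.
# The werf annotation (as we recall it) sets the replica count only on first creation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                                   # hypothetical service name
  annotations:
    werf.io/replicas-on-creation: "25"        # initial replicas; not reapplied on updates
spec:
  # no spec.replicas here: the replica count is managed by HPA after creation
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:latest   # hypothetical image
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 25
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70                # scale out when average CPU exceeds 70%
```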

Deckhouse has similar automation for disks: the platform orders them itself. For example, a database needs to be deployed in the cluster and requires a persistent disk. It can be ordered through Deckhouse storage classes; that is, it is enough to specify the required size and type of disk in a YAML manifest. Another convenient thing about disks is the ability to simply expand a PVC (PersistentVolumeClaim) volume in the cluster or in the YAML, followed by a re-rollout.
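A minimal sketch of ordering such a disk via a PVC, assuming a storage class that allows volume expansion; the class and claim names are hypothetical. Expanding the volume then comes down to raising the requested size and re-applying the manifest.

```yaml
# A hedged sketch: a persistent disk requested through a storage class.
# The storage class name is hypothetical; use one provided in your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ssd           # must allow volume expansion for later resizing
  resources:
    requests:
      storage: 200Gi              # raise this value and re-apply to expand the volume
```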

Pod stability

Amazon ECS tasks (the analog of Pods) often crashed due to memory issues and did not restart. While the cause of the first problem could usually be identified, the cause of the second could not. Adapty had to write a workaround script that checked every 30 seconds that nothing had crashed and, if there were problems, restarted the tasks. After the move to K8s, these problems no longer exist: if a Pod crashes, it restarts itself, and the reasons for what happened can be identified with standard tools.
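This relies on standard Kubernetes behavior rather than anything Adapty-specific: with the default restartPolicy, the kubelet restarts a crashed container, and the last termination reason (for example, OOMKilled) stays visible in the Pod status via `kubectl describe`. A minimal sketch with hypothetical names:

```yaml
# A hedged sketch: a Pod whose container is restarted automatically on crashes.
# If the memory limit below is exceeded, the container is OOM-killed and restarted,
# and the reason is recorded in status.containerStatuses.lastState.terminated.reason.
apiVersion: v1
kind: Pod
metadata:
  name: api                                   # hypothetical name
spec:
  restartPolicy: Always                       # the default: restart containers on failure
  containers:
  - name: api
    image: registry.example.com/api:latest    # hypothetical image
    resources:
      limits:
        memory: 512Mi                         # exceeding this triggers an OOM kill and restart
```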

CI/CD

Plain Kubernetes with CI/CD can be easier to work with than ECS because industry-standard tools are available. In our case, that tool was the already mentioned werf, but nothing prevents you from using something else that is more familiar and/or convenient for the engineers working on the project.

Future plans

As the next step in optimizing its infrastructure and its costs, Adapty plans to move some services to its own hardware in one of the local data centers. For example, the machines responsible for the API, which receive almost all of the traffic, are stateless: they do not store data but simply execute code, so they can be moved to a cheap data center. That way, there is no need to pay for traffic at all. Critical infrastructure, such as the database, will stay in GCP, because Google provides a sufficient level of fault tolerance and availability.

