Hi, my name is Alexander, and I lead the tracking solutions development team at Admitad.
During interviews I am almost always asked what the team does, what projects we have, and so on. Explaining verbally how the Admitad CPA network works takes a lot of time and is not very clear, so I decided to write an article that collects information about tracking, our team's services, and the problems we solve, in the form of answers to questions. I will also tell you how monitoring works for us, how microservices beat a monolith, what QA is good for, and a couple of other interesting things.
What does the team do?
The team's tasks include supporting the tracking services and researching and developing new tracking solutions. We also implement new features and the business ideas that product managers bring to us. In addition, we work on internal tasks to develop the services and pay down technical debt (how could we do without it?). Finally, the team is obliged to meet the SLA with the customer in terms of service downtime and reaction time to problems.
What is tracking?
“Tracking” here is a direct borrowing of the English word, which means “following”. Let me clarify right away that tracking does not mean espionage. A familiar analogy is the tracking of postal items: a parcel is registered at the post office and assigned a unique identifier so that the recipient can follow the status of its delivery.
Tracking in the Admitad affiliate network works in a similar way, except that Admitad tracks users' targeted actions for its partners.
What services are involved in tracking?
Our team develops and maintains Admitad's transition and action-registration services. Here is a schematic of their interaction with Admitad and its partners (solid arrows indicate active user actions, dotted arrows indicate the services' reactions to them).
Explanations for the diagram:
- An Admitad affiliate link initiates a user redirect through the transition service.
- The click_id, or transition ID, is generated on the fly and contains no personal data.
- An example of a partner site is an online store where the user wants to buy a TV (perform a targeted action).
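To make the flow concrete, here is a minimal sketch of the redirect step, assuming a UUID-based click_id passed via an `admitad_uid` query parameter. The function name and parameter names are illustrative, not the actual implementation:

```python
import uuid
from urllib.parse import urlencode


def make_redirect(partner_landing_url: str, campaign_id: str) -> tuple[str, str]:
    """Generate a click_id on the fly and build the redirect target.

    The click_id is a random identifier and carries no personal data.
    """
    click_id = uuid.uuid4().hex
    # The partner landing page receives the click_id as a query parameter,
    # so the later action registration can be attributed to this click.
    query = urlencode({"admitad_uid": click_id, "campaign": campaign_id})
    return click_id, f"{partner_landing_url}?{query}"
```

The service would respond with a 302 redirect to the returned URL while recording the click_id on its side.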
What are the requirements for the transition service?
The transition service has a special requirement: its operation must be invisible to users. The extra redirect on the way to a landing page must not cause delays, otherwise browsing would become uncomfortable for the user. Recently, on top of the transition service, we built a new solution for partners (cashback services) that removes the redirect on the user's side entirely.
Since users live in different countries, the nodes of the transition service sit behind load balancers located in different parts of the world. Thanks to this, the average server response time is ~10 ms in any region.
The workload ranges from ~1,000 to 2,000 RPS. Naturally, the cluster has a safety margin in case the load increases, since traffic volume can suddenly grow by 1.5-2x.
Since this is Admitad's main service, its 24/7 availability is ensured by redundant nodes in several data centers. Click_ids are guaranteed to be delivered to Admitad via RabbitMQ queues.
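A sketch of how such a click event might be serialized before publishing; the field names are illustrative, not the actual message schema. The delivery guarantee itself comes from the broker side: with a durable queue and persistent delivery (e.g. pika's `delivery_mode=2`), RabbitMQ keeps the message on disk until a consumer acknowledges it.

```python
import json
import time


def click_message(click_id: str, campaign_id: str) -> bytes:
    """Build a message body for the click queue (illustrative fields)."""
    payload = {
        "click_id": click_id,
        "campaign_id": campaign_id,
        "ts": int(time.time()),  # server-side click timestamp
    }
    # JSON keeps the payload language-agnostic for downstream consumers.
    return json.dumps(payload).encode("utf-8")
```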
The team, for its part, makes sure there is always a stable release with up-to-date business logic in production. Code quality is maintained by mandatory team review and automated test pipelines in CI.
If a bug slips into production or memory starts leaking on a server, notifications in Sentry and Slack let us react to the error in time, and we watch the current hardware metrics on Grafana dashboards.
Now that we are familiar with the transition service and the requirements for it, I will tell you more about the team’s tasks using the example of service optimization.
Why split the transition service?
The transition service also partially performs action-registration tasks, since it has a convenient distributed infrastructure, so the current user journey actually looks more like this.
The functions of two services with different purposes are combined in one. Why did this happen? Historically, Admitad's technology platform was a monolith. As it evolved, it began to break apart into independent services, and the current transition service is one of them.
We want to take advantage of the microservice architecture and separate the logic cleanly. By extracting part of the transition service into a new service, we can deploy it separately from the old one. Separate deployments will let us distribute load between applications more flexibly: we will be able to scale only the service whose load is growing, and experiment with traffic balancing schemes depending on whether the applications are distributed among containers on one node or across different nodes.
How do we speed up the transition service?
The volume of traffic passing through the transition service grows every year. On sale days, such as AliExpress Singles' Day on 11.11, load peaks, so the company prepares for them in advance: a code freeze is introduced and DevOps engineers verify the scripts that deploy additional nodes.
Obviously, horizontal and vertical scaling increases the cost of maintaining the infrastructure. A preferable option is to optimize the web application so it withstands more RPS on the same hardware, so when writing the new service we decided to abandon Flask and move to aiohttp.
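The gain from an asynchronous framework comes from overlapping I/O waits instead of blocking a worker on each one. A minimal stdlib illustration of the idea (not the service code itself):

```python
import asyncio
import time


async def fake_io(delay: float) -> float:
    # Stands in for a network call, e.g. a lookup during a redirect.
    await asyncio.sleep(delay)
    return delay


async def main() -> float:
    start = time.monotonic()
    # Ten 50 ms "requests" run concurrently on a single thread,
    # so total wall time stays close to 50 ms rather than 500 ms.
    await asyncio.gather(*(fake_io(0.05) for _ in range(10)))
    return time.monotonic() - start


elapsed = asyncio.run(main())
```

A synchronous framework would hold a blocking worker for each of those waits; an asyncio-based server like aiohttp keeps serving other requests on the same thread in the meantime.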
We (and not only we) believe that an asynchronous framework will let the new service handle a heavy load. Of course, one should not rely on assumptions alone, so we launched a load testing process: we spun up a dedicated node, deployed the application to it, and now load the service's endpoints with Apache JMeter and Locust. By comparing the metrics of the old and new services on the same node, we will be able to draw sound conclusions about performance.
Load testing turned out to be an interesting area with its own tools, many of which we tried ourselves. Let me know if you would like to see an article about it.
In parallel with development, we research the legacy code and optimize the existing code base. While dividing the services, we found endpoints that may once have been needed, but no one remembers why anymore. We set ourselves the task of figuring it out and removed a couple of service classes that no longer do anything. Such micro-optimizations are a nice bonus of the development process.
And how not to break everything?
As already mentioned, the transition service is critical for Admitad, so when moving logic into the new service we must not introduce bugs that would lead to irrecoverable losses.
We asked our QA colleagues to write E2E tests that check the transition service as a black box. By changing the contents of the “box”, we can make sure the behavior of the new application stays the same for both the user and Admitad. And wiring the E2E tests in as a separate CI stage will let us deploy calmly on Fridays in the future (well, not really).
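The black-box idea can be sketched as follows. `handle_transition` here is a toy stand-in for the real HTTP service, and the test asserts only externally visible behavior (status code and redirect target), so the implementation behind the contract can be swapped freely:

```python
import uuid
from urllib.parse import parse_qs, urlparse


def handle_transition(campaign_id: str) -> tuple[int, dict]:
    """Toy stand-in for the transition service (the real one is an HTTP app)."""
    click_id = uuid.uuid4().hex
    return 302, {"Location": f"https://shop.example/landing?admitad_uid={click_id}"}


def test_redirect_contract() -> None:
    # Black-box assertions: only the observable HTTP contract is checked,
    # never the internals of the service.
    status, headers = handle_transition("c42")
    assert status == 302  # the user gets redirected
    query = parse_qs(urlparse(headers["Location"]).query)
    assert len(query["admitad_uid"][0]) == 32  # a click_id travels with them


test_redirect_contract()
```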
We have plans to make the transition service better, faster, and more reliable.
- By the end of this year, we want to profile it in order to find and optimize bottlenecks.
- A big feature is almost finished that will let us customize the business logic of redirects through configuration, without writing code.
- The new aiohttp registration service is ready. It is time to take it to production; we are waiting for the go-ahead from QA.
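The profiling mentioned in the plans can be started with the standard library's cProfile; the `hot_path` function below is a hypothetical placeholder for a request-handling code path, not the actual service code:

```python
import cProfile
import io
import pstats


def hot_path() -> None:
    # Placeholder for the code path whose bottlenecks we want to find.
    sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    hot_path()
profiler.disable()

# Report the functions with the highest cumulative time --
# these are the candidates for optimization.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```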
It is quite possible that implementing these immediate plans will take the transition service to a new stage of its evolution.
Thanks for reading this review. I hope the answers to the questions were informative enough to give you a general idea of Admitad’s tracking services.
If you have experience building high-load services on aiohttp or comparing the performance of different Python frameworks, please share it in the comments. It would also be interesting to read about your experience profiling and load testing distributed systems. What tools did you use? What would you recommend reading on the topic?