Test circuit as a lifesaver for frequent releases

This is Igor Pomiluyko, Technical Director of Work Solutions. In May, I spoke at a PHP meetup, where I shared a case study on setting up a test circuit for a legacy project with technical debt. In this article, I will describe in detail what problems we encountered and why we started solving them by introducing a test circuit.
A little about the project
The project currently consists of two applications. In fact, one application is a copy of the other with minor differences. Each application is divided into services, which we will refer to as Frontend, Backend and X-API.

What are these services?
Frontend
Everyone is probably used to the frontend being an Angular, React or Vue application, or something similar. In our case, it is a standard PHP + jQuery application in which users work.
Backend
Provides an API for the frontend. It has its own database and a large number of integrations with external APIs.
X-API
Part of the business logic, separated from the monolithic backend, with its own database.
What other characteristics of the project should be highlighted?
Legacy stack: PHP 7.1, Zend 2, MySQL 5.6;
Large code base: 400 thousand lines of code per application, excluding templates;
No automated tests;
Complex subject area and several business domains;
Dozens of API integrations from various service providers;
No mock data for API;
Poorly structured code, with classes of five thousand lines;
Hundreds of warnings from PHPStorm.

Beginning of work
First, we conducted a technical audit of the codebase, which resulted in a hundred-page document. Solving the problems it identified required refactoring, but we could not start on it right away.
The project had 140 critical bugs that were costing the company money, and they needed to be fixed as soon as possible. So for the next two months we fixed defects and shipped urgent releases. New problems kept surfacing along the way, and refactoring came to be seen not as a recommended improvement but as a necessary measure to save the project from technical bankruptcy.
We then developed a roadmap for improving the system:

Key points to highlight:
Migrating to Symfony
PHP update
MySQL update
Test circuit
Code refactoring
Implementing Unit Testing
Provider Mock API Implementation
The roadmap diagram shows that many of these items depend on the test circuit, so we decided to build it first.
What problems did we face
How it should have been:
There is a master branch for production and a dev branch for the test server. A developer works on a task in a branch created from master and submits it for code review. The reviewer merges it into dev and sends it to testing. If testing succeeds, the task goes to master.
How it was in practice:
At some point, a pile of tasks accumulated in dev: releases that were too early to merge into master, as well as tasks that had not yet been tested. As a result, dev drifted out of sync with master, and the testing problems began.
In a branch created from master the code works, but after merging into dev it does not behave as expected. Constant merge conflicts eat up time and noticeably spoil the mood.

Among other things, releases need their own test stands. For example:
We have a release with a vendor API integration that is tested by the vendor itself. We are not the only developers working with this vendor, so testing is divided into slots that can be booked a month or two in advance.
Now imagine: the internal testers have checked everything, our slot comes up, and a developer breaks something while fixing a bug in a completely different place. That's it: the test fails and we wait another month. This is not acceptable, so something had to be done.
Test Circuit Requirements
Before proceeding with the implementation, we formed the main requirements:
Multiple stands can run at the same time. We have 9 developers on the project, a dozen tasks can be completed per day, and each of them needs to be tested in isolation;
Creating a stand should be easy and should not take much time;
The same applies to removing a stand. Otherwise, old stands may be left hanging and consuming resources;
Since tasks are tracked in Jira, we want an integration with it: a stand is created automatically when a task is moved to testing and deleted when testing succeeds. Ideally, a link to the stand is also added to the task automatically;
And of course, we need to know which stands are currently running.
Now we understand what we want, and we can begin to implement.
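As an illustration of the Jira requirement, here is a minimal sketch of how a status change could be mapped to a stand action. The status names, the StandAction type and the stands.example.test domain are assumptions made for this example, not details of the project's actual integration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status names; a real Jira workflow will use its own.
CREATE_ON = {"Testing"}
DELETE_ON = {"Done"}


@dataclass(frozen=True)
class StandAction:
    kind: str       # "create" or "delete"
    issue_key: str  # Jira issue key, e.g. "PROJ-123"
    stand_url: str  # link that could be posted back to the task


def stand_url_for(issue_key: str, base_domain: str = "stands.example.test") -> str:
    """Derive a stable stand URL from the issue key (assumed naming scheme)."""
    return f"https://{issue_key.lower()}.{base_domain}"


def action_for_transition(issue_key: str, new_status: str) -> Optional[StandAction]:
    """Decide what to do with a stand when a task changes status."""
    if new_status in CREATE_ON:
        return StandAction("create", issue_key, stand_url_for(issue_key))
    if new_status in DELETE_ON:
        return StandAction("delete", issue_key, stand_url_for(issue_key))
    return None  # any other transition leaves the stand untouched
```

Deriving the URL from the issue key means the same link can be added to the task as soon as the stand is requested.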
Test Circuit: Implementation
Design
First, we drew up a detailed plan of how the test circuit would work. Rather than describe the whole scheme, here are the main points:
We have two servers: a management server and a testing server.
Stands run on the testing server in Docker, with a reverse proxy in front of them.
The management server hosts the CI server, the registry, applications for conveniently managing all of this, and the Jira integration.
Now we have a concept of what it will look like, but it is not yet clear how to build it and which tools to use.
Infrastructure as code
We decided to automate server setup using modern DevOps practices and tools.
Ansible
We created an infrastructure repository and stored all configuration in it as Ansible code. For example, if we need to set up a server, add users, or install cron or Docker, we write an Ansible role. Deploying the management components is another Ansible role, and so is deploying an application. This allowed developers to assemble test stands locally.

Jenkins
So that each developer would not have to set up stands on their own computer, we needed an automation server. We chose Jenkins, which does all the main work:
1. Building base images for PHP, MySQL and Nginx. A base image is a specific version of PHP with the utilities installed on it; in general, the whole environment except the code.
2. Taking database dumps and packing them into images. Dumps are taken by cron once a week, and the packaging itself takes about 30 minutes. But this is done once, and then the data is launched on the test server as ready-made images.
3. Building stands, which is the main job. This includes building the images and running them on the test server.
4. Removing stands to free up resources: stopping the application, cleaning up the stand and removing its images from the registry.
As a result, the following scheme is obtained:

At this stage we already have a fully working product, but an inconvenient one. Anyone who has used Jenkins knows that configuring it for each stand is no pleasure, and it would be hard to get the team to adopt such a practice.
Custom control panel
We could not find a ready-made solution with a decent user interface, so we decided to write our own. We chose the Django + Vue + Vuetify stack. One developer implemented the necessary functionality in just two weeks:

The control panel itself is quite simple and consists of several models:
Settings. For example, an access key for Jenkins or GitLab. Roughly speaking, a key-value data store.
Stand configuration. This consists of three entities: project, service and stand.
A project has required fields, such as type and code, and a set of project parameters that may vary from project to project.
A project has services. They also have required fields, such as a symbolic code and a repository, plus any number of additional fields.
Together, the project and its services serve as a template for the stand: the stand parameters are formed from the project and service settings, and any of them can be overridden. These parameters are then passed as trigger parameters to the Jenkins pipelines.
Integrations with Jira, Jenkins, Docker Registry, GitLab and Traefik.
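The way stand parameters are assembled from a template can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the SERVICE_KEY naming scheme and the example job name are hypothetical. The only real Jenkins detail used is the buildWithParameters endpoint for parameterized jobs.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_stand_params(
    project: Dict[str, str],
    services: Dict[str, Dict[str, str]],
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge project and service settings into one flat set of stand parameters."""
    params = dict(project)  # start from the project-level defaults
    for code, service in services.items():
        for key, value in service.items():
            # Prefix service fields with the service code (assumed convention).
            params[f"{code.upper()}_{key}"] = value
    params.update(overrides or {})  # per-stand overrides win over the template
    return params


def jenkins_trigger_url(base_url: str, job: str, params: Dict[str, str]) -> str:
    """Build the URL that triggers a parameterized Jenkins job (sent as a POST)."""
    return f"{base_url.rstrip('/')}/job/{job}/buildWithParameters?{urlencode(params)}"
```

For example, a stand for one task could override only the backend branch and the PHP version, while everything else comes from the project template.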
Traefik
Traefik deserves special attention here. In our opinion, it is the best reverse proxy for Docker in the world. We did not look closely at other tools, but when we studied how this is done in Nginx, we were horrified and decided to use Traefik.
Why is it so good? Because it can make an application available at a specific URL. For example, 4 lines of configuration are enough to make a container available at the stand URL we need.

This is the whole configuration. Traefik itself also has settings, but they are just as simple.
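To give an idea of what such a configuration looks like, here is a sketch that generates container labels in the style of the Traefik v2 Docker provider. These are not the project's actual four lines: the stand name, the host and the "web" entrypoint name are assumptions, and the entrypoint must match whatever is defined in Traefik's own settings.

```python
from typing import Dict


def traefik_labels(stand: str, host: str, port: int = 80) -> Dict[str, str]:
    """Return Docker labels that route requests for `host` to the container."""
    return {
        # Tell Traefik to pick up this container.
        "traefik.enable": "true",
        # Route by host name; Traefik v2 rule syntax uses backticks.
        f"traefik.http.routers.{stand}.rule": f"Host(`{host}`)",
        # Entrypoint name is an assumption; it comes from Traefik's static config.
        f"traefik.http.routers.{stand}.entrypoints": "web",
        # Port inside the container that serves the application.
        f"traefik.http.services.{stand}.loadbalancer.server.port": str(port),
    }


def as_label_lines(labels: Dict[str, str]) -> str:
    """Render labels as key=value lines, e.g. for a Compose file or docker run."""
    return "\n".join(f"{key}={value}" for key, value in labels.items())
```

Because the labels depend only on the stand name and host, they can be generated per stand without touching the proxy configuration itself.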
To save resources, we introduced a rule that running stands are deleted every night, but quickly ran into a problem: the tester does not always have time to check a task before the stand is deleted, and the next day sees a 404 error.
Traefik helped here too. We added a mini-application that handles 404 errors. Now, when a tester opens a deleted stand, they can start re-creating it with the click of a button.

Registry + Portainer
All of our images are stored in the Docker Registry. In the current situation this came in handy, since Elastic and Logstash can no longer be downloaded without a VPN, but they are now in our registry.
On the downside, deleting an image by tag is not very convenient: there is simply no method for it. To delete an image, you must first request the manifest digest by tag, then delete the manifest by that digest, and then run the garbage collector inside the registry. Only then will the registry free up the space. Inconvenient, but tolerable.
Portainer provides a visual interface for managing containers. It lets you view logs, open a shell inside containers and restart them without connecting to the servers.
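The first two steps can be sketched against the Docker Registry HTTP API v2 as follows. This is a minimal illustration, not the project's code: the function names and the injectable transport are assumptions, and it presumes deletion is enabled in the registry configuration.

```python
import urllib.request
from typing import Callable, Dict, Tuple

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"

# A transport takes (method, url, headers) and returns (status, response headers).
Transport = Callable[[str, str, Dict[str, str]], Tuple[int, Dict[str, str]]]


def http_transport(method: str, url: str, headers: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Default transport: a plain HTTP request without authentication."""
    request = urllib.request.Request(url, method=method, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.status, dict(response.headers.items())


def delete_image_by_tag(
    registry: str, repository: str, tag: str, transport: Transport = http_transport
) -> str:
    """Resolve a tag to its manifest digest, then delete the manifest by digest."""
    base = f"{registry.rstrip('/')}/v2/{repository}/manifests"
    # Step 1: ask for the manifest by tag; the digest comes back in a header.
    _, headers = transport("HEAD", f"{base}/{tag}", {"Accept": MANIFEST_V2})
    digest = {k.lower(): v for k, v in headers.items()}["docker-content-digest"]
    # Step 2: delete the manifest by digest; the registry answers 202 Accepted.
    status, _ = transport("DELETE", f"{base}/{digest}", {})
    if status != 202:
        raise RuntimeError(f"registry refused to delete {digest}: HTTP {status}")
    return digest
```

The third step is not an API call: disk space is reclaimed only after the registry's garbage-collect command has been run inside the registry container.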
Results
We got rid of the pain of merging into the dev branch and no longer waste time on it. We can spin up as many test stands as we need, and quickly: a stand deploys in 4 minutes on average.
Each stand is isolated, so we also got rid of tasks being wrongly returned from testing. This was good for team morale: after all, it is unpleasant when you did everything right and the task still comes back with a failed test.
We also got a tool for trying out new technologies. An application on PHP 7.4 can be launched in a couple of clicks.