Rewrite or refactor?
At the same time, the web messenger as a platform consists of four products:
the web version on vk.com;
fast chats (pop-up chats);
the mobile version, m.vk.com;
the VKontakte desktop messenger.
Not only did all these parts have separate code bases (even the messenger inside vk.com), they also had different priorities. The highest priority was vk.com, along with the mobile clients. Next came m.vk.com, where some things were occasionally left undone to save time and effort. The desktop version came third. My team inherited it as a finished project: it had been written in Electron by a single developer who later left the company. Nobody on the team knew the code base and there was no one to ask, so even CI had to be set up from scratch, which took a couple of months. Fast chats were supported with whatever resources were left over.
What brought us to the fork
As I said, VK Messenger originally consisted of four independent code bases. All of them were written a long time ago and had accumulated many changes since. Technology has moved on: some things have become simpler, others are simply done differently. For example, many languages have embraced static typing. With legacy code, you cannot always find the author and ask how it is structured. So every change is frightening, because you do not know how it will come back to haunt you.
In those days, our entire team watched the graphs with bated breath after each release. Whenever a problem appeared, we spent a long time looking for the cause, and every change raised the same question: “What will break, and where, if I delete this piece of code and replace it with another?” Given the size and structure of the code bases, even IDE tooling did not help much. Often we had to fall back on plain text search (in all sorts of creative ways) to find out whether a change would affect anything else.
[Screenshots: examples of the old code. They are shown as screenshots because this code no longer exists in the code bases.]
In addition, all interaction between the web client and the backend was built on acts, an outdated client-server interaction format that was used only on the web. We had to write and maintain them ourselves, because the backend team worked with the API. This produced its own class of errors and sometimes consumed resources that were needed to implement new features on the server side.
The third difficulty was the lack of unification: each code base used its own technologies and approaches. One of my first tasks at the company was marking a chat as unread. The guys probably wanted me to get an idea of all our projects, so they suggested that I implement it everywhere. I even wrote the acts for the backend myself.
The feature is very simple and requires no deep knowledge of the subject area. Still, it took me about a month, because each time I was effectively starting from scratch: understanding how the feature worked in vk.com did not help me implement it in m.vk.com, and so on. Getting immersed in everything took a lot of time. I only began to navigate the code more or less confidently after six months.
How we chose which path to take
Something had to change. But the team was divided: either refactor the code seriously or rewrite everything from scratch. There was a lot of discussion and debate, and some minor changes were made, but for a long time the overall situation stayed the same. The problem was that both paths were scary and full of uncertainty, and nothing could be tested quickly. That is why our discussions remained speculative.
Rewriting is scary: it carries a lot of responsibility, and there is always a non-zero chance of failure. But in our case refactoring did not reduce the risks. During a large-scale refactoring the code temporarily becomes more complex: since only part of it changes, it has to combine the new, clean interfaces with integration into the old parts. With a rewrite, such combinations move to the boundaries of the system and leave the new “core” intact. It was clear that a refactoring of this kind could take years, which meant that at least until the halfway point we would be living with an even more complex system.
For myself, I identified several arguments in favor of rewriting from scratch (perhaps I rationalized some of them while writing this text):
Refactoring did not solve the problem of heterogeneous code bases, and it was impossible to “rip out” one of them and reuse it elsewhere.
New engineers may join the team. It would be incredibly difficult and expensive to immerse them simultaneously in the old parts, the new parts, and the nuances of how the two combine during a refactoring. In our case, we simply assigned newcomers to the new code base only.
We (like many other teams in the world) saw the future in static typing and really wanted to use TypeScript. I also had very positive experience of using types to model the subject area, which helps eliminate impossible states from the system. This is common practice in the communities of other languages I am interested in (Haskell, Elm, Rust). To get the most out of TypeScript we would have needed to enable the strictest compiler settings, but with the existing code that was simply impossible. At first I even tried to change this: I went through the entire vk.com monorepo and edited the code of different teams. That way I managed to tighten the TypeScript config by just one setting, and then I gave up.
We had already tried refactoring individual parts, but it did not produce significant results. The existing approaches in the code bases boxed us in, and everything ended in mostly cosmetic changes.
With a rewrite, on the other hand, we had a clear path to benefiting from it immediately rather than years later. First, we could quickly replace and revive fast chats, which are displayed on every page and help draw more users into the messenger. Second, one of the important architectural goals of the new code base was “embeddability”. So the idea of embedding individual significant pieces, running on the new state and UI, into the old code came up quite quickly.
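The argument above about modeling the subject area with types can be illustrated with a small sketch. It is not code from VK Messenger: the `OutgoingMessage` type and the `describeMessage` function are hypothetical names chosen for the example. The idea is that each state carries only the fields that make sense for it, so an impossible combination cannot even be constructed.

```typescript
// Hypothetical model of an outgoing message; not the actual VK Messenger types.
// A "failed" message with a server ID, or a "sent" message without one,
// simply cannot be expressed with this type.
type OutgoingMessage =
  | { state: "sending"; localId: number; text: string }
  | { state: "sent"; localId: number; serverId: number; text: string }
  | { state: "failed"; localId: number; text: string; reason: string };

function describeMessage(msg: OutgoingMessage): string {
  switch (msg.state) {
    case "sending":
      return `#${msg.localId} is being sent`;
    case "sent":
      return `#${msg.localId} was stored as ${msg.serverId}`;
    case "failed":
      return `#${msg.localId} failed: ${msg.reason}`;
    default: {
      // Exhaustiveness check: adding a new state without handling it here
      // becomes a compile-time error.
      const unreachable: never = msg;
      return unreachable;
    }
  }
}
```

Checks like the `never` assignment only pay off under strict compiler settings, which is one reason the strictness of the TypeScript config matters so much here.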
But we argued not only about whether to refactor or rewrite, but also about how to organize the code, how to describe models, business logic and everything else. One day I got so tired of arguing that I decided to write a new prototype of a desktop messenger in my free time. I thought it would be much easier to make the case for my ideas with real, working code. I spent about three weeks on it alone and then brought in another senior member of the team (hello, Tim Chaptykov!), who helped polish it for a couple more weeks. The prototype covered no more than 10% of what the existing clients could do, yet it had some features that not all of them supported. It also contained example implementations of the key parts of a messenger: working with chat lists, and receiving and processing events from the server. We showed the prototype first to the entire web messenger team and then, at the next demo, to all the other messenger teams. The feedback was very positive, and the prototype was good proof to our management that we could both start and finish. So as not to disappear for several years, we agreed to rewrite in parts. This took a lot of time: the heterogeneity of the code bases and of the technologies used in them took its toll.
What steps did we take?
So, we had decided which way to go, but many problems still lay ahead. One was the huge number of integration points: several dozen.
For example, if a user starts playing music in a messenger tab, it should keep playing when they switch to other tabs or minimize the browser. The built-in VKontakte player already handled this kind of coordination, so we had to integrate with it. Or take incoming calls: a call in the messenger should ring in the current tab and not in all the others.
All these integration points were partly legacy: no documentation, no type descriptions. We had to do real archaeology to understand how the code worked.
At the same time, all existing functionality had to be preserved. Imagine suddenly losing the ability to attach certain types of files that your conversations had previously supported. Because of such nuances, we spent most of our time on integrating components with each other.
Although we chose to rewrite from scratch, we decided to take some components either from the existing messenger or from colleagues. For example, after integrating fast chats we borrowed the text input field from vk.com. In the end, though, we still wrote our own, because we needed integration with widgets and the desktop version.
vk.com has an effective mechanism for choosing a master tab, based on a consensus algorithm. It lets us reduce the number of frontend requests to the servers. Say you have ten VKontakte tabs open, each of which could go to the network for updates from the server. To avoid overloading the backend, only one of the open tabs makes the network requests, receives the data for all of them, and then distributes it to the other tabs.
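To make the master-tab idea more concrete, here is a deliberately simplified sketch. It is not the vk.com consensus algorithm, whose details are not described here. It assumes that every tab can see a shared table of heartbeats and applies the same deterministic rule: among the tabs that are still alive, the one with the smallest ID is the master. The names `TabHeartbeat`, `electMasterTab`, `shouldPollServer` and the timeout value are all assumptions made for the example.

```typescript
// Simplified stand-in for master-tab election (assumption: all tabs can read
// the same heartbeat table, e.g. from shared storage).
interface TabHeartbeat {
  tabId: string;
  lastSeenMs: number; // time of the tab's most recent heartbeat
}

const HEARTBEAT_TIMEOUT_MS = 5000; // assumed value

// Every tab runs the same rule over the same data, so all live tabs
// arrive at the same answer without further negotiation.
function electMasterTab(tabs: TabHeartbeat[], nowMs: number): string | null {
  const alive = tabs.filter((t) => nowMs - t.lastSeenMs <= HEARTBEAT_TIMEOUT_MS);
  if (alive.length === 0) return null;
  return alive.reduce((best, t) => (t.tabId < best.tabId ? t : best)).tabId;
}

// Only the master goes to the network; the other tabs wait for its broadcast.
function shouldPollServer(selfId: string, tabs: TabHeartbeat[], nowMs: number): boolean {
  return electMasterTab(tabs, nowMs) === selfId;
}
```

When the master tab is closed or frozen, its heartbeat goes stale and the next election picks another tab, so polling continues from exactly one place.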
I can't say that any features were particularly difficult for us, but some tasks took a lot of time. For example, VKontakte supports a huge number of attachment types: pictures, videos, audio recordings, broadcasts, playlists, artists, podcasts, documents, links, links to join, products, albums, gifts, and so on. It took a lot of effort to understand all the nuances of how these attachments work: to find out from other teams what data needs to be received and how to display it, how it is all integrated, and then to rewrite it from scratch.
Complexity and errors also came from the fact that VK Messenger is a very lively and interactive part of VKontakte. Something is constantly being updated, and some actions trigger others. So we spent a lot of time on optimization to improve overall performance.
What we have achieved
Thanks to the move to a separate repository, VK Messenger is now a kind of SDK package that we install and use in different parts of VKontakte. We take a piece of code with new features and add it to all four versions with minor adaptations. Embeddability was one of the key requirements for the new code base, and we met it thanks to carefully designed DI interfaces. A fifth project built on the new SDK is the messenger widget, which can now be found, for example, in Mail.ru's mail service. We also plan to use it to replace the old community messages widget, which third-party sites could embed.
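A minimal sketch of what embeddability through DI interfaces can look like. This is illustrative only and not the real SDK: `MessengerHost`, `EmbeddableMessenger` and `createLoggingHost` are hypothetical names. The messenger core never touches the surrounding page directly. It only calls an interface, and each product (vk.com, m.vk.com, the desktop app, a widget) supplies its own implementation.

```typescript
// Everything the messenger core needs from the page it is embedded into.
// Hypothetical interface; the real one is certainly much larger.
interface MessengerHost {
  openProfile(userId: number): void;
  playAudio(trackId: string): void;
}

class EmbeddableMessenger {
  private readonly host: MessengerHost;

  constructor(host: MessengerHost) {
    this.host = host;
  }

  handleAvatarClick(userId: number): void {
    this.host.openProfile(userId);
  }

  handleAudioClick(trackId: string): void {
    // The core does not know whether this is the built-in site player
    // or something else; the host decides.
    this.host.playAudio(trackId);
  }
}

// A trivial host, useful for tests or for a product without a player.
function createLoggingHost(log: string[]): MessengerHost {
  return {
    openProfile: (userId) => {
      log.push(`profile:${userId}`);
    },
    playAudio: (trackId) => {
      log.push(`audio:${trackId}`);
    },
  };
}
```

With this shape, adding a new product means writing one more host implementation rather than touching the core.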
It is harder for us to make mistakes now: the messenger currently has one of the lowest error rates in the code. The reason is a different approach to errors, which we adopted while rewriting the code bases: we avoid default values, log all errors, and analyze their causes in detail.
One of our important goals was to seriously improve the developer experience (DX). Because we changed our approach to error handling and invested heavily in modeling the subject area with TypeScript types, it became easier to understand how the product works and much harder to make mistakes through oversight. The onboarding of new people is excellent proof. When I joined the team, it took several months before you could ship any serious improvement; in recent years, newcomers have been shipping even fairly large tasks in their first month.
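The difference between "avoid default values" and the more common "fall back to something plausible" is easy to show in a few lines. The sketch below is an assumption about what such an approach might look like, not the actual code: `ParseResult`, `parseUnreadCount` and the in-memory `errorLog` are hypothetical. Instead of silently turning bad input into `0`, the function reports the problem, and the caller is forced by the type to handle it.

```typescript
// Hypothetical result type: the caller cannot read `value` without first
// checking `ok`, so a failure cannot be ignored by accident.
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

const errorLog: string[] = []; // stand-in for a real logging pipeline

function parseUnreadCount(raw: unknown): ParseResult<number> {
  if (typeof raw === "number" && Number.isInteger(raw) && raw >= 0) {
    return { ok: true, value: raw };
  }
  // No `return 0` here: a silent default would hide a backend or client bug.
  const error = `unexpected unread count: ${JSON.stringify(raw)}`;
  errorLog.push(error);
  return { ok: false, error };
}
```

The logged entries are what make the "analyze the causes in detail" step possible later.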
Previously, the messenger could be rendered twice: in KPHP for server-side rendering, and on the client for the interactive parts. In some places we could share “templates” between KPHP and JavaScript; in others we had to duplicate the markup logic. In the new code base, on the one hand, we could not afford server-side rendering (it would have required Node.js, which we did not want from an infrastructure standpoint); on the other hand, we wanted to speed up the cold start of the messenger tab. So we settled on an intermediate option: we pass our requests for initial data on to PHP and reuse the results when the page loads to build a cache. Thanks to response streaming, which KPHP recently implemented, we can prepare the data and return the HTML and static assets at the same time.
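The client side of such an "initial data cache" can be sketched roughly as follows. This is an assumption about the general shape, not a description of the real implementation: the `PreloadedResponses` class and the `example.getChats` method name are hypothetical. The server embeds the responses to the page's initial requests, and the client's API layer looks there first before going to the network.

```typescript
// Hypothetical helper: responses prepared by the server for the initial
// requests, keyed by method name plus serialized parameters.
class PreloadedResponses {
  private readonly byKey = new Map<string, unknown>();

  constructor(entries: Array<{ method: string; params: object; response: unknown }>) {
    for (const e of entries) {
      this.byKey.set(PreloadedResponses.keyOf(e.method, e.params), e.response);
    }
  }

  // Returns a preloaded response once and then forgets it, so that later
  // identical requests go to the network and receive fresh data.
  take(method: string, params: object): unknown {
    const key = PreloadedResponses.keyOf(method, params);
    const hit = this.byKey.get(key);
    this.byKey.delete(key);
    return hit;
  }

  // Note: the key depends on property order in `params`; a real
  // implementation would need a stable serialization.
  private static keyOf(method: string, params: object): string {
    return `${method}:${JSON.stringify(params)}`;
  }
}
```

An API client would call `take` first and fall back to a real network request when it returns `undefined`.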
We also started generating types for the API from the backend API schema. This helps avoid errors such as passing incorrect arguments. All web teams use the API types now, but the messenger was an early adopter and brought feedback and ideas to the implementation. VKontakte's highly developed infrastructure helps us too: we use its testing system, and we load data directly from the backend so that users can open pages faster.
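Here is a small sketch of how generated API types prevent incorrect arguments. The `ApiMethods` map stands in for the output of a schema-to-types generator. Its method names and fields are made up for the example and do not come from the real schema. The transport is synchronous only to keep the sketch short.

```typescript
// Hypothetical generator output: one entry per API method.
interface ApiMethods {
  "example.markAsUnread": {
    params: { peerId: number };
    response: { ok: boolean };
  };
  "example.getHistory": {
    params: { peerId: number; count: number };
    response: { items: string[] };
  };
}

// Untyped low-level transport (a real one would be asynchronous).
type ApiTransport = (method: string, params: object) => unknown;

// The method name selects both the parameter type and the response type,
// so a typo in the name or a missing argument fails at compile time.
function callApi<M extends keyof ApiMethods>(
  transport: ApiTransport,
  method: M,
  params: ApiMethods[M]["params"],
): ApiMethods[M]["response"] {
  return transport(method, params) as ApiMethods[M]["response"];
}
```

For instance, `callApi(transport, "example.getHistory", { peerId: 1 })` does not compile, because `count` is missing.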
Another interesting change is the text parsing mechanism. Each message is broken down into fragments (chunks). Previously, regular expressions were used to find them; this worked in its own peculiar way and was hard to update. Now we parse the text into an array of semantic chunks, from which we can then assemble text in whatever form we need.
We completely changed the way we work with language keys in the code. Previously, many functions were used for this in various combinations. It was easy to misspell a key name, and then the user would see unreadable messages; there was no way to check for this. We took advantage of the “major overhaul” and wrote a library that strictly types all translation keys. It protects against such errors and makes it very easy to track where specific keys are used and to delete them when necessary. And there is now only one method, with a clear algorithm.
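The benefit of semantic chunks is that the same parsed array can be rendered in different ways. The sketch below assumes a very small chunk vocabulary. The `MessageChunk` type, the two renderers and the `/id…` link format are illustrative and not taken from the real parser.

```typescript
// Hypothetical chunk types; a real messenger has many more kinds.
type MessageChunk =
  | { kind: "plain"; value: string }
  | { kind: "mention"; userId: number; label: string }
  | { kind: "link"; url: string };

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Rendering for places without markup, e.g. a notification preview.
function chunksToPlainText(chunks: MessageChunk[]): string {
  return chunks
    .map((c) => (c.kind === "plain" ? c.value : c.kind === "mention" ? c.label : c.url))
    .join("");
}

// Rendering for the chat itself: the same chunks, a different output.
function chunksToHtml(chunks: MessageChunk[]): string {
  return chunks
    .map((c) => {
      switch (c.kind) {
        case "plain":
          return escapeHtml(c.value);
        case "mention":
          return `<a href="/id${c.userId}">${escapeHtml(c.label)}</a>`;
        case "link":
          return `<a href="${escapeHtml(c.url)}">${escapeHtml(c.url)}</a>`;
      }
    })
    .join("");
}
```

Adding a new chunk kind then means extending the type and letting the compiler point at every renderer that must handle it.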
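One common way to get strictly typed translation keys in TypeScript is to derive the key type from the dictionary itself. The sketch below shows that general technique, not the library described above: the `translations` object, the `LangKey` type and the `getLang` function are hypothetical.

```typescript
// Hypothetical dictionary; in practice it would be generated or loaded.
const translations = {
  chat_mark_unread: "Mark as unread",
  chat_members_count: "{count} members",
} as const;

// The set of valid keys is derived from the dictionary, so a misspelled key
// is a compile-time error, and "find usages" works for every key.
type LangKey = keyof typeof translations;

// A single entry point: look up the key, then substitute {placeholders}.
function getLang(key: LangKey, vars: Record<string, string | number> = {}): string {
  return translations[key].replace(/\{(\w+)\}/g, (placeholder, name: string) =>
    name in vars ? String(vars[name]) : placeholder,
  );
}
```

For example, `getLang("chat_mark_unred")` is rejected by the compiler, while deleting a key from the dictionary immediately highlights every place that still uses it.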
The story of one fuck-up
So that no one thinks that everything was always perfect, I will tell the story of a small fuck-up. My own fuck-up.
Along with the request to send a new message, we pass a unique number; let's call it the message ID. It is needed later to match the optimistic state against the server's response. I decided that the current date would be a good basis for such a number. The ID has to fit into an int32, so the current date needs to be compressed to that size. The logical thing would have been to take the remainder of dividing it by the size of the desired range and be done with it. I will not go into details, but for some reason I decided to overdo the mathematics and took the date modulo a certain time interval, which turned out to be more than half of the int32 range (we take only the negative range).
As a result, at some point sending messages in fast chats simply stopped working because of an int32 overflow. The update was already in production, albeit for a small part of the audience. Fortunately, this happened at a time of minimal load, and we fixed the problem quickly.
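The arithmetic behind this kind of bug is easy to reproduce. The sketch below is a reconstruction under assumptions, not the original code: the actual interval is not given, so the modulus `4000000000` is made up, and the sketch uses the non-negative half of the int32 range for readability. The point is that `x % m` always lands in `[0, m)`, so the result is only guaranteed to fit when `m` itself does not exceed the allowed range.

```typescript
const INT32_MAX = 2147483647; // 2^31 - 1
const INT32_MIN = -2147483648; // -2^31

function fitsInt32(n: number): boolean {
  return Number.isInteger(n) && n >= INT32_MIN && n <= INT32_MAX;
}

// Compress a millisecond timestamp by taking the remainder modulo `modulus`.
// The result lies in [0, modulus), so it is safe only if modulus <= 2^31.
function compressTimestamp(timestampMs: number, modulus: number): number {
  return timestampMs % modulus;
}

const exampleNowMs = 1703000000000; // fixed timestamp for reproducibility

// Safe: the modulus equals the size of the non-negative int32 range.
const safeId = compressTimestamp(exampleNowMs, INT32_MAX + 1);

// Risky: a made-up interval larger than 2^31. Here the result is
// 3000000000, which no longer fits into an int32.
const riskyId = compressTimestamp(exampleNowMs, 4000000000);

console.log(fitsInt32(safeId), fitsInt32(riskyId));
```

Such a bug is also intermittent by nature: with an oversized modulus the result fits for part of each cycle and overflows for the rest, which matches the "at some point it simply stopped working" symptom.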
About the team
The VKontakte messenger team is the best I have ever worked on, both in professionalism and in work ethic. When the team was small, we did not even have plans, and no one assigned tasks to anyone: everyone simply showed enormous initiative. Having finished one task, people immediately picked up the next: building a feature, fixing errors, and so on. And they did not just try to close the ticket; they sincerely thought (and still think!) about how to make things better for the user. You can probably find people like that everywhere, but on our team everyone is like that. Being surrounded by such teammates motivates you even more and gives you energy. For me, it is an incredible experience.