Unexpected complexity of simple programs

More than once I have been met with surprise when announcing a project estimate: “Why so long?”, “Come on, it’s one, two, and done!”, “You can just take X and stick it in Y!” Programmers are used to thinking of deadlines as the time spent writing and debugging code, although large tasks involve much more.

Did you know that real icebergs float horizontally in the water rather than vertically, as in most stock images?

But even if you set aside the usual enterprise trappings such as analytics, backward compatibility support, and A/B testing, and focus purely on the code directly related to the feature being implemented, you can see that its complexity often gets out of control.

In this article I will walk through several features that my colleagues and I implemented at Joom at different times, from problem statement to implementation details, and show how easily seemingly simple things turn into a tangle of complex logic that takes many development iterations.

User search

One of the large sections of the Joom app is the internal social network, where shoppers can write product reviews, like and discuss them, and subscribe to each other. And what is a social network without user search?

Of course, search is not as easy a task as it looks (at least not after my previous article). But I already had all the necessary knowledge, and our company also had a ready-made component, joom-mongo-connector, which could transfer data from a MongoDB collection into an Elasticsearch index, joining in additional data if necessary and doing some other post-processing. The task sounded pretty simple.

The task. Build a backend for searching social network users. No filters are needed, and sorting by number of subscribers will do for a start.

Okay, that really sounds simple. We set up the transfer of the socialUsers collection into Elasticsearch by writing a YAML config. On the backend, we add a new endpoint with an API similar to the product search API, but so far without support for filters and sorting (only the query text and pagination remain, that’s all). In the handler, we make a simple request to the Elasticsearch cluster (the main thing is not to pick the wrong cluster!), take the IDs of the found documents from the result (they are the user IDs), fetch the users themselves by those IDs, convert them to client JSON, hiding private information from prying eyes, and we are done. Or are we?
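The handler flow just described can be sketched roughly as follows. This is only an illustration: the function names, the field names, and the two injected callables (search_ids standing in for the Elasticsearch query, load_users for the database read) are assumptions, not the real Joom code.

```python
from typing import Callable, Dict, List

# Hypothetical whitelist of fields that are safe to show to other users.
PUBLIC_FIELDS = ("id", "name", "subscribers")


def to_client_json(user: Dict) -> Dict:
    """Keep only public fields so private data never reaches the client."""
    return {key: user[key] for key in PUBLIC_FIELDS if key in user}


def search_users(
    query: str,
    offset: int,
    limit: int,
    search_ids: Callable[[str, int, int], List[str]],
    load_users: Callable[[List[str]], List[Dict]],
) -> List[Dict]:
    """Find user IDs in the index, load the users, and return client JSON."""
    ids = search_ids(query, offset, limit)
    by_id = {user["id"]: user for user in load_users(ids)}
    # Preserve the ranking returned by the index; skip users deleted since indexing.
    return [to_client_json(by_id[i]) for i in ids if i in by_id]
```

Passing the index and the database in as callables keeps the sketch self-contained; a real handler would call its own clients directly.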

The first problem we ran into was transliteration. Usernames were taken from social networks, where users from Russia (the majority at that time) often wrote them in Latin letters. You search for someone in Cyrillic, but on Facebook his name is spelled in Latin letters, and that’s it: he is not in the results. Likewise, a search for Иван will not find Ivan, although we would very much like it to.

This was the first complication: during indexing we began to call the Microsoft Translator API for transliteration and to store two versions of the first and last name, and the general-purpose indexing component came to depend on the transliterator client (and still does).
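A minimal sketch of what storing two versions of each name might look like at indexing time. The transliterate callable stands in for the external API client; the toy table, the function names, and the document field names are invented for illustration.

```python
from typing import Callable, Dict, List


def build_name_variants(
    first: str, last: str, transliterate: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Return the original and transliterated spellings to index for a user."""

    def variants(name: str) -> List[str]:
        alt = transliterate(name)
        # Store the transliteration only when it differs from the original.
        return [name] if alt == name else [name, alt]

    return {"firstName": variants(first), "lastName": variants(last)}


# Toy stand-in for the external transliteration service (illustration only).
TOY_TABLE = {"Иван": "Ivan", "Петров": "Petrov"}


def toy_transliterate(name: str) -> str:
    return TOY_TABLE.get(name, name)


doc = build_name_variants("Иван", "Петров", toy_transliterate)
```

Because the transliterator is injected, the indexing code can be tested without network access, but the dependency itself does not go away.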

The second problem is easy to foresee if your native language is Russian, but it exists in other European languages as well: diminutive forms and abbreviations of names. If Ivan decided to call himself Vanya on Facebook, a query for Ivan will no longer find him, no matter how much you transliterate.

So the next complication was that we found an index of diminutive names on Gramota.ru (from Nikandr Aleksandrovich Petrovsky’s one-of-a-kind dictionary of Russian names), added it to the codebase as a hardcoded table (some two thousand lines), and began to index not only the name and its transliteration but also all the diminutive forms we found (fun fact: English has a term for them, hypocorisms). We took every word in the username and looked it up in our humble table.

A notarized screenshot of the Joom codebase. Circa 2018.
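For readers who prefer code to screenshots, here is a small sketch of the word-by-word lookup described above. It is not the code from the screenshot: the two name groups are a tiny invented excerpt, and I assume every form in a group should be indexed together so that the search works in both directions.

```python
from typing import Dict, List, Set

# Tiny illustrative excerpt; the real table had about two thousand lines.
NAME_GROUPS: List[Set[str]] = [
    {"ivan", "vanya"},
    {"aleksandr", "sasha", "shura"},
]

# Map every form to the whole group it belongs to.
RELATED: Dict[str, Set[str]] = {
    form: group for group in NAME_GROUPS for form in group
}


def index_terms(username: str) -> Set[str]:
    """Return each word of the username plus all related name forms."""
    terms: Set[str] = set()
    for word in username.lower().split():
        terms.add(word)
        terms |= RELATED.get(word, set())
    return terms
```

With this, a user who signed up as Vanya Petrov is also indexed under ivan, so a query for Ivan can find him.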

But then, so as not to slight the other half of our users, spread in an uneven layer across the non-Russian-speaking world, we put out a call to the Joom country managers and asked them to find us reference lists of name abbreviations for their countries. If not academic ones, then at least something. And it turned out that some languages, in addition to the tradition of compound names (Juan Carlos, Maria Aurora), also contract two, three, or even four words into one (María de las Nieves → Marinieves).

This new circumstance deprived us of the ability to look up one word at a time. Now we had to split the sequence of words into fragments of arbitrary length, and, moreover, different splits can lead to different abbreviations! We didn’t want to dive into the depths of linguistics and write an artificial intelligence that would abbreviate a Spanish name the way a living Spaniard would, so we sketched out, may Knuth forgive us, a combinatorial brute-force search.

And, as always happens with combinatorial searches, it blew up on one of the users, and we urgently had to add a limit on the maximum number of generated spellings. This further complicated code that was already unexpectedly complex for such a task.
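To give a feel for that brute force, here is a rough sketch under my own assumptions: the abbreviation table holds only the example from the text, a single word may always stay as it is, and a longer fragment is usable only if the table knows it. The generator is lazy, so the limit stops the enumeration early instead of trimming a finished list.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple

# Illustrative table: a fragment of several words maps to its short forms.
ABBREVIATIONS: Dict[Tuple[str, ...], List[str]] = {
    ("maría", "de", "las", "nieves"): ["marinieves"],
}


def spellings(words: Tuple[str, ...]) -> Iterator[Tuple[str, ...]]:
    """Yield every spelling obtained by abbreviating contiguous fragments."""
    if not words:
        yield ()
        return
    for size in range(1, len(words) + 1):
        head, tail = words[:size], words[size:]
        options = list(ABBREVIATIONS.get(head, []))
        if size == 1:
            # A single word may always be kept unchanged.
            options.insert(0, head[0])
        for option in options:
            for rest in spellings(tail):
                yield (option,) + rest


def limited_spellings(name: str, limit: int = 50) -> List[str]:
    """Generate at most `limit` spellings so one long name cannot blow up."""
    lazy = spellings(tuple(name.lower().split()))
    return [" ".join(parts) for parts in islice(lazy, limit)]
```

For María de las Nieves García this produces the original spelling and marinieves garcía; the default limit of 50 is an arbitrary number chosen for the example.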

Machine translation of products

The task. Translate the product names and descriptions that sellers provide in English into the user’s language.

Everyone has probably seen memes about clumsy translations of Chinese product names. We had seen them too, but the desired time to market did not let us come up with anything better than using an existing translation API.

It is easy to write an HTTP client, create an account, and translate a product into the device language when it is served to the user. But translations are not cheap, and it would be wasteful to translate the same popular product into Russian for each of its tens of thousands of views. So we added caching: for each product we saved translations to the database, and if a translation was already there, we no longer called the translator.
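The first version of that cache can be pictured like this. It is a sketch with invented names: a dictionary stands in for the database, and fake_translate stands in for the paid translation API.

```python
from typing import Callable, Dict, List, Tuple


class CachedTranslator:
    """Translate a product once per language and reuse the stored result."""

    def __init__(self, translate: Callable[[str, str], str]) -> None:
        self._translate = translate
        # In-memory stand-in for the database table of translations.
        self._cache: Dict[Tuple[str, str], str] = {}

    def product_text(self, product_id: str, text: str, lang: str) -> str:
        key = (product_id, lang)
        if key not in self._cache:
            self._cache[key] = self._translate(text, lang)
        return self._cache[key]


calls: List[Tuple[str, str]] = []


def fake_translate(text: str, lang: str) -> str:
    """Record each paid call and return a recognizable fake translation."""
    calls.append((text, lang))
    return f"[{lang}] {text}"


translator = CachedTranslator(fake_translate)
translator.product_text("p1", "Red dress", "ru")
translator.product_text("p1", "Red dress", "ru")  # served from the cache
```

After the two views above, the paid translator has been called only once.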

But there was still room for savings. We decided that a reasonable compromise between translation quality and price would be to split descriptions into sentences and cache those; after all, the same boilerplate phrases often appear across products, and it is wasteful to translate them every time. So our translator gained one more layer of abstraction: a layer that splits text into fragments, sitting between the HTTP client and the cache that stores whole products in different languages.
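A sketch of such a sentence-level layer, assuming a deliberately naive splitter (break after a period, exclamation mark, or question mark followed by whitespace). The class and function names are made up for the example.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class SentenceCache:
    """Translate text sentence by sentence, paying for each sentence once."""

    def __init__(self, translate: Callable[[str, str], str]) -> None:
        self._translate = translate
        self._cache: Dict[Tuple[str, str], str] = {}
        self.paid_calls = 0

    def translate_text(self, text: str, lang: str) -> str:
        translated = []
        for sentence in split_sentences(text):
            key = (sentence, lang)
            if key not in self._cache:
                self.paid_calls += 1
                self._cache[key] = self._translate(sentence, lang)
            translated.append(self._cache[key])
        return " ".join(translated)
```

Two descriptions that share a phrase like "Free shipping." now cost three paid sentences instead of four. The price is that each sentence is translated without its neighbors as context, which is where the quality compromise comes from.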

After the launch, the quality of translations kept bothering us, of course, and we thought: what if we used a more expensive translator? But would it do well on our specific texts? You can’t compare them by eye; you need an A/B test. So our translation cache gained a translator ID alongside the product ID, and we began to request the translation from whichever translator matched the user’s A/B test group.

The expensive translator performed well, but it was still too costly to run it on all products. However, we were entering countries whose national languages our main translator handled so poorly that we were ready to pay up for the sake of a successful launch, so the logic of choosing a translator became more complicated.

Then we decided that some stores on the platform are so good, and the platform is rooting for their success so much, that it is always willing to translate their products with the more expensive translator. So the choice of translator began to depend on the user, the country, and the store ID.
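Put together, the selection rules from the last few paragraphs might look something like the sketch below. Everything in it is hypothetical: the translator IDs, the A/B group name, the order in which the rules are checked, and the shape of the cache key.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

MAIN, PREMIUM = "main", "premium"  # hypothetical translator IDs


@dataclass(frozen=True)
class TranslatorPolicy:
    premium_countries: FrozenSet[str]  # launches where the main translator is weak
    premium_stores: FrozenSet[str]     # stores the platform is rooting for

    def choose(self, ab_group: str, country: str, store_id: str) -> str:
        """Pick a translator ID from the user's A/B group, country, and store."""
        if store_id in self.premium_stores:
            return PREMIUM
        if country in self.premium_countries:
            return PREMIUM
        if ab_group == "premium_translator":
            return PREMIUM
        return MAIN


def cache_key(product_id: str, lang: str, translator_id: str) -> Tuple[str, str, str]:
    """The translator ID is part of the key, so results never get mixed up."""
    return (product_id, lang, translator_id)
```

Each rule is trivial on its own; the complexity comes from the fact that every caller now has to know the user, the country, and the store before it can even look into the cache.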

Finally, we decided that over the years of Joom’s existence our main translator might have improved, and perhaps it made sense to refresh the translation cache at some interval. But how could we do that without an A/B test? So a freshness field appeared in our cache, and things got complicated again. As a result, our translation component is incredibly complex, and that is despite the fact that we have not yet bolted any homemade computational linguistics onto it. For now.
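As an illustration of the freshness idea, here is one possible shape for it. I assume the field is a timestamp and that the A/B test simply switches the refresh on or off; the names and the mechanics are guesses made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheEntry:
    text: str
    translator_id: str
    translated_at: datetime  # the assumed "freshness" field


def needs_refresh(
    entry: CacheEntry, now: datetime, max_age: timedelta, refresh_enabled: bool
) -> bool:
    """Re-translate only for users in the refresh group and only stale entries."""
    return refresh_enabled and now - entry.translated_at > max_age
```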

Converting clothing sizes

Perhaps one of the most painful problems when buying clothes and shoes online is choosing the right size. When delivering from local warehouses, players like Lamoda can simply bring several sizes at once and take the unsuitable ones back just as easily, but this does not work in cross-border trade. Parcels take a long time, every extra kilogram is expensive, and the senders are not set up for a large flow of incoming mail.

On top of that, sellers from different countries may have completely different ideas about sizes. A Chinese M can easily turn out to be a Russian XS, and a terrifying 9XL may not differ much from an XXL. Savvy users have to rely on measurements, but even those are not always reliable: for example, the user expects the girth of a person’s chest to be listed, while the seller lists the measurements of the garment itself, and the two differ by five to ten percent. We do not want users to have to go to so much trouble to shop on Joom!

The task. Instead of the sizes provided by sellers, show users sizes that we calculate from a single table based on girths.

Okay. We take the size table, which we parse from the product description (this is done by a separate 5k-line spaceship of a parser) and store in a separate field, and replace the sizes in it with the calculated ones. We hardcode a girth-to-size conversion table found on the Internet and enjoy life.
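The hardcoded conversion could be as simple as the sketch below. The boundaries are numbers I made up for the example, not the table we actually used; each entry is the largest chest girth in centimeters that still fits the size.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# (upper chest girth bound in cm, size label); values invented for illustration.
SIZE_TABLE: List[Tuple[float, str]] = [
    (88, "XS"),
    (92, "S"),
    (100, "M"),
    (108, "L"),
    (116, "XL"),
]
BOUNDS = [bound for bound, _ in SIZE_TABLE]


def size_for_girth(chest_cm: float) -> Optional[str]:
    """Return the first size whose upper bound covers the girth, if any."""
    index = bisect_left(BOUNDS, chest_cm)
    return SIZE_TABLE[index][1] if index < len(SIZE_TABLE) else None
```

A girth above the last bound returns None, which is one more quiet way for the conversion to do nothing.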

But if there is no table, or it does not have enough rows, this does not work. That is implicit feature shutdown number one.

Hmm, the table contains girths of the human body, while most sellers list measurements taken on the garments themselves. We add a difference coefficient. Product manager Rodion, the happy owner of a perfect size M, goes to the mall, measures a bunch of different items on himself, and comes back with coefficients: they are similar, but differ significantly across product categories. For a tight-fitting turtleneck the difference is almost 0%, and for a sweater it is a full 10%. Outerwear also varies in fit (slim fit, normal fit, loose fit), which gives a swing of ±5%. Now our coefficient (immortalized by me in the code as the Rodion coefficient) consists of two factors.
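In code, a two-factor coefficient of this kind might look roughly as follows. The category values come from the numbers above; how exactly the ±5% fit swing is applied, and the idea of dividing the garment measurement by the product of the two factors, are my assumptions for the sketch.

```python
from typing import Dict, Optional

# How much larger the garment is than the body, per category (from the text).
CATEGORY_FACTOR: Dict[str, float] = {"turtleneck": 1.00, "sweater": 1.10}

# Assumed reading of the ±5% swing for slim, normal, and loose fit.
FIT_FACTOR: Dict[str, float] = {"slim": 0.95, "normal": 1.00, "loose": 1.05}


def body_girth(
    garment_girth_cm: float, category: str, fit: str = "normal"
) -> Optional[float]:
    """Estimate body girth from a garment measurement, or None if unsupported."""
    if category not in CATEGORY_FACTOR or fit not in FIT_FACTOR:
        return None  # category or fit was never measured: conversion is off
    coefficient = CATEGORY_FACTOR[category] * FIT_FACTOR[fit]
    return garment_girth_cm / coefficient
```

Under these assumptions, a sweater measuring 110 cm across the garment maps to a body girth of about 100 cm.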

To determine the fit, we write another parser that tries to extract it from the product name or description. If the product does not fall into one of the categories Rodion checked, that is implicit feature shutdown number two.

The final touch: many products list the bust from armpit to armpit, that is, only half the girth, which results in ridiculously small sizes. We add logic saying that if the girth is less than X, then it cannot be right, this is clearly half the girth, and we multiply it by two. Fortunately, adults’ chest girths rarely differ by a factor of two.
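The half-girth heuristic fits in a few lines. The threshold below is a placeholder for X, picked only to make the example concrete.

```python
MIN_PLAUSIBLE_CHEST_CM = 60.0  # hypothetical value for the threshold X


def normalize_chest(value_cm: float) -> float:
    """Treat implausibly small values as armpit-to-armpit and double them."""
    if value_cm < MIN_PLAUSIBLE_CHEST_CM:
        return value_cm * 2
    return value_cm
```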

Now everything is so complicated that when you test the feature by looking at a product in the admin panel, it is impossible to understand why it did not turn on or why it worked one way rather than another. We add a large layer of logic to the code that logs in detail the reasons for turning the conversion off. To be able to fully trace the cause of a shutdown on a specific product, error messages have to be passed upward several times, enriched with details along the way. The code becomes terrifying.
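Here is a compact sketch of that kind of upward forwarding, using exceptions that are re-raised with more context at each level. The exception class, the checks, and the product fields are all invented for the illustration.

```python
from typing import Dict, List, Optional

SUPPORTED_CATEGORIES = {"sweater", "turtleneck"}  # hypothetical


class ConversionSkipped(Exception):
    """Size conversion was turned off; the message says why."""


def check_row(row: Dict[str, float], index: int) -> None:
    if "chest" not in row:
        raise ConversionSkipped(f"row {index}: no chest girth")


def check_table(rows: List[Dict[str, float]]) -> None:
    if not rows:
        raise ConversionSkipped("size table is empty")
    for index, row in enumerate(rows):
        check_row(row, index)


def skip_reason(product: dict) -> Optional[str]:
    """Return None if conversion can run, otherwise the full reason chain."""
    try:
        if product["category"] not in SUPPORTED_CATEGORIES:
            raise ConversionSkipped(f"category {product['category']!r} not measured")
        try:
            check_table(product["size_table"])
        except ConversionSkipped as exc:
            # Enrich the low-level message with context before passing it up.
            raise ConversionSkipped(f"size table: {exc}") from exc
    except ConversionSkipped as exc:
        return f"product {product['id']}: {exc}"
    return None
```

For a product whose second row lacks a chest girth, the admin panel can then show "product p1: size table: row 1: no chest girth" instead of a silent no-op. Multiply this by every parser and coefficient above, and the terror becomes easy to imagine.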

And all of it works differently depending on the A/B test group, of course.

Conclusion

Beware of ~~Greeks bearing gifts~~ developers who are optimistic about deadlines. Estimating development time is very hard, no matter how simple the task may sound, and surprises await us at every step!