How our PIM system works, and why we don't use React or microservices

General overview of the architecture

In this article I will not consider organizing code at the assembly or class level; we will talk about higher-level things: services, web applications, APIs, and so on.

All of this is needed for the system to work

I have identified nine main subsystems by the tasks they perform and combined them into five groups. Some of them can scale horizontally almost without limit; others exist in a single copy and cannot run in parallel, but, fortunately, they don't need to (we will never have a billion users, for example). The groups are:

  1. Interaction with users and the API. I combined these because essentially it is one web project that renders pages and exposes an API. I also included the CDN here.

  2. Services for processing long-running tasks. For example, a user uploads a file with a million records for automatic matching, and processing it takes 5-10 minutes. These services handle this type of task and can be scaled horizontally with almost no restrictions.

  3. Data collection. PIM is designed to work with product information. Sometimes, when the system is implemented by a manufacturer, this information is unique and is essentially created or organized inside the PIM system. In other cases it is not unique and already exists somewhere on the Internet, so we can help with product characteristics. For this we built a parsing system; by now it has collected information on 73 million products along with their images and characteristics.

  4. Data storage. Essentially, servers with databases. Each client catalog is stored in a separate database, which allows storage to scale horizontally with almost no restrictions. The only limitation is that a single catalog cannot be spread across different servers.

  5. Auxiliary services. For example, a service that deletes old files uploaded to the system, the search index, and the file storage service.

Interaction with users (interface)

Here the situation is both quite standard and quite non-standard. Standard – because we use .NET 8 (I'm completely lost in the version numbering; in short, it's the latest version of Microsoft's web framework), a fairly popular framework. Non-standard – because, contrary to modern trends, our pages are rendered on the server, and client-side scripts merely enhance their behavior. The third-party dependencies boil down to jQuery with a few plugins.

An example of what a product page looks like

What does this give us? We have a small team of four people, and everyone can do everything. You don't need to be a React expert if you don't use React on the project. The layout broke – fixing it takes about five minutes. You need to add data to a page – that takes the same proverbial five minutes.

Debugging is also as simple as it gets. The entire page-rendering process can be traced step by step in a single window.

Another advantage is that any task can be done by a single person; developers do not need to coordinate changes to the database structure, the API, and the frontend, as happens when each team member specializes in one area – the database, the server, or the frontend.

This approach has its downsides too, but we have not grown into them yet. They will catch up with us when we have thousands of clients, tens of thousands of users, and a development team of 30 people – and that is still a long way off.

The same framework is used to develop the API. For documentation and interactive debugging we use Swagger. The documentation is generated from comments in the code. By the way, I highly recommend it to everyone who for some reason has not tried it yet.
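For those who haven't tried it, a minimal sketch of wiring up Swagger (Swashbuckle) with XML-comment-based documentation in an ASP.NET Core project might look roughly like this; this is a generic illustration rather than our actual setup, and it assumes the Swashbuckle.AspNetCore package:

```csharp
// Program.cs – minimal Swagger setup with XML comments (illustrative sketch).
// Also requires <GenerateDocumentationFile>true</GenerateDocumentationFile> in the .csproj
// so the compiler emits the XML file containing the /// comments.
using System.Reflection;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen(options =>
{
    // Pull the /// summaries from the generated XML file into the Swagger UI.
    var xmlFile = $"{Assembly.GetExecutingAssembly().GetName().Name}.xml";
    options.IncludeXmlComments(Path.Combine(AppContext.BaseDirectory, xmlFile));
});

var app = builder.Build();
app.UseSwagger();     // serves the OpenAPI document
app.UseSwaggerUI();   // serves the interactive UI for debugging endpoints
app.MapControllers();
app.Run();
```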

But in addition to the API documentation, we also need a knowledge base for users. Usually Confluence or some other wiki engine is used for this, but we took a different route and store all the documentation as Markdown files in the same project repository. It is quite easy to assemble a wiki of sorts from them: the Markdown is converted to HTML, and navigation, a table of contents, and so on are added (a rough sketch of such a conversion is shown a bit further below).

This approach solves several important problems:

  1. Documentation is tied to the code. When functionality is updated in production, the documentation is updated along with it.

  2. All of Git's features are available for history and version control.

  3. Features and their documentation are developed in the same environment; there is no need to switch to another system and drop out of the flow.

  4. There is no third-party system or service to administer.

  5. It is easy to reference the documentation from the code and to link to the relevant documentation pages from the service's pages.

There are also disadvantages: maintaining a large knowledge base with many cross-references and images will be labor-intensive and inconvenient. But that is a problem for later – we have not run into it yet.
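For illustration, converting a folder of Markdown files into simple HTML pages takes only a few lines. This is a hedged sketch using the Markdig library, not our actual build step, and the folder names are made up:

```csharp
// Illustrative sketch: render every .md file in the docs folder to an .html page.
// Assumes the Markdig NuGet package; "docs" and "wiki" are placeholder paths.
using Markdig;

var pipeline = new MarkdownPipelineBuilder()
    .UseAdvancedExtensions()   // tables, auto-identifiers for headings, etc.
    .Build();

Directory.CreateDirectory("wiki");
foreach (var mdPath in Directory.EnumerateFiles("docs", "*.md", SearchOption.AllDirectories))
{
    var html = Markdown.ToHtml(File.ReadAllText(mdPath), pipeline);
    var target = Path.Combine("wiki", Path.GetFileNameWithoutExtension(mdPath) + ".html");
    // A real build step would also wrap this in a layout and generate navigation / a table of contents.
    File.WriteAllText(target, html);
}
```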

In the same group I placed the load balancer and the CDN; in our case these are third-party services configured to work with our system.

Services for processing long-running tasks

First, a little context: our system can work with catalogs of several million products. Tasks often arise that affect many products at once, for example, matching records from a file uploaded by a user against the catalog based on fuzzy characteristics. The file, in turn, can contain a comparable number of lines. The algorithm that performs such matching needs to build various data structures in RAM that are optimized for this task. I have written about this in detail before.

Example of what the matching algorithm does

These services also handle tasks such as generating large files, interacting with third-party services, and so on. For example, they have to check configured mailboxes for new mail, send mailings, and update data from marketplace APIs.

Reading everything from the database, recalculating the statistical characteristics of the tokens, and building the in-memory structures takes too long to do every time they are needed. So we try not to evict them from memory while there is enough of it, and when free memory runs low, we evict whatever has not been used for a long time. That is easy when there is only one server.
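As a rough illustration of that idea (keep the heavy search structures in memory, evict the least recently used ones when memory gets tight), a minimal sketch could look like this; the real service tracks actual free memory rather than a fixed entry count, and the type names here are invented:

```csharp
// Illustrative sketch of "keep while memory allows, evict least recently used".
// The object stored per catalog stands in for the heavy in-memory search structures.
using System;
using System.Collections.Concurrent;
using System.Linq;

public sealed class IndexCache
{
    private readonly ConcurrentDictionary<int, (object Index, DateTime LastUsed)> _items = new();
    private readonly int _maxItems;   // stand-in for a real free-memory check

    public IndexCache(int maxItems) => _maxItems = maxItems;

    public object GetOrLoad(int catalogId, Func<int, object> load)
    {
        var entry = _items.AddOrUpdate(
            catalogId,
            id => (load(id), DateTime.UtcNow),                    // build the structures once
            (_, existing) => (existing.Index, DateTime.UtcNow));  // otherwise just refresh the timestamp

        if (_items.Count > _maxItems)
            EvictOldest();

        return entry.Index;
    }

    private void EvictOldest()
    {
        // Drop whatever has not been used for the longest time.
        var oldest = _items.OrderBy(kv => kv.Value.LastUsed).First();
        _items.TryRemove(oldest.Key, out _);
    }
}
```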

On the other hand, we want to be able to scale this service horizontally, and that process should be as simple as possible. If one of the servers becomes unavailable, the other servers should take over its tasks without any human involvement. In both cases, after a server is added or removed, memory should be used rationally, that is, each search structure should be loaded into memory in exactly one copy. On top of that, it would be great to get by with minimal recalculation of in-memory data, i.e. redistributing tasks between servers should leave as much as possible "as is".

That is what these services do – process long-running tasks and scale cleverly. For this kind of scaling we designed and implemented our own algorithm that satisfies all of these conditions, but that is a topic for a separate article.
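Our algorithm is outside the scope of this article, but to illustrate the property being described (adding or removing a server moves as few catalogs as possible to a new owner), here is a sketch of rendezvous (highest-random-weight) hashing. To be clear: this is a textbook technique shown for illustration only, not the algorithm we actually use:

```csharp
// Illustration only: rendezvous hashing assigns each catalog to exactly one server,
// and when a server joins or leaves, only the catalogs that server wins (or owned) move.
using System.Collections.Generic;

public static class CatalogPlacement
{
    public static string PickServer(int catalogId, IReadOnlyList<string> servers)
    {
        var best = servers[0];
        ulong bestScore = 0;
        foreach (var server in servers)
        {
            var score = Hash($"{server}:{catalogId}");
            if (score >= bestScore) { bestScore = score; best = server; }
        }
        return best;
    }

    // FNV-1a; any stable hash works for the illustration.
    private static ulong Hash(string key)
    {
        ulong hash = 14695981039346656037;
        foreach (char c in key)
        {
            hash ^= c;
            hash *= 1099511628211;
        }
        return hash;
    }
}
```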

I think you have already guessed that we use .NET as the main technology here. And, to be honest, I don't see any alternatives.

The first reason is predictable memory usage and predictable, controllable execution speed. By that I mean we can say quite precisely what is located in memory and how, and how much memory a particular structure or class will take. On top of that, the .NET runtime itself is extremely fast.

Second, good and convenient profiling tools (a subscription to dotTrace and dotMemory at about $12 a month pays for itself many times over).

Third, the ability to use unsafe code in bottlenecks. For example, we need to process images in a certain way. We write the first version on GDI+; once we see that the algorithm does what is needed, but slowly, we rewrite it as unsafe code and get a speedup of tens of times.
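To give a feel for what "rewrite it as unsafe code" means in practice, here is a hedged sketch of the classic pattern on GDI+: lock the bitmap and walk the raw pixel memory with a pointer instead of calling GetPixel/SetPixel. The grayscale conversion is just a stand-in for whatever processing is actually needed:

```csharp
// Illustrative sketch: process a 24bpp bitmap via raw pointers (requires <AllowUnsafeBlocks>).
// Doing the same with GetPixel/SetPixel is typically tens of times slower.
using System.Drawing;
using System.Drawing.Imaging;

public static class FastImageOps
{
    public static unsafe void ToGrayscale(Bitmap bitmap)
    {
        var rect = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
        var data = bitmap.LockBits(rect, ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
        try
        {
            for (int y = 0; y < data.Height; y++)
            {
                byte* row = (byte*)data.Scan0 + y * data.Stride;
                for (int x = 0; x < data.Width; x++)
                {
                    byte* px = row + x * 3;   // bytes are B, G, R
                    byte gray = (byte)((px[2] * 299 + px[1] * 587 + px[0] * 114) / 1000);
                    px[0] = px[1] = px[2] = gray;
                }
            }
        }
        finally
        {
            bitmap.UnlockBits(data);
        }
    }
}
```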

Fourth, the C# language is very pleasant to use. In my subjective opinion.

And finally, .NET lets us develop the web application, the API, and the services equally conveniently, using the same language and development environment. For a small team this is important.

Background service

In essence, it is very similar to the services from the previous section, but it does not scale.

It is responsible for keeping the project in working order: it deletes old files, monitors index fragmentation in the databases and triggers a rebuild when needed, creates tasks for the services from the previous section according to the schedules configured by users, and so on.
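In .NET this kind of service maps naturally onto a hosted BackgroundService loop. A minimal sketch of the shape (the individual maintenance steps are placeholders, not our actual jobs):

```csharp
// Illustrative sketch of a maintenance loop built on BackgroundService.
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class MaintenanceService : BackgroundService
{
    private readonly ILogger<MaintenanceService> _logger;

    public MaintenanceService(ILogger<MaintenanceService> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                // Placeholder steps: delete stale files, check index fragmentation,
                // enqueue user-scheduled tasks for the worker services, and so on.
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Maintenance iteration failed");
            }

            await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken);
        }
    }
}

// Registration: builder.Services.AddHostedService<MaintenanceService>();
```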

Data collection

I have already written about this part in detail in the article about how we get around Cloudflare and other services that automatically screen out robots. Let me recap briefly.

We have a service that crawls the sites from our list for which we have implemented parsers. It loads page after page and tries to extract the information we need: the product name, its price, characteristics, and pictures. If something on the page is loaded asynchronously, our service loads that data as well. All of this happens without launching a real browser; our code does it using the AngleSharp library (which I recommend).
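A hedged sketch of what parsing a page with AngleSharp looks like; the URL and CSS selectors here are invented, since in reality every site gets its own parser:

```csharp
// Illustrative sketch: fetch a page and pull out product fields with AngleSharp.
using System.Linq;
using AngleSharp;

var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
var document = await context.OpenAsync("https://example.com/product/12345");

var name  = document.QuerySelector("h1.product-title")?.TextContent.Trim();
var price = document.QuerySelector(".price")?.TextContent.Trim();
var specs = document.QuerySelectorAll("table.specs tr")
    .Select(row => (
        Key:   row.QuerySelector("th")?.TextContent.Trim(),
        Value: row.QuerySelector("td")?.TextContent.Trim()))
    .ToList();

Console.WriteLine($"{name} – {price}, {specs.Count} characteristics");
```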

But not everyone likes having their site crawled by robots, so IP checks and automatic captchas are often used to filter out automated requests. The IP check typically blocks requests coming from hosting providers and allows requests from ISPs. To pass it, you need residential proxy servers (residential here means the IP address belongs to an ISP's subnet). And to pass the captcha, you need a real browser running in graphical mode, automated with an extension we developed. All of this is described in detail in the article mentioned above; there is also a link to the Git repository – we have made the extension and service code publicly available for everyone.

In addition, the pictures need to be downloaded, processed, and saved. Processing includes deduplication, computing perceptual hashes, and related tasks. Perceptual hashes are needed to identify images with similar content.
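For readers unfamiliar with perceptual hashes, the simplest variant (average hash) captures the idea: shrink the image to 8×8 grayscale, compare each pixel to the mean, pack the bits into a 64-bit value, and treat a small Hamming distance between two hashes as "visually similar". This is a sketch of that textbook variant, not our production implementation:

```csharp
// Illustrative average hash (aHash): similar images yield hashes with a small Hamming distance.
using System.Drawing;

public static class PerceptualHash
{
    public static ulong AverageHash(Bitmap source)
    {
        using var small = new Bitmap(source, new Size(8, 8));   // shrink: details vanish, structure stays

        var gray = new double[64];
        double sum = 0;
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++)
            {
                var p = small.GetPixel(x, y);
                gray[y * 8 + x] = 0.299 * p.R + 0.587 * p.G + 0.114 * p.B;
                sum += gray[y * 8 + x];
            }

        double mean = sum / 64;
        ulong hash = 0;
        for (int i = 0; i < 64; i++)
            if (gray[i] >= mean)
                hash |= 1UL << i;   // one bit per pixel: above or below the mean
        return hash;
    }

    public static int HammingDistance(ulong a, ulong b) =>
        System.Numerics.BitOperations.PopCount(a ^ b);
}
```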

We collect the data itself and process the images on cheap virtual machines at 4 euros apiece per month, while captcha solving is handled by more expensive virtual machines working together with a pool of proxy servers.

This subsystem scales well horizontally: you just add a new server and it takes on part of the work. It is built on an extremely simple principle: each server works independently and simply takes a small, random portion of work from the queue at regular intervals. More servers means the queue moves faster and the information is refreshed faster.
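A hedged sketch of what "taking a small, random portion of work from the queue" can look like on MS SQL Server: claim a batch atomically and skip rows already locked by other servers. The table, column names, and connection string are made up:

```csharp
// Illustrative sketch: each crawler instance claims a random batch from a shared queue table.
// READPAST skips rows currently locked by other servers; names are placeholders.
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

var connectionString = "Server=...;Database=Crawler;...";   // placeholder

const string claimSql = @"
WITH batch AS (
    SELECT TOP (@batchSize) Id, Url
    FROM dbo.CrawlQueue WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE ClaimedAtUtc IS NULL
    ORDER BY NEWID()                 -- the 'random small portion'
)
UPDATE batch
SET ClaimedAtUtc = SYSUTCDATETIME()
OUTPUT inserted.Id, inserted.Url;";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

await using var command = new SqlCommand(claimSql, connection);
command.Parameters.AddWithValue("@batchSize", 50);

var work = new List<(long Id, string Url)>();
await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
    work.Add((reader.GetInt64(0), reader.GetString(1)));
// ...crawl the claimed URLs, then mark them done or release them on failure.
```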

The database is not a bottleneck in this process; work with it is reduced to simple read queries (usually against a single table, filtered by primary key) and batch or bulk operations for updating and inserting data.

Data storage

A separate database is created for each catalog. Another one for each company account – it stores the data of the private library. Another one for the whole service holds the data of the shared library. And one more holds information outside the catalog context: users, sessions, per-user column visibility settings for tables, and so on.

Thanks to this approach, data storage scales easily horizontally; we only need to remember which server hosts the database of each catalog.
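In practice that boils down to a small routing step in front of every data access: look up which server holds the catalog's database and build the connection string from that. A hedged sketch with invented names:

```csharp
// Illustrative sketch: resolve the connection string for a catalog.
// The catalog-to-server mapping and the naming scheme are placeholders.
using System.Collections.Generic;

public sealed record CatalogLocation(string Server, string DatabaseName);

public sealed class CatalogConnectionResolver
{
    private readonly IReadOnlyDictionary<int, CatalogLocation> _locations;

    public CatalogConnectionResolver(IReadOnlyDictionary<int, CatalogLocation> locations)
        => _locations = locations;   // e.g. loaded from a small central registry database

    public string GetConnectionString(int catalogId)
    {
        var location = _locations[catalogId];
        return $"Server={location.Server};Database={location.DatabaseName};" +
               "Integrated Security=true;TrustServerCertificate=true";
    }
}
```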

We use Microsoft SQL Server as the DBMS, with no alternatives considered. But that was not always the case: at first it was PostgreSQL, but at some point we decided to compare performance – fortunately, at that time we did not yet have an abundance of hand-optimized queries, and almost everything went through Entity Framework and the queries it generates. Simply switching to Microsoft SQL Server sped up data access in our scenarios by about one and a half times.

The second argument was that we needed bulk operations to quickly insert a large number of rows at a time. A ready-made .NET driver for this existed for MS SQL Server, but not for PostgreSQL. Writing one ourselves – I'm not sure we could have pulled it off, and I am sure the time spent would never have paid off. In fact, this was the most critical point, and it was the reason we started the migration.
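The "ready-made driver" here is SqlBulkCopy from the SQL Server client library. A minimal, hedged example of pushing a batch of rows; the destination table and columns are placeholders:

```csharp
// Illustrative sketch: bulk-insert a batch of rows with SqlBulkCopy.
using System.Data;
using Microsoft.Data.SqlClient;

public static class ProductBulkLoader
{
    public static async Task BulkInsertAsync(string connectionString, DataTable products)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var bulk = new SqlBulkCopy(connection)
        {
            DestinationTableName = "dbo.Products",
            BatchSize = 5000
        };
        bulk.ColumnMappings.Add("Name", "Name");
        bulk.ColumnMappings.Add("Price", "Price");

        // One streaming operation instead of millions of individual INSERTs.
        await bulk.WriteToServerAsync(products);
    }
}
```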

And the third point: working in SQL Server Management Studio and SQL Server Profiler was much more pleasant than in the comparable tools for PostgreSQL.

In short, we switched the DBMS fairly quickly, thanks to the ORM. Who said that nobody ever does this and that it is a dubious benefit of an ORM? Although, to be honest, we could afford it then; now we no longer can – too many engine-specific features are in use.

I'm not saying PostgreSQL is bad – maybe we just don't know how to use it properly.

File storage

File operations are hidden behind an interface, and we don't care what is used to store the files. Implementations for the local file system and for the WebDAV and SMB/CIFS protocols are ready today. In principle, a proprietary API could also be plugged in.

Whatever is specified in app.config is what gets used. When developing on a local machine we usually use the local file system; in production we rent file storage as a service and work with it over WebDAV and SMB/CIFS. Why both at once – I won't go into the details, but in a nutshell, we have different file-handling scenarios, and in some places one protocol works better, in others the other.
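The shape of that abstraction is roughly as follows; this is a hedged sketch with illustrative names, and the real interface has more operations:

```csharp
// Illustrative sketch of a storage abstraction with a local-file-system implementation.
// The WebDAV and SMB/CIFS implementations plug into the same interface.
public interface IFileStorage
{
    Task SaveAsync(string path, Stream content, CancellationToken ct = default);
    Task<Stream> OpenReadAsync(string path, CancellationToken ct = default);
    Task DeleteAsync(string path, CancellationToken ct = default);
}

public sealed class LocalFileStorage : IFileStorage
{
    private readonly string _root;
    public LocalFileStorage(string root) => _root = root;

    public async Task SaveAsync(string path, Stream content, CancellationToken ct = default)
    {
        var fullPath = Path.Combine(_root, path);
        Directory.CreateDirectory(Path.GetDirectoryName(fullPath)!);
        await using var file = File.Create(fullPath);
        await content.CopyToAsync(file, ct);
    }

    public Task<Stream> OpenReadAsync(string path, CancellationToken ct = default) =>
        Task.FromResult<Stream>(File.OpenRead(Path.Combine(_root, path)));

    public Task DeleteAsync(string path, CancellationToken ct = default)
    {
        File.Delete(Path.Combine(_root, path));
        return Task.CompletedTask;
    }
}

// At startup the concrete implementation is chosen from configuration (app.config in our case).
```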

Search engine

I have already written about it in detail. We developed it ourselves and optimized it specifically for our tasks. Now, two years later, I can say that the approach has fully justified itself: fuzzy search across 73 million products works almost instantly (tens of milliseconds), and the server's uptime, it seems, has exceeded a year.

You can try it out without registering: soft toy goose 160cm

The implementation took one day; since then only a few small changes have been made, which took another one or two days in total.
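Just to convey the general idea of that kind of fuzzy search (this is a toy illustration, not our actual engine – that is covered in the article linked above), an inverted index over character trigrams is already enough: products sharing many trigrams with the query rank highest.

```csharp
// Toy illustration of trigram-based fuzzy search; not the real engine described in the article.
using System.Collections.Generic;
using System.Linq;

public sealed class TrigramIndex
{
    private readonly Dictionary<string, List<int>> _index = new();   // trigram -> product ids
    private readonly Dictionary<int, string> _names = new();

    public void Add(int productId, string name)
    {
        _names[productId] = name;
        foreach (var trigram in Trigrams(name))
        {
            if (!_index.TryGetValue(trigram, out var ids))
                _index[trigram] = ids = new List<int>();
            ids.Add(productId);
        }
    }

    public IEnumerable<(int Id, string Name, int Score)> Search(string query, int top = 10) =>
        Trigrams(query)
            .Where(_index.ContainsKey)
            .SelectMany(t => _index[t])
            .GroupBy(id => id)
            .Select(g => (Id: g.Key, Name: _names[g.Key], Score: g.Count()))
            .OrderByDescending(x => x.Score)
            .Take(top);

    private static IEnumerable<string> Trigrams(string text)
    {
        var normalized = text.ToLowerInvariant();
        for (int i = 0; i + 3 <= normalized.Length; i++)
            yield return normalized.Substring(i, 3);
    }
}
```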

Integrations

This is the biggest pain and problem. A PIM system badly needs integrations with other systems, and among them are Ozon and Wildberries. These two ate up between a third and a half of the time spent developing the entire system. It is thanks to them that I learned to construct compound and complex sentences without a single printable word.

Logging

I chose the logging system back when I was still working on the project alone. I had several requirements for it in mind:

  1. Automatic merging of log entries that belong to the same context. In our case that could be the handling of an HTTP request, the execution of a long-running task, or one iteration of a background service (a sketch illustrating this follows below).

  2. Access via a web interface with search, filtering, and so on.

  3. Speed. Logging should not consume a lot of resources, and failures in logging should not degrade the responsiveness of the main system.

  4. Preferably a library, not a separate service.

In the end, I wrote the logging system myself. Since then we have only had to make minor changes to it a couple of times.
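The core trick behind requirement 1 – merging all entries that belong to one HTTP request, task, or iteration – can be illustrated with an ambient operation id carried in an AsyncLocal. This is a simplified sketch of the idea, not the actual library:

```csharp
// Illustrative sketch: every log entry written inside a scope carries the same operation id,
// so the viewer can group the entries of one HTTP request / task / iteration together.
public static class LogContext
{
    private static readonly AsyncLocal<Guid?> _operationId = new();

    public static IDisposable BeginScope()
    {
        _operationId.Value = Guid.NewGuid();
        return new Scope();
    }

    public static void Write(string message)
    {
        // A real implementation buffers entries and flushes them asynchronously,
        // so logging failures or slowness never stall the main code path.
        Console.WriteLine($"{DateTime.UtcNow:O} [{_operationId.Value}] {message}");
    }

    private sealed class Scope : IDisposable
    {
        public void Dispose() => _operationId.Value = null;
    }
}

// Usage:
// using (LogContext.BeginScope())
// {
//     LogContext.Write("task started");
//     LogContext.Write("task finished");
// }
```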

Conclusion

I have not touched on how the development process and team interaction are organized, nor on version updates and backups – I'll cover that another time.

The hardest thing is to understand what to build and how to sell it. At our stage of development, the most important thing is that the technology lets us "crank out features" quickly and adapt to the market. This can be approached in different ways; in our team we have settled on an approach where anyone can complete any task.

To make that possible, we had to simplify deploying and debugging the project as much as possible. Almost any scenario can be reproduced and stepped through without leaving the Visual Studio window. That is why we do not use microservices, message brokers, data buses, and other infrastructure dependencies: they get in the way of debugging, they break the flow, and everyone has to keep their known quirks in mind and investigate when they run into unknown ones.

To start working on the project from scratch on a new computer, all you need to do is install SQL Server, pull the latest code from Git, build the project, open the admin panel, and click the button that creates the databases the project needs.

Another, no less important, consequence is that the boxed version can be deployed quickly and easily, on almost any hardware and without dependencies on third-party services.

So far this approach has not hindered the development of the project. And when it does start to get in the way, I will be genuinely happy: it will mean that the more important questions – promotion, sales, and the financial model – have been resolved one way or another.
