Billions of combinations in 12 minutes

Hi! My name is Kirill Egorov, and I am the technical leader of two Avito units: “Spare Parts” and “Construction and Repair”. In this article, I talk about how we determine which cars the parts from our ads fit, how we use Golang to go through billions of options, and what difficulties we had to overcome while implementing this solution.

The task of the “Spare Parts” unit is to help the user easily and quickly find parts that will fit their car. The application has a built-in “Garage” section – you can add your car there and get search results with parts specifically for that car. You can also use the more familiar option with car filters within the desired category.

We know about:

  • >40,000 car modifications;

  • >40 million unique spare parts, not counting liquids and accessories;

  • >300 million crosses: links between analog and replacement parts;

  • >120 million original compatibility records: “part → car” links;

  • >8,000 brands with multiple spellings for each;

  • >100 related-brand bundles.

All this data needs to be sorted through to understand which parts fit which cars, and that’s billions of combinations.

The complexity of the world of auto parts is that not all parts have data on their applicability to a car. Compatibility has to be determined by indirect signs: through analogs, related brands and various tricks.

Metrics

We are a product team: we test hypotheses on data and closely monitor technical and product metrics. Every change in the compatibility processing logic is tested in a staging environment before going to production, to assess the quality of the results and collect metrics.

Important metrics we track in this case:

  • spare-part compatibility coverage – for what share of spare parts we know the suitable cars;

  • coverage quality – the percentage of errors in the reference data;

  • ad coverage – for what share of ads we know the suitable cars.

A dedicated team of parts experts helps us monitor data quality: they label data, build reference datasets, and manage brand relationships. If one of our sources suddenly reports diesel engine units as compatible with a petrol modification of a car, we will see it in the metrics and dashboards and will be able to quickly localize the problem and take action.

Logic for determining compatibility

We determine which vehicle modifications a specific part fits. Modification is the last of the four features in the chain: Brand → Model → Generation → Modification. This feature is determined by a combination of data on the engine, transmission, and drive type.

More details step by step:

  1. We get the spare-part identifier by brand and article number: this is how a unique SKU is determined in the world of spare parts. The article number is pre-normalized – spaces and insignificant characters are removed.

  2. We find the list of parts of related brands: we go through the list of related brands and check whether there is another part with the same article number but a different brand.

  3. For the found spare parts, we get a list of analogs and original replacements – crosses. We find all part1→part2 links where the original spare part is present on either side. These are first-level crosses.

    You can go further and take replacements of replacements, but with each level of depth the quality of the links deteriorates. Starting from a steering rack as the original article, at the 5th or 6th level of crosses you can end up with a nut from that rack.

    Crosses have their own rough classification: for example, crosses to the original, an assembly to its component and vice versa, original replacements. Different sources have different data quality and trust levels; we take all of this into account.

  4. For the parts obtained from crosses, we again need to find parts of related brands. Not all parts have compatibility data, but such information may exist for some of the related parts.

    Just as in step 2, for each part we determine the parts of related brands. This is done by enumerating the options.

  5. For the resulting list of spare parts, we get yet another list – compatibility with car modifications. Keep in mind that applicability data can exist both for original parts and for analog parts: these are different subsets with their own characteristics and different levels of quality.

One original spare part can lead to 1,000 variants for which compatible car modifications need to be found. The final number of combinations reaches a billion, and during the calculation significantly more are checked.
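
To make the flow concrete, here is a minimal Go sketch of these steps. The Part and Indexes types, their fields, and the lookup helpers are hypothetical placeholders for illustration, not our actual code:

// A minimal sketch of steps 1–5 above.
type Part struct {
	ID      int64
	BrandID int64
	Article string // normalized article number
}

type Indexes struct {
	BySKU         map[string][]Part // normalized article -> candidate parts
	RelatedBrands map[int64][]int64 // brandID -> related brand IDs
	Crosses       map[int64][]Part  // partID -> first-level analogs and replacements
	Compatibility map[int64][]int64 // partID -> compatible car modification IDs
}

// relatedBrandParts returns parts with the same article but a related brand (steps 2 and 4).
func (idx *Indexes) relatedBrandParts(p Part) []Part {
	var out []Part
	for _, cand := range idx.BySKU[p.Article] {
		if cand.ID == p.ID {
			continue
		}
		for _, b := range idx.RelatedBrands[p.BrandID] {
			if cand.BrandID == b {
				out = append(out, cand)
				break
			}
		}
	}
	return out
}

// CompatibleCars walks steps 1–5 for one original part and returns the set of
// car modification IDs that the part or its analogs are known to fit.
func (idx *Indexes) CompatibleCars(original Part) map[int64]struct{} {
	// Steps 1–2: the original part plus parts of related brands.
	candidates := append([]Part{original}, idx.relatedBrandParts(original)...)

	// Step 3: first-level crosses for every candidate.
	var crossed []Part
	for _, p := range candidates {
		crossed = append(crossed, idx.Crosses[p.ID]...)
	}

	// Step 4: related-brand parts for the crossed parts as well.
	all := append(candidates, crossed...)
	for _, p := range crossed {
		all = append(all, idx.relatedBrandParts(p)...)
	}

	// Step 5: collect every known compatible car modification.
	cars := make(map[int64]struct{})
	for _, p := range all {
		for _, carID := range idx.Compatibility[p.ID] {
			cars[carID] = struct{}{}
		}
	}
	return cars
}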

Key complexity of the case

The reference data receives millions of rows of updates daily. Each added brand, synonym, related brand, cross, or compatibility record affects hundreds of thousands of combinations. Computing the impact (the diff) itself in real time is very expensive, if not impossible.

To process a user's request, we need to determine in advance which cars the spare part from an ad fits and pass this information to the search engine for indexing. At the same time, ads can be uploaded via feeds with tens of thousands of SKUs – recalculating compatibility for each one is expensive.

We frequently improve the compatibility calculation algorithms, so we need to be able to immediately see how a change affects the product quality metrics, which requires recalculating all the data. During development we cannot wait several months for everything to be updated; being able to recalculate everything quickly is the key advantage that lets us respond to changes in data quality as fast as possible.

Approaches we tried:

  • brute-force row-by-row calculation: we take each spare part in turn and compute the list of cars it fits. This takes 8+ hours and puts an intensive load on the database in the form of hundreds of millions of queries;

  • a pure SQL implementation in PostgreSQL: it processes the data for over 20 hours while eating up CPU and IO. The calculation logic is complex and requires intersecting multi-million-row tables, iterating over them, and creating temporary views. Running such calculations on a loaded production database is impossible, and maintaining a copy is expensive in terms of resources;

  • changing the data storage structure and creating a sharded database with “part–car” links. Reads become very fast, but you would have to store up to a billion additional rows that are difficult to update. For example, adding an alias for a brand can affect tens of millions of “part–car” links, leading to simultaneous writes to a large number of shards, while the links themselves still have to be calculated from all the reference data. And we intensively change the factors that affect the link calculation;

  • a similar case at a spare-parts distributor required about three weeks of background processing by several PHP workers configured not to overload the database.

Solution

We could draw a beautiful, complex diagram with several processing options, including a DWH, columnar databases, graph algorithms and ML. Or demand a sharded cluster of N dozen servers and spend six months on development with a dozen engineers. But we are a product team, and that is not our approach.

What if we can calculate all the options in 10 minutes? And on a small standard instance without additional hardware? It sounds fantastic, but we did it.

The chosen configuration is 4 CPU cores and 15 GB of RAM (the size of a small server). Such limits are a serious challenge for the “optimization master” rubric. Time to unpack Golang and pprof!

The plan of action is as follows:

  1. read all the necessary data into the application's memory;

  2. build efficient indexes that take the specifics of the task into account;

  3. try all the combinations;

  4. once a day, run a background calculation of the links and hand them over as a dictionary for indexing in the search engine.

The number of auto parts in the world is not infinite. Based on ten years of experience working with various suppliers of data on parts, I can assume that we already have most of the data at our disposal. Even a two-fold increase is unlikely; rather, we should expect a gradual decrease in volumes due to more compact storage.

Step one: read the data

Before you start arranging data in memory, you need to learn how to read it from the DB quickly. There are many pitfalls hidden here. The most effective reading method:

(pseudocode):

SELECT id, number, vendor FROM myTable WHERE id > $x ORDER BY id ASC LIMIT $y

Here “x” is the last read identifier and “y” is the batch limit, tuned to the DB settings. We read all the rows from the DB in batches, by primary key. Too large a “y” (limit) can exhaust the query's work_mem (Postgres), which leads to a temporary file being created on disk and increased write IO with all the ensuing consequences – we do not need that. We pick the right “y”; in our case it is several hundred thousand records.
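
A rough Go sketch of this batched reading using database/sql; the table and column names follow the pseudocode above, and the handle callback is an illustrative placeholder:

import "database/sql"

// readAll reads the whole table in batches by primary key (keyset pagination).
func readAll(db *sql.DB, batch int, handle func(id int64, number, vendor string)) error {
	var lastID int64
	for {
		rows, err := db.Query(
			"SELECT id, number, vendor FROM myTable WHERE id > $1 ORDER BY id ASC LIMIT $2",
			lastID, batch)
		if err != nil {
			return err
		}
		n := 0
		for rows.Next() {
			var id int64
			var number, vendor string
			if err := rows.Scan(&id, &number, &vendor); err != nil {
				rows.Close()
				return err
			}
			handle(id, number, vendor)
			lastID = id
			n++
		}
		err = rows.Err()
		rows.Close()
		if err != nil {
			return err
		}
		if n < batch {
			return nil // the last, incomplete batch has been read
		}
	}
}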

When I started reading data in large volumes, I hit severe spikes of RAM consumption in the application. After reading 450 million rows from the database, I ended up with dictionaries of about 10 Gb and peak process consumption of about 18 Gb of RAM.

Armed with a profiler, I found that the main “leak” occurs in several places at the DB driver level. For example, receiving and converting text column data (Scan) goes through creating a byte slice, which escapes to the heap. In different driver implementations and versions the picture is similar: somewhere there are more extra allocations, somewhere fewer.

At the same time, I naively expected the GC to come and clean everything up, but no, the GC still didn't come…

The thing is that by default Go runs with a GC memory trigger of 100%: a collection is triggered when the ratio of freshly allocated data to the live data remaining after the previous collection reaches this percentage.

If an application holds 5 Gb of live data, the memory trigger for the GC fires only after another 5 Gb has been allocated. With small amounts of data inside the application such behavior goes completely unnoticed, but with volumes of tens of gigabytes it becomes noticeable and critical.

The debug.SetGCPercent(5) setting, which triggers collection once 5% of additional memory has been allocated, saved me. The insight did not come immediately; at first the problem was solved by a forced runtime.GC() call, which was not as effective.

Now the GC was triggered much more often and freed memory, which is what was required. Since the worker does not process realtime requests, this did not have any negative effects.
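
For reference, a minimal snippet of this GC tuning (setting GOGC=5 in the environment would have the same effect):

import "runtime/debug"

func init() {
	// Collect when the heap grows by 5% over the live data remaining after
	// the previous collection; the default GOGC value is 100.
	debug.SetGCPercent(5)
}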

Next, we needed methods for storing the initial data efficiently – arranging it so that, while trying combinations, lookups run at near-constant speed. This did not happen without numerous tricks.

Cross index

A cross is a link between analog parts, part1→part2; in the DB it is stored as two int64 columns. To search for all analogs of a specific spare part efficiently, the data has to be prepared so that, given a part ID, you can get the whole list of analogs without scanning all the data. Here a map of the following shape worked:

part1 → [part2, part3]
part2 → [part1]          
part3 → [part1]       
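
A minimal sketch of building such a bidirectional index from the part1→part2 rows (the function name and input type are illustrative):

// buildCrossIndex turns part1->part2 rows into a bidirectional in-memory index:
// for every pair, the analog is recorded in both directions.
func buildCrossIndex(pairs [][2]int64) map[int64][]int64 {
	idx := make(map[int64][]int64, len(pairs))
	for _, p := range pairs {
		idx[p[0]] = append(idx[p[0]], p[1])
		idx[p[1]] = append(idx[p[1]], p[0])
	}
	return idx
}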

Such a map took up about 10 Gb of RAM, while about half of the links never led to a compatibility match during the calculation (the cross parts had no direct link to a car). Storage had to be optimized.

Optimization #1

We decided to move the loading of the cross reference data to the very last stage, when the rest of the data is already loaded and it is possible to tell whether a cross part has any compatibility. A step was added that checks whether it even makes sense to keep a cross in memory: before writing the parts from a cross into the map, we look up all related parts for the part1→part2 pair and check whether at least one of them has compatibility with a car. If no compatibility is found, there is no point in putting the cross into the map.

This optimization reduced the index size by ~25%.

Optimization #2

All parts of related brands were “collapsed” into one variant with the smallest brand id (replaced with a single related part). This is acceptable in our case, since this data is used only to obtain compatibility.

Example:

27310-3N420 Hyundai partId: 1, brandId: 1   
27310-3N420 Kia     partId: 2, brandId: 2  
27310-3N420 Mobis   partId: 3, brandId: 3

is replaced with

27310-3N420 Hyundai partId: 1, brandId: 1

Only the crosses that could lead to compatibility with a car remained in the index, while duplicate combinations that produced identical results were removed.
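
A sketch of the collapsing itself, reusing the hypothetical Part type from the earlier sketch:

// collapseRelated keeps a single representative for a group of related-brand
// parts that share the same article number: the one with the smallest brand id.
func collapseRelated(group []Part) Part {
	best := group[0]
	for _, p := range group[1:] {
		if p.BrandID < best.BrandID {
			best = p
		}
	}
	return best
}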

We managed to reduce the size of the cross index from 20 Gb (in the database) to 1.8 Gb (in the application).

The parts index is a Map with tens of millions of elements.

It is well known that Go does not like large maps with reference types (the GC has to scan them), yet the search for related-brand parts has to be done by SKU, which is a string.

Here the trick of taking a CRC32 hash of the article string came in handy. The hash is used as the map key, which reduced the size of the memory consumed (about 3 Gb) and made life easier for the GC.

index[uint32][]entity.Part

We are not afraid of hash collisions: during processing we go through the list of spare parts anyway, checking the article number and brand when searching for related names.
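
A sketch of this index and the collision-tolerant lookup, again reusing the hypothetical Part type from the earlier sketch:

import "hash/crc32"

// skuKey hashes a normalized article number with CRC32 (IEEE) for use as a map key.
func skuKey(article string) uint32 {
	return crc32.ChecksumIEEE([]byte(article))
}

// lookupSKU returns the parts stored under the article's hash, filtering out
// hash collisions by re-checking the article itself.
func lookupSKU(index map[uint32][]Part, article string) []Part {
	var out []Part
	for _, p := range index[skuKey(article)] {
		if p.Article == article {
			out = append(out, p)
		}
	}
	return out
}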

Why CRC32:

There were experiments with xxHash; in this case it gave only a very insignificant performance gain. I also checked several more algorithms with Go implementations from the list at https://github.com/rurban/smhasher – none of them provided any particular advantage. The hash calculation itself was not a bottleneck.

To try combinations quickly, two ways of accessing the data were required:

  • by identifier (partId), for working with crosses;

  • by article hash, to search for spare parts of related brands.

So I had to store an additional index:

part[int64]entity.Part

This increased memory consumption by ~1 Gb but significantly sped up processing. Unlike a DB index, the entities themselves are stored here as well, which also adds to the consumed volume.

Compatibility Index

In the database, the data on which cars a spare part fits is stored as partId–carId rows. This data serves as the basis for further compatibility calculations.

In the application, I used a map where the key is the part identifier, and the value is a slice with the identifiers of the cars to which this part fits.

 compatibility[int64][]int64

Why not store the “raw” data in this form in the DB right away? We intensively collect information from various sources and need to be able to update them independently. With such a storage structure, the data-collection workers would conflict over rows during updates and cause locks.

Comparison of index sizes

                 In the application     Database index size
Spare parts      6.2 Gb (with data)     4.8 Gb
Crosses          1.8 Gb                 20 Gb
Compatibility    1.4 Gb                 7.6 Gb

The application managed to store the data more compactly and efficiently than the database.

Calculating the combinations

  • we go through the spare-parts index and push the parts into a buffered processing channel;

  • the processing channel is read by workers (8 goroutines);

  • a worker receives a part id as input, forms a ByteBuffer in the agreed format as output, and sends it to a buffered write channel;

  • the write channel is read by a single goroutine, which is responsible for the actual writing to the file;

  • the file is sent to the search system for further processing (a sketch of this pipeline is shown below).
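
A minimal sketch of this pipeline; the process stub and buffer sizes are illustrative, and the real output format and writer are more involved:

import (
	"io"
	"sync"
)

// process is a hypothetical stand-in: it computes the compatibility for one
// part and serializes it into the agreed output format.
func process(partID int64) []byte { return nil }

// run fans part IDs out to worker goroutines and funnels the results into a
// single writer goroutine that owns the output file.
func run(partIDs []int64, workers int, out io.Writer) {
	jobs := make(chan int64, 1024)
	results := make(chan []byte, 1024)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ { // 8 goroutines in our case
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				results <- process(id)
			}
		}()
	}

	// The single writer goroutine: the only place that touches the file.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for buf := range results {
			out.Write(buf)
		}
	}()

	for _, id := range partIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	close(results)
	<-done
}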

The output is about 1 billion combinations. A file with that many lines would be very large, so we came up with a way of “collapsing” the lines – logical data compression. The result can be thought of as a CSV file with a non-standard structure: the part data, plus a JSON array with the list of car identifiers at the end.

A file with up to 50 million lines is much better, but still a lot.

We split the data into several files and add stream compression.
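
For illustration, a sketch of streaming one output shard through on-the-fly compression; gzip here is just one possible codec chosen for the example:

import (
	"compress/gzip"
	"os"
)

// writeShard streams one output shard to disk with on-the-fly compression.
func writeShard(path string, lines <-chan []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	defer zw.Close() // runs before f.Close (LIFO), flushing the gzip stream

	for line := range lines {
		if _, err := zw.Write(line); err != nil {
			return err
		}
	}
	return nil
}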

Other optimizations

During development, many minor optimizations were added:

  • abandoning sync.Pool in favor of a free list;

  • widespread use of local buffers, following Dmitry Vyukov's article “Debugging performance issues in Go programs” on the Intel Developer Zone blog;

  • passing buffers as arguments;

  • replacing map with slice where possible;

  • fixing even the smallest allocations, down to using strconv.AppendInt for writing an int into a []byte more optimally (see the sketch below).
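
For example, a small sketch of the strconv.AppendInt pattern, appending numbers straight into a reusable byte buffer without intermediate string allocations (the output format here is purely illustrative):

import "strconv"

// appendPair writes "partId;carId\n" into buf, avoiding the allocations that
// fmt.Sprintf or strconv.Itoa plus string concatenation would produce.
func appendPair(buf []byte, partID, carID int64) []byte {
	buf = strconv.AppendInt(buf, partID, 10)
	buf = append(buf, ';')
	buf = strconv.AppendInt(buf, carID, 10)
	return append(buf, '\n')
}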

I would like to cover optimization and profiling in more detail, but that would require a separate article, even longer than this one.

Perhaps in a follow-up I will tell how the calculation stage could be made twice as fast (at the cost of a randomized line order in the results, which is hard to compare during testing), as well as about the subtleties of the language itself and other insights from this case.

Results

Example of a test run:

Reading data from the database into indexes (about 450 million rows)    11m8s
Data processing (8 goroutines)                                           1m40s
Total processing time                                                    12m48s
Total size of indexes                                                    9.5 Gb
Maximum memory consumption at runtime                                    10.77 Gb
Spare parts selected                                                     31 million
Combinations found                                                       703 million

Experiments and development of the first version took several days; the production version was ready within a sprint, and no additional hardware resources were required. The data volume has grown since then, and new optimizations and improvements have appeared – this article describes logic close to the final version.

The system has been working in production for two years and copes with the load.

P.S.: It is quite possible to design a storage structure for a specific task more optimally than general-purpose solutions offer. In this case, the most effective optimizations came from accelerating the transformation of the source data stream into its final state by skipping or simplifying transformation steps. Usually this is closely tied to the specifics of the subject area, and a detailed understanding of it opens up the broadest opportunities for optimization.

A complex technology stack is not always necessary to solve a problem; you can get by with a basic set of tools you know well, systems thinking, and an understanding of the subject area.

Thank you for taking the time to read this article! Ask questions in the comments – I will be happy to answer them and discuss the accumulated experience. See you there!

Subscribe to the AvitoTech channel in Telegram, where we talk more about the professional experience of our engineers, projects and work at Avito, and also announce meetups and articles.
