How we solved the problem of segmenting business objects

Hello! My name is Vladimir, I am the head of the development and testing department at SIGMA. Today I want to tell you how our team improved the customer’s CRM system. It is used to control all kinds of communications with clients – from calls to the hotline and correspondence in instant messengers to office visits and mailings. Architecturally, CRM is designed in such a way that it can support the provision of almost any service, but historically it has been focused on interaction with clients of energy supply companies.

Our task was to write a subsystem that would let us configure conditions and segment the client base according to them. Clients that meet the specified conditions fall into the corresponding segment. The customer needs this function in order to build a dialogue with clients that takes into account their psychological profile and preferences, as well as to offer targeted services.

We needed to implement the ability to classify core business objects without detailed data analysis. Without much hesitation, we generalized the task to the wording: “any business object that meets the conditions falls into a certain segment.”

There are many known ways to solve this problem (essentially a classification problem), up to and including unsupervised machine-learning models, where we do not know in advance what we are looking for. Our implementation is not that complicated: it operates with specific categories known to the analyst and only checks which of the objects under consideration match those categories.

The purpose of this article is to show that with the right approach it is easy to scale the solution and to reuse individual parts of the subsystem for other needs. In plain language, this is a story about what a thrill it is to build systems out of functional blocks that, like Lego bricks, can be snapped together and rearranged to ultimately produce a complex working structure.

First iteration

From this statement it immediately follows that we must have at least two large entities – segments and conditions.

To begin with, let’s present the simplest set of entities in the “Segmentation” block. Let’s immediately take into account that the object for a segment can be any entity – not just a client, as in the original task.

Note that there is no requirement to maintain a history of segment composition changes. It is enough to periodically recalculate this composition, removing objects that no longer meet the condition, and adding those that have come into compliance since the last calculation. And in order to know when the segment was calculated last time, we introduce another table – the segment log. Using it, we will also determine that the segment is already in the process of calculation and does not need to be touched.
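To make the model concrete, here is a minimal sketch in Python. The entity and field names are illustrative, not our actual schema; in production this all lives in database tables. The sketch mainly shows how the segment log doubles as a “calculation in progress” lock:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Segment:
    """A segment groups objects of one system entity type."""
    id: int
    name: str
    object_type_id: int                  # reference to the directory of system entities
    condition_id: Optional[int] = None   # the condition the segment is calculated by

@dataclass
class SegmentLogEntry:
    """One calculation run of a segment."""
    segment_id: int
    started_at: datetime
    finished_at: Optional[datetime] = None

def is_being_calculated(log: list, segment_id: int) -> bool:
    """The segment is busy if its latest log entry has not finished yet."""
    runs = [e for e in log if e.segment_id == segment_id]
    return bool(runs) and runs[-1].finished_at is None
```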

Here and further I omit details about:

· auxiliary tables such as availability for systems, logs, directories, etc.

· technical fields, audit fields, logical deletion, etc.

· an insane number of functional checks that prevent something from being saved in the wrong format, composition or order.

· elements of fault tolerance that allow you not to interrupt the process of calculating the entire batch if one “crooked” object is encountered, and at the same time not to lose sight of the errors that the process encountered.

Now to the conditions. By the time the segmentation task appeared, a condition constructor had already been implemented in the customer’s CRM system. It was used to build branching business processes. The only thing that didn’t suit us was that the calculation was performed for one object at a time; it needed to be reworked to support mass calculation. However, here I will describe the architecture of the condition builder as if we were making it from scratch, from simple to complex. So, the simplest set of entities for conditions.

Let's decipher the meaning of the fields.

1. The operands on the left and on the right are the new entity that we introduce. It is clear that to calculate the conditions we need data, and we must somehow obtain it from sources. This is a non-trivial process, and also potentially useful for other tasks, so it is worthy of independent implementation.

2. An operator is a Boolean function. We used ==, !=, >, <, >=, <=, is null, is not null, like and other standard operations that you see in any more or less decent grid filter. Note that not all operations require a right-hand side.

3. On the right there can be:

· a constant, which is what happens most often. It is best to make this field type jsonb or text, since it can take very different values.

· another operand. This can be useful if we need to compare two attributes over different periods. An example of a ready-made condition with an operand on the right is “Profit for the last month” – (operand on the left) “less” – (operator) “profit for the current month” – (operand on the right).
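As an illustration of the operator set, here is a simplified sketch. Our real implementation lives in the database and handles typing and SQL `like` semantics more carefully; in particular, the `like` translation below is deliberately naive:

```python
import operator
import re

# Boolean operators of the condition; unary ones ignore the right-hand side.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "is null": lambda left, _right: left is None,
    "is not null": lambda left, _right: left is not None,
    # naive SQL-like matching: % -> any run of characters, _ -> any one character
    "like": lambda left, pattern: re.fullmatch(
        pattern.replace("%", ".*").replace("_", "."), str(left)) is not None,
}

def evaluate(left, op_name, right=None):
    """Apply an operator to the left and (optional) right operand values."""
    return OPERATORS[op_name](left, right)
```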

We introduce a new entity – tag. It will act as our operand. Like the condition, the tag was already implemented in the system even earlier and was used for different purposes. For example, a customer used tags in the message builder to send personalized SMS. But I will also describe this entity here as if we were inventing it from scratch, from simple to complex. So, a simple tag model.

Now we can get the value of any column of any database table by its ID. The ID should arrive among the incoming parameters of the running process. To do this, we had to expand the directory of system entities with data about their physical address.

Now that we have a condition, we link segmentation to it. Naturally, the segment itself must refer to the condition; that is, each segment is calculated according to one condition.

The model is ready. Using such a model, we can already calculate the most primitive segment, when just one direct attribute of our object corresponds to a simple Boolean condition. Functionally it will work like this:

1. We begin calculating the segment.

2. Using the reference to the object type, we find the IDs of all instances of the system entity.

3. For each object, we transfer the ID into the condition calculation.

4. The condition causes the tag(s) to be evaluated, passing the same ID.

5. The condition performs an operation on the results and returns a Boolean value to the segment.

6. If true is returned, we place the object in the segment; if false, we remove it from the segment.
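The six steps above can be sketched roughly like this. The `condition` callable here stands in for the whole tag-and-operator pipeline (steps 3–5), and the names are illustrative:

```python
def calculate_segment(object_ids, condition, segment_members):
    """Run the condition for each object and update segment membership.

    `condition` takes an object ID and returns a boolean, hiding the
    tag evaluation behind it (steps 3-5 of the algorithm above)."""
    for obj_id in object_ids:
        if condition(obj_id):
            segment_members.add(obj_id)      # true -> object goes into the segment
        else:
            segment_members.discard(obj_id)  # false -> object leaves the segment
    return segment_members
```

Note that the update is idempotent: recalculating the same segment twice with the same data leaves the membership unchanged.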

Let's depict it in the form of a Venn diagram.

What's wrong with this functionality?

1. The main complaint is the meager capabilities it provides. Nobody needs functionality that can only compare one field from the same table where the objects for the segment are located.

1.1. We need to make it possible to operate not only with flat table data, but also to obtain arbitrary slices of data in the plane of our object.

1.2. One condition is not enough; it is necessary to make it possible to construct compound conditions from several operations.

2. The second complaint concerns performance.

2.1. Iterating through tables with millions of rows one row at a time is an unaffordable luxury, so we need to organize mass calculation of conditions and mass retrieval of tag values.

2.2. It is desirable to be able to somehow narrow the range of possible objects at the entrance to the segment, and not simply take all the data from the table as is now.

Second iteration

Let's take things in order. To customize data slices (1.1.), we introduce a new type of tag – function tags. And let's call those that came before system tags. To obtain data, function tags do not simply access a system entity but call a certain function. Let's expand our data model.

Now, if there is a link to a system entity, this is a system tag, and if there is a link to a function, it is a functional tag. It is possible to organize another way to refer to the source in order to avoid “chessboard” data storage, but we decided not to sacrifice FK for the sake of aesthetics.

The parameter type is not about the data type. This field can take one of the following values:

1. Constant. Then in the “Value” field we will write the value that we want to send to the function.

2. Custom. This is the parameter we expect from the outside. For example, if the consumer of the functionality is segmentation, we expect this parameter from either the condition setting or the segment setting. There are no such options there yet, but we will get to them.

3. Tag. In the “Value” field we will enter the tag ID. This parameter also comes from outside, but it is not configurable/user-defined. Why do we refer to a tag? This is quite convenient – the tag table already describes what the field is called and what data type it has. In addition, system and functional tags often require approximately the same input fields. In the context of segmentation, it is the object ID parameter that is passed with this type.

The functions we can use for segmentation have the following limitations:

1. The required input parameter is an array of object IDs. This way we simultaneously solve the performance problem (2.1.) by transferring data in chunks at once.

2. The function returns a table of data.

3. The outgoing table must have an “Object ID” field with the same name as the incoming parameter.

4. The table must be unique by this field. That is, we receive a slice of arbitrary data, but strictly in the plane of the object under study.
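These limitations are easy to express as a runtime check. A sketch, with made-up field and function names:

```python
def check_tag_function_output(object_ids, rows, id_field="object_id"):
    """Enforce the function-tag contract described above: every row carries
    the object ID field, each object appears at most once, and only
    requested objects appear in the result."""
    ids_seen = [row[id_field] for row in rows]  # KeyError -> contract violated
    if len(ids_seen) != len(set(ids_seen)):
        raise ValueError("output must contain at most one row per object")
    if not set(ids_seen) <= set(object_ids):
        raise ValueError("output contains objects that were not requested")
    return rows
```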

We can now even pre-filter which conditions are available for a segment. The system entity ID of all tags participating in the condition must strictly match the object type ID of the segment itself.

On the condition side, we are also making some changes to the model for this point.

The “Right operand” field is a boolean flag that indicates whether a parameter belongs to the left or the right operand.

This approach allows you to pass Custom parameters to tags. This way we can call the same function tag for different conditions, but with different parameters. For example, a tag calculates the number of contacts with a client over a certain period. Then you can configure the conditions “number of contacts per month > 0” and “number of contacts per year > 6” with this tag, passing the number of days as a user parameter.

Now let's look at point 1.2. — namely, “Compound Conditions.” Making them is not that difficult. You just need to combine simple conditions into groups using the AND or OR operator. Let's add a feature to the model.

We added the “Conditions Group” field, which will contain an array of IDs of conditions combined into a group. And the operator will be registered in the “Operator” field. Why did we again choose “chessboard” data storage, rather than selecting a junction table? Both the compound and the simple condition are equally of interest to us precisely as conditions: in fact, for processes they are indistinguishable and perform the same function – they return a Boolean calculation result. As a result, we get something as close as possible to a grid filter like this:
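Here is a rough sketch of how such indistinguishable simple and compound conditions can be evaluated. The dict-based representation is illustrative, not our storage format; the point is that callers see only a boolean, whichever kind of condition they hold:

```python
def evaluate_condition(cond, obj_id):
    """A condition is either simple (a predicate) or a group of conditions
    combined with AND/OR; both kinds return a boolean, so callers cannot
    tell them apart -- mirroring the flat storage described above."""
    if "group" in cond:                               # compound condition
        results = (evaluate_condition(c, obj_id) for c in cond["group"])
        return all(results) if cond["operator"] == "AND" else any(results)
    return cond["predicate"](obj_id)                  # simple condition
```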

Now let's do some optimization. Under point 2.1. we have already organized batch transmission of object IDs to the conditions. I will only note that we had to rack our brains over this point. The fact is that single conditions can be calculated based on several types of system entities, even a dozen at once. There may be an elaborate condition based on the client’s attributes plus some payment information. The main thing is that all the necessary parameters are passed to the calculation input. But for a batch approach such a complication is difficult to implement, and for segmentation it is completely redundant. Therefore, we introduced specialization and restrictions on conditions: you can either send a batch of IDs, in which case the condition must depend entirely on one entity, or send a single ID for an arbitrary number of entities. If a condition depends on one entity, it can be used by any functionality – batch or single.

In terms of the process, we began to divide the data from the source into chunks of 10,000 records and calculate the conditions for the entire chunk, then immediately save the result into the segment and move on to the next chunk. That way, if the system fails, we lose only part of the segment, and that part can simply be recalculated by starting the process again.
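A minimal sketch of this chunked process, with illustrative names (the real evaluation and persistence happen in the database):

```python
def iter_chunks(object_ids, size=10_000):
    """Split the source IDs into chunks of `size` (10,000 in our process)."""
    for start in range(0, len(object_ids), size):
        yield object_ids[start:start + size]

def calculate_in_chunks(object_ids, evaluate_chunk, save_chunk, size=10_000):
    """Evaluate and persist chunk by chunk, so a crash mid-run loses at most
    the current chunk; everything already saved survives a restart."""
    for chunk in iter_chunks(object_ids, size):
        save_chunk(evaluate_chunk(chunk))  # persisted before the next chunk starts
```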

For point 2.2. we wrote a number of area functions that can be attached to a segment. The idea was to identify frequently used data slices and narrow the sample as much as possible before starting the calculation. For example: “all individual (natural-person) clients”, “all female clients” or “all clients for whom any changes have occurred over the past week”. These functions have no input parameters other than the segment ID, which is needed only to skip objects that have already been calculated. And they output only an array of object IDs. The scheme is like this:

We need the ID of the system entity in the new table so that for the elephant segment we can select only the elephant range function and not accidentally count hippos 🙂

Let's look at what we got. Currently we can:

· Create any segments for any objects that we can reach and unambiguously connect them with the environment.

· Designate the area of segmentation objects in advance.

· Use tree-like conditions of arbitrary complexity and nesting level in these segments.

· Conditions, in turn, can receive arbitrary data from any available sources, including REST API services.

· As a bonus, the parameters of these functions can be passed in several different ways from different stages of the segment calculation.

· I haven't addressed this separately, but it may not be obvious: since conditions can now be compound, their operands may draw data from the same sources with the same parameters. We optimized the process so that when calculating a segment, each function with a unique set of input parameter values is called only once.

This will already pass for MVP, but for complete happiness a couple more details are missing:

1. Now all segments are independent of each other, and sometimes there is a need to somehow group them. The simplest example is when we have several segments for certain conditions and another one labeled “rest”. That is, it should include all remaining objects from the area that did not fall into other segments of the group. Even now it would be possible to simply make a condition that would check if the object is in the segment, but we found this inconvenient.

2. There are still questions regarding performance, in particular regarding the calculation of conditions for similar segments using data from the same sources. When calculating on the same day, this data is still requested independently, rather than reused from the previous request. When tens of millions of objects are segmented, even if each 10,000 chunk is processed in a couple of seconds, the entire segment is still calculated in several hours. And using already calculated tags would be very helpful.

Third iteration

To solve the first problem, we added the ability to include/exclude some segments into others.

The interaction type can take the values “included” and “excluded”. If “included”, a child segment can only contain objects from its parent. If “excluded”, on the contrary, the child cannot contain the parent's objects.

I will say right away – as can be seen from the model, we did not construct multi-story “trees” with AND/OR relationships. If a segment is included in multiple parents, its objects must be represented in each of them. Conversely, if a segment is excluded from multiple parents, no object from any of the parents should be in it.
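These semantics can be sketched with plain sets (names are illustrative): each “included” parent intersects the candidate set, each “excluded” parent subtracts from it, with no AND/OR trees between the links:

```python
def apply_segment_links(candidates, links):
    """Apply parent links to a child's candidate objects.

    `links` is a list of (parent_members, kind) pairs, where kind is
    "included" or "excluded". All links are combined with plain set
    intersection/difference -- no AND/OR trees, as described above."""
    result = set(candidates)
    for parent_members, kind in links:
        if kind == "included":
            result &= parent_members  # child may only hold the parent's objects
        else:
            result -= parent_members  # child must not hold the parent's objects
    return result
```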

This seemingly small innovation gave rise to quite a lot of difficulties and conceptual solutions:

1. The fact is that these interactions are essentially segment conditions, which means that the conditions themselves now become optional. You can build an entire segment only on its relationship with others.

2. Less obvious is that the “inclusion” of a segment into other segments is also a function of the area, which means that the area is also now optional. Moreover, the logic differs in different situations. If we only have the area function or inclusion, we delineate the area with this available set. If we have both of these mechanisms, we are forced to take only what is at their intersection.

3. There was a need to take into account the order of calculation of segments, so that descendants were always calculated after their parents. Moreover, even now you need to make sure that the connections do not become looped.

The Venn diagram of different sets has become more interesting. Let's look at the basic operations on sets.

Now working with a segment occurs in two stages:

1. First, the final area is calculated (in the diagram this is a set of sectors 1-4), and the first step is to immediately remove from the segment all objects that are not included in it (in the diagram this is the entire orange sector, conventionally marked with the number 5).

2. Then the conditions are calculated for the entire range 1-4. Sector 2 objects are added to the segment, and sector 4 is removed from the segment.

Now let's solve the problem of reusing tag values. Interestingly, we solved it even a little more efficiently than we originally planned. According to the original idea, we wanted to do something like this:

But such a solution would be incomplete, because we are interested in minimizing source calls as such. Therefore, it would be more correct to focus on the functions of the tag, and not on the tags themselves. But a function can have different parameters for calling it, which means that it is necessary to record unique calls, and not the function itself. As a result, we came to this decision:

The table is not associated with any business entity. The binding is a text code unique for each process. To make it unique, it includes the name of the function and the values of its customizable parameters. For example, we have a function that returns quantitative data on contacts for a period, by client. As input we submit a chunk of 10,000 clients and the number of days in the period. The chunk is a required, non-customizable parameter, so we ignore it, but the period can change. The process code is then built as &lt;function name&gt;_&lt;parameter value&gt; (for example, cnt_contact_365 or cnt_contact_30).
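A sketch of how such a process code could be assembled. The parameter names (`object_ids`, `days`) are assumptions for the example:

```python
def build_process_code(function_name, params, ignored=("object_ids",)):
    """Concatenate the function name with the values of its customizable
    parameters; mandatory non-customizable parameters (the chunk of IDs)
    are skipped. Parameters are sorted by name so the code is stable."""
    custom = [str(value) for name, value in sorted(params.items())
              if name not in ignored]
    return "_".join([function_name, *custom])
```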

For all tagging functions, we have added the optional ability to use a special layer. The algorithm is like this:

1. The function generates the process code.

2. Selects the chunk objects for which the new table does NOT contain a “process code – object ID” pair with a receipt date no earlier than today (this horizon can be extended if the data goes stale more slowly).

3. Calls the tag function for the selected objects.

4. Saves the obtained values into the table – moreover, all the returned values, not just the one needed for the current tag. The tag value field is jsonb, so it fits any set of output data.

5. Using a query, we obtain from our table the values of our tag for each object and return them to the condition calculation.

Thanks to this approach, if we call the same function today for another tag or from another segment, we will find no uncounted objects at step 2 and will go straight to step 5, bypassing the expensive step 3.
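The whole five-step algorithm can be sketched like this. This is an in-memory stand-in: `cache` plays the role of our table, `compute` is the real tag function, and all names are illustrative:

```python
from datetime import date

def cached_tag_values(process_code, object_ids, compute, cache):
    """Steps 2-5 of the caching layer.

    `cache` maps (process_code, object_id) -> (calc_date, values).
    `compute` takes a list of IDs and returns {object_id: values};
    it is called only for objects that are missing or stale (step 3)."""
    today = date.today()
    missing = [oid for oid in object_ids
               if (process_code, oid) not in cache
               or cache[(process_code, oid)][0] < today]              # step 2
    if missing:
        for oid, values in compute(missing).items():
            cache[(process_code, oid)] = (today, values)              # step 4
    return {oid: cache[(process_code, oid)][1] for oid in object_ids}  # step 5
```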

To summarize:

1. We can:

· segment arbitrary business objects;

· configure basic dependencies of segments on each other;

· use arbitrarily complex multi-level conditions based on arbitrary data from available sources.

2. The entire process is algorithmically optimized so that, if possible, no actions are performed twice.

3. As a bonus, almost all elements of the system can be and are used to solve other problems – from calculating the conditions of marketing promotions to sending messages. Even the segments themselves have a “manual” version, which I left out of scope. It allows a segment to be filled in directly and via the API, so objects can be labeled with subjective labels.

Thank you for your attention! I will be glad to answer your questions.
