His Majesty the Graph

Graphs are a special topic for me: there is something mysterious and powerful about them.

We didn't study graph theory at school or university, and at work the word never comes up. But graphs are everywhere, and your life gets much easier once you learn to see them and use the rich body of visualization tools and algorithms built around them.

I won't go into the basics of graphs; Wikipedia covers them.

The purpose of this article is to show some cases from my practice where graphs became a natural part of a task. Sometimes the task was impossible to solve without them. Sometimes the solution was simply more elegant with them. And sometimes it was sheer perfectionism: graphs are cool)

Well, let's go, it will be interesting!

1. Remove uninteresting columns from the table (connected components)

Well-designed tables are important, and data quality is always my priority. That's why I am developing my library, FlatTableAnalysis. It performs a basic architectural analysis of a table: it highlights the primary key and finds interesting functional dependencies between columns.

A table may contain columns that mutually functionally define each other (i.e. there is a one-to-one relationship between them).

Visually one-to-one

| Product | Manager | City     | Sales |
|---------|---------|----------|-------|
| Lemon   | Peter   | Moscow   | 3     |
| Banana  | Vasya   | Moscow   | 22    |
| Pumpkin | Dasha   | Tula     | 3     |
| Pumpkin | Dasha   | Voronezh | 4     |

Product and Manager define each other functionally. But City does not: it differs for Dasha, for example. As a result, the Manager column can be deleted without affecting the architectural analysis of the table.

Such columns are not interesting for architectural analysis (from this point of view they carry no additional information). For simplicity, they should be removed: if two columns relate one-to-one, you can keep just one of them. But remember that a one-to-one relationship can also hold between three, four, or more columns at once.

I started to write long and ugly code to do this. But then I saw a graph in the problem and wrote this instead:

Elegant pseudocode for deleting columns
import itertools as it
import networkx as nx

# first, prepare the data for the graph:
# each column name is a vertex;
# an edge exists only if the two columns are in a one-to-one relationship
edge_list = []
for col_1, col_2 in it.combinations(df, r=2):
    if is_one_one_relation(col_1, col_2):
        edge_list.append((col_1, col_2))

# build the graph
G = nx.Graph()
G.add_edges_from(edge_list)

# I prefer determinism, so I first sort each connected component
# by the order the columns appear in the table, left to right,
# and only then delete everything except the first vertex of each component
ccs = [sorted(cc, key=get_column_pos) for cc in nx.connected_components(G)]
to_delete = [cc[1:] for cc in ccs]

# flatten into a plain list of column names - no more one-to-one pairs!
# assumption: the table header contains no duplicate names
to_delete = list(it.chain.from_iterable(to_delete))

Of course, you can do this without a graph, but the result will be less beautiful and less understandable. Graph code stays close to how a human actually reasons about the problem.
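Here is a self-contained, runnable version of the sketch above on the example table. Note that `is_one_one_relation` below is my naive stand-in, not the library's implementation:

```python
import itertools as it
import networkx as nx

# the example table as a dict of columns
table = {
    "Product": ["Lemon", "Banana", "Pumpkin", "Pumpkin"],
    "Manager": ["Peter", "Vasya", "Dasha", "Dasha"],
    "City":    ["Moscow", "Moscow", "Tula", "Voronezh"],
    "Sales":   [3, 22, 3, 4],
}

def is_one_one_relation(col_1, col_2):
    # one-to-one: among the distinct (col_1, col_2) pairs,
    # no value repeats on either side
    pairs = set(zip(table[col_1], table[col_2]))
    left = {a for a, _ in pairs}
    right = {b for _, b in pairs}
    return len(left) == len(pairs) == len(right)

edge_list = [
    (c1, c2)
    for c1, c2 in it.combinations(table, r=2)
    if is_one_one_relation(c1, c2)
]

G = nx.Graph()
G.add_edges_from(edge_list)

# keep the leftmost column of each component, mark the rest for deletion
pos = {c: i for i, c in enumerate(table)}
ccs = [sorted(cc, key=pos.get) for cc in nx.connected_components(G)]
to_delete = list(it.chain.from_iterable(cc[1:] for cc in ccs))
```

On this table the only one-to-one pair is Product/Manager, so `to_delete` contains just `"Manager"`, matching the reasoning above.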

2. Simplify the graph for visual clarity (transitive reduction)

My library FlatTableAnalysis again. I wrote a visualization of the functional-dependency graph (vertices are table columns, edges are functional dependencies). The resulting portrait is usually complex, and I wanted something simpler. One option is to remove edges that "logically" follow from others.

Transitive dependencies

The edge 1->4 can be omitted: it is already clear that vertex 4 is reachable from vertex 1. Without the edge 1->4 the graph becomes noticeably simpler, and the essence stays the same.

Real graphs have far more than 4 vertices, and all this quickly becomes unreadable.

Of course, this approach is not always applicable, it depends on the task. In my case, it was acceptable to remove such edges.

I didn't even try to do this by hand, because writing such an algorithm manually is not the most pleasant thing. I used networkx; the full code for this simplification of the graph is below. Simple and elegant. And fast, by the way.

Transitive Reduction Code
import networkx as nx

# build the graph
G = nx.DiGraph()
G.add_edges_from(edge_list)

# the whole algorithm is one line :)
G_tr = nx.transitive_reduction(G)

# one small caveat: node and edge data must be copied
# from the old graph into the new one manually
G_tr.add_nodes_from(G.nodes(data=True))
G_tr.add_edges_from((u, v, G.edges[u, v]) for u, v in G_tr.edges)
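A quick check of what the reduction does, assuming the picture above shows the chain 1->2->3->4 plus the shortcut 1->4 (my reading of it):

```python
import networkx as nx

# a chain 1->2->3->4 plus the redundant shortcut 1->4
G = nx.DiGraph([(1, 2), (2, 3), (3, 4), (1, 4)])

# transitive reduction keeps reachability but drops implied edges
G_tr = nx.transitive_reduction(G)

print(sorted(G_tr.edges()))  # [(1, 2), (2, 3), (3, 4)] - the shortcut is gone
```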

3. Assign roommates according to people's wishes (edge graph, maximal independent set)

Once a year we go somewhere on holiday with colleagues. This time we decided to stay overnight, two people to a room. HR sent each of us a form where we marked the people we would like to share a room with. So the output is a table with two columns: "employee" and "list of people they would like to room with".

While filling it out, I kept thinking about how they would process this data later. The task turns out to be hard. Formulated properly: how do we maximize the number of fulfilled wishes? If you assign rooms randomly, or with a simple greedy algorithm, maximization is not guaranteed.

At some point I came to the following algorithm:

  1. Take the graph of people's wishes (a vertex is a person, an edge is a wish)

  2. Keep only mutual wishes; the rest are removed from the graph

  3. Build a new graph in a special way, the edge graph (also known as the line graph): each edge of the old graph becomes a vertex, and if two edges were adjacent in the old graph, there is an edge between the corresponding vertices in the new one

  4. In the new graph, if there is an edge between two vertices, they cannot both be taken. We need to take as many vertices as possible so that no edges remain between them at all, and that is the definition of the maximum independent set problem. I even recorded a short video about it, where everything is explained more clearly.

In reality, people were assigned semi-manually)) but the very fact that graph algorithms emerged here impressed me.
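The scheme above fits in a few lines of networkx. The names are invented, and note that `nx.maximal_independent_set` returns a maximal (not necessarily maximum) independent set:

```python
import networkx as nx

# step 2: keep only mutual wishes - an undirected edge means
# both people named each other
wishes = nx.Graph([("Ann", "Bob"), ("Bob", "Carl"), ("Dina", "Eve")])

# step 3: the edge graph (line graph) - every wish becomes a vertex,
# wishes sharing a person become connected vertices
L = nx.line_graph(wishes)

# step 4: take as many pairwise non-adjacent vertices as possible;
# this is maximal, not guaranteed maximum
pairs = nx.maximal_independent_set(L, seed=42)
```

An independent set in the edge graph is exactly a matching in the original graph, so for a guaranteed optimum `nx.max_weight_matching(wishes, maxcardinality=True)` can be used instead.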

4. Automatically remove inexact duplicates (Minimum vertex cover)

This is my favorite case. There is a list of short strings containing inexact duplicates. Say some threshold is set on the normalized Levenshtein distance. How do we automatically remove all inexact duplicates from the list? To be precise, the task should sound like this:

How do we remove the minimum number of rows so that no remaining pair meets the definition of inexact duplicates?

The first thought: compute Levenshtein for all pairs, and if the distance in a pair is below the threshold (i.e., the rows are inexact duplicates), remove either of the two. It turns out this method does not guarantee minimality, although for real problems I think it is the optimal approach (fast and clear).

An example where the "minimality" condition fails with the algorithm above:

rows A, B, C
A-B are duplicates
B-C are duplicates
A-C are not duplicates

Our algorithm may delete vertices A and C.
That is redundant: deleting just B is enough!

To solve this problem exactly, you need the following algorithm:

  1. Compute all pairwise distances and build a graph (rows are vertices; each edge stores the distance between its two rows)

  2. Remove all edges with distance above the threshold (those pairs are not duplicates => not interesting)

  3. Split the graph into connected components

  4. For each component, apply the Minimum vertex cover algorithm. The logic is simple: we may delete vertices, but as few as possible, and after deletion no edges should remain (at this step every remaining edge marks a pair of inexact duplicates). Minimum vertex cover gives exactly the minimum set of vertices whose removal leaves no edges at all

  5. Finally, remove the set of vertices obtained in step 4 from the original set of vertices.

    For this task I recorded a video; I explain the solution in more detail there.

    In step 4, the Maximal independent set algorithm could have been used instead; the results should be identical, since the complement of a minimum vertex cover is a maximum independent set.
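A runnable sketch of steps 1-5, with invented strings and an invented threshold. Note that networkx ships only a 2-approximation of minimum vertex cover (`min_weighted_vertex_cover`); for a guaranteed minimum you would solve each small component exactly:

```python
import itertools as it
import networkx as nx
from networkx.algorithms.approximation import min_weighted_vertex_cover

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

rows = ["apple", "applee", "appleee", "orange"]
threshold = 2  # hypothetical cutoff: distance < 2 means "inexact duplicate"

# steps 1-2: vertices are rows, edges connect inexact duplicates
G = nx.Graph()
G.add_nodes_from(rows)
G.add_edges_from(
    (a, b) for a, b in it.combinations(rows, 2)
    if levenshtein(a, b) < threshold
)

# steps 3-5: a vertex cover touches every edge, so deleting it
# leaves no duplicate pairs (this approximation is at most 2x optimal)
to_delete = min_weighted_vertex_cover(G)
survivors = [r for r in rows if r not in to_delete]
```

After the deletion no surviving pair is closer than the threshold, which is exactly the condition from the problem statement.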

5. Close costs between production units without cycles (simple cycles, Feedback arc set)

And this is a real production task. In two parts.

First they came to us with a graph: departments and the logical directions in which their costs are closed (onto other departments). The graph contained cycles, and they asked us to find them. Well, that's simple:

list(nx.simple_cycles(graph))
# this can hang: a graph may contain a huge number of simple cycles

# so we did this instead:
itertools.islice(nx.simple_cycles(graph), 10_000)
# and handed the result to colleagues to review))

But of course, the colleagues didn't stop there. They asked for a way to remove cycles so that the graph would become a DAG. That is, the task is to remove the minimum number of edges so that zero simple cycles remain. Here we got stuck: there was no obvious solution in sight. In theory we could dive into graph theory, but you can drown there.
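To make the goal concrete, a tiny invented example: one cycle, and deleting a single edge of it (chosen by hand here) already turns the graph into a DAG.

```python
import networkx as nx

# departments with one cost cycle A -> B -> C -> A, plus a tail C -> D
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

cycles = list(nx.simple_cycles(G))
# exactly one simple cycle here; dropping any of its edges kills it
G.remove_edge("C", "A")

# the graph is now a DAG
assert nx.is_directed_acyclic_graph(G)
```

The hard part, of course, is choosing which edges to drop so that the total number of removals is minimal; that is what a feedback arc set algorithm does.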

Help came from a friend I used to work with. He spotted something I had missed: there is a Python library, igraph, with a function feedback_arc_set(weights=None, method='eades'). Exactly what we needed =) The only thing is that to apply the algorithm, the graph had to be converted from networkx to igraph and back (igraph has two handy utility functions for this: from_networkx and to_networkx).

Surprisingly, this function even handled the next request from colleagues: they asked us not to delete certain edges, because business logic forbade it. That was solved with the weights argument.

So graphs are not only networkx, but sometimes igraph too.

6. Split a dataset into 10 parts taking intersections into account (Balanced Graph Partitioning)

This is perhaps the most difficult case.

Kaggle is a great place to learn more about neural networks, LLM, encoder training, etc. In some ways, better than any course.

There I came across the competition Learning Equality – Curriculum Recommendations. A little about the task: there are topics and there are training materials; both entities are just text strings, and the relationship between them is many-to-many. The plan was to train a bi-encoder (BERT) for a semantic-similarity task. There is a dataset for training. Now the question: how do we split the dataset into 10 K-Folds? (That is what the author of the 1st-place solution on the leaderboard did.)

It seems simple: take the data and split it into 10 parts. But there is a nuance. Since topics and materials are related many-to-many, we need to split so that there are as few intersections between the folds as possible. A visual description of the intersections is under the spoiler.

Material intersections

| No. | Topic | Material |
|-----|-------|----------|
| 1   | 1     | A        |
| 2   | 1     | B        |
| 3   | 2     | A        |
| 4   | 2     | B        |

In the table we have two Topics, but they overlap heavily in Materials. So when splitting into 10 K-Folds it is desirable to place all four rows in one fold; otherwise there will be strong connections between the folds.

The author of the solution wrote a heuristic algorithm (unfortunately, I did not have time to understand it). By the feel of it, it is not exact, but a quite good approximate method. More here.

As usual, I wondered if there was a more precise way to solve this problem?

We have Topics (usually fewer of them) and Materials (more of them).

You can do it like this:

  1. Take our Topics; they will be the vertices of the graph

  2. Each Topic has links to some Materials

  3. Make the edge between two Topics carry the number of Materials they share as its weight. An example for understanding is under the spoiler below.

  4. OK, we have a graph. Now we need to divide it into 10 parts. Ideally we would minimize the sum of the edge weights between all parts (since that is the intersection strength), but I found no solution in that formulation. So let's at least minimize the number of intersections of Topics between the parts.

Visual intersection graph

| Topic | Material |
|-------|----------|
| 1     | A        |
| 1     | B        |
| 1     | C        |
| 2     | A        |
| 2     | B        |
| 2     | D        |

There are two Topics and they "intersect" on two Materials. That is, the graph will have two vertices (the Topics) and one edge between them with weight 2 (the strength of the intersection).
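Building that weighted Topic graph from a list of (Topic, Material) rows can be sketched like this:

```python
import itertools as it
from collections import defaultdict
import networkx as nx

# the (Topic, Material) rows from the spoiler above
pairs = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (2, "D")]

# collect the set of materials per topic
materials = defaultdict(set)
for topic, material in pairs:
    materials[topic].add(material)

# edge weight = number of shared materials between two topics
G = nx.Graph()
G.add_nodes_from(materials)
for t1, t2 in it.combinations(materials, 2):
    common = len(materials[t1] & materials[t2])
    if common:
        G.add_edge(t1, t2, weight=common)
```

On this data the graph has exactly one edge, 1-2 with weight 2, matching the description above.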

After searching, I realized that networkx and igraph won't help: I found nothing similar there. The task can probably be reduced to graph partitioning or community detection, but that would noticeably change the original problem.

I had to look for another way, and I found one: an article and its git repository. The Balanced Graph Partitioning algorithm is almost a perfect fit. The only catch is that edge weights are not taken into account, only the fact of intersection between parts. Quote:

“In this paper we consider the problem of (k, ν)-balanced graph partitioning – dividing the vertices of a graph into k almost equal size components (each of size less than ν · n/k) so that the capacity of edges between different components is minimized.”

I haven't tested the algorithm on the full Kaggle graph: it has hundreds of thousands of vertices and edges, so I suspect there will be speed problems. Still, the solution via Balanced Graph Partitioning is the closest to the mathematically ideal one, which mattered to me.

More examples of tools that are directly or indirectly based on the graph

Visualization – we use the graphviz library, and networkx for simple cases. For dense graphs you can run community detection and draw only the communities; otherwise you may end up with a tangled web.

Trees (a special case of graphs) come up constantly. Reference books are often hierarchies (employees, materials, companies). Contracts can be a hierarchy too (with systematic numbering: 1 – 1.1 – 1.1.1 – 1.1.2, etc.).

OpenStreetMap – recently I solved the problem of obtaining road distances between addresses. The open-source solution is deeply graph-based and built on networkx; parts of the map graph can even be downloaded for local use.

The wolf, goat and cabbage problem (a river-crossing puzzle) – it does not come up often in business)) but it is surprising that the boat size needed for a guaranteed solution is related to the Minimum vertex cover algorithm.

Calculation engine in MS Excel – under the hood it works like this: from all cells with formulas, a dependency graph is formed – a DAG (ignoring cases of circular references). Then a topological sort is done, and calculations proceed sequentially, cell by cell, according to the formulas.
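A toy version of such an engine (cell names and formulas are invented; formulas are plain Python functions here, purely an assumption for the sketch):

```python
import networkx as nx

# A1 and A2 are inputs; B1 = A1 + A2; C1 = B1 * 2
values = {"A1": 3, "A2": 4}
formulas = {
    "B1": (["A1", "A2"], lambda a, b: a + b),
    "C1": (["B1"], lambda b: b * 2),
}

# dependency DAG: an edge "input -> formula cell"
dag = nx.DiGraph()
for cell, (deps, _) in formulas.items():
    dag.add_edges_from((d, cell) for d in deps)

# evaluate cell by cell in topological order
for cell in nx.topological_sort(dag):
    if cell in formulas:
        deps, fn = formulas[cell]
        values[cell] = fn(*(values[d] for d in deps))

print(values["C1"])  # 14
```

The topological sort guarantees that every cell's inputs are computed before the cell itself, which is exactly why circular references must be excluded first.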

And many other things are connected to graphs in some way: BERT self-attention, the PageRank algorithm, finite-state machines, the Python AST, business processes…

How to start using graphs in practice? A couple of tips

See the graph

A graph is a fairly simple and natural data structure. We often see homogeneous entities with connections between them. And graph vertices are not always physical objects (people sometimes assume they are): vertices can be anything, for example states of a system, with edges as transitions between states.

Know basic graph metrics

When it is clear where the graph is, you should immediately ask the following questions about its basic topology:

  1. How many vertices are there?

  2. How many edges?

  3. Are there self-loops (a vertex connected to itself by an edge)?

  4. Are there multiple edges (for example, two edges between the same pair of vertices in an undirected graph; not a common occurrence)?

  5. How many connected components are there (a set of vertices connected to each other directly or indirectly forms one connected component)?
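In networkx these five questions map directly onto library calls (the toy graph is invented):

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (4, 4), (5, 6)])  # (4, 4) is a self-loop

n_vertices = G.number_of_nodes()                  # 1. vertices
n_edges = G.number_of_edges()                     # 2. edges
loops = list(nx.selfloop_edges(G))                # 3. self-loops
# 4. parallel edges require nx.MultiGraph; a plain Graph collapses them
n_components = nx.number_connected_components(G)  # 5. components
```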

Portrait of a graph

Next, you can think about whether to draw the graph's portrait (that is what a visual representation of a graph is called). This often reveals structure that is not visible in the metrics.

Formulate the problem in terms of a graph

And most importantly, switch to graph terminology at once: translate the business task into the language of graphs. After that, a huge arsenal of algorithms, libraries, and knowledge becomes available.
