Her Majesty Sorting

At first it seems that sorting is trivial. In Excel, you click “Sort Smallest to Largest” and everything is fine. Well or sorted() in python. Everything is done by itself.

I got seriously acquainted with sorting in the Yandex.Practicum courses (Algorithms and data structures) three years ago. Good course by the way. Then I gradually began to realize that not everything is so simple.

After that, I often encountered sorting questions in various practical problems. Gradually, the conviction began to form that:

Sorting is one of the most fundamental algorithms on Earth.

Interestingly, sorting is the most recent important concept on my list. Previously it already included: Sets, Flat Tables, Graphs, Hierarchies, Combinatorics, Matrices.

Below I describe my view on sorting. I hope this will allow us to take a broader look at “Her Majesty the Sorting”.

It'll be fun, let's go! 🙂

Content

  1. Base

  2. Sorting and rank

  3. Partial sorting

  4. Multi-level sorting

  5. Connecting sorted lists (optimization)

  6. Sorting stability

  7. Key vs compare function

  8. Total ordering

  9. No transitivity – trouble

  10. Topological sorting

  11. Sort by meaning

  12. Combining sorts into one common one (averaging sorts)

  13. Predicted sorting quality assessment (RAG)

  14. LLM writes sorting in VBA (data types accounting)

  15. Linearization (put objects in a row)

  16. Measure of list sorting

  17. Sets (ordered set, sorted set)

  18. Two final thoughts

1. Base

Sorting involves arranging a set of objects linearly (in a row) in ascending/descending order. Therefore, often in python or excel We don’t even think about what happens during sorting.

The simplest case is sorting strings (str), integers (int), fractional numbers (float).

String comparison (str) comes down to multi-level sorting of integers, each character of a string is reduced to its index (for example, via UTF-8). This is also called lexicographic sorting.

ord('R')  # 82. Это номер символа в таблице кодировок UTF-8
ord('Ж')  # 1046
'R' < 'Ж'  # поэтому True
'ЖRЖ' < 'ЖЖЖ'  # True, уровни сравниваются последовательно слева направо
'P' < 'PP'  # True. Особое правило: наличие любого объекта больше, чем его отсутствие

Comparison of fractional numbers (float) logically also reduces to integers. Because For a computer, there are a finite number of fractional numbers and they can be compared with integers (one-to-one).

Sorting integers is trivial. Raymond Hettinger said interestingly (youtube):

God invented integers, floats were already invented by people 🙂

It should be mentioned that sorting is in ascending order (ascending – often default)” and “descending (descending)”.

If we have N objects, then the number of sorting options is equal to N! (factorial). Considering that sorting is the placement (permutation) of objects, sometimes we will encounter combinatorial concepts in the context of sortings.

Sorting is also quite strongly connected with the concept of a binary relation on a set (binary relation, wiki). Below in the article we will meet with this area of ​​mathematics.

Concepts min/max – are logically equal to taking the first/last element of the sorted list. Although the algorithms used here are easier.

Typically the asymptotic complexity of sorting is O(N*logN). Although with a naive implementation it becomes O(N**2). The sorting algorithm is asymmetric in terms of complexity. Because You can check that a sheet is sorted in linear time O(N).

So, it's simple. Or… no, it’s quite complicated! more on that below)

2. Sorting and rank

Sometimes you can find a function that returns the rank for each object in a list – df["some columns"].rank().

Essentially, we are sorting a list. Next, we number it, usually starting with 1. If there were duplicates in the list, they will be next to each other. Further, different approaches are possible to obtain a rank, see the code below (pandas rank):

import pandas as pd
l = [4, 1, 1, 1, 3, 3, 3, 5]  # изначальный список

#   [1, 1, 1, 3, 3, 3, 4, 5]  отсортированный список
#   [1, 2, 3, 4, 5, 6, 7, 8]  нумерация начиная с 1 в отсортированном списке

pd.Series(l).rank(method='average').values.tolist()
# [7.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 8.0]  просто берем среднее всех позиций

pd.Series(l).rank(method='min').values.tolist()
# [7.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 8.0]  берем минимальное

pd.Series(l).rank(method='max').values.tolist()
# [7.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 8.0]  максимальное

pd.Series(l).rank(method='first').values.tolist()
# [7.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]  сохраняем позиции как есть

pd.Series(l).rank(method='dense').values.tolist()
# [3.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 4.0]  особая, следующая группа +1 к предыдущей

Another interesting point, in numpy no function/method rank. But it can be imitated. I didn't immediately understand why this works, see the code below:

import numpy as np

arr = np.array([4, 1, 1, 1, 3, 3, 3, 5])
np.argsort(np.argsort(arr))  # двойной argsort
# array([6, 0, 1, 2, 3, 4, 5, 7], dtype=int64)

# интересно, что это эквивалентно именно rank(method='first') в pandas
# только индексация с 0 получается

3. Partial sorting

Sorting is so common that numpy optimized function implemented np.partition. This function ensures that only the indexk will contain the correct object with t.z. sorting. Everything on the left is definitely less, and everything on the right is more. But the left and right part will not be sorted internally.

import numpy as np

arr = np.array([4, 1, 1, 1, 3, 3, 3, 5])

np.sort(arr)
array([1, 1, 1, 3, 3, 3, 4, 5])
array([0, 1, 2, 3, 4, 5, 6, 7])  # просто индекс

np.partition(arr, 6)
array([3, 1, 1, 1, 3, 3, 4, 5])
array([0, 1, 2, 3, 4, 5, 6, 7])  # просто индекс

# заметим, что по индексу 6 у нас стоит цифра 4
# это соответствует отсортированному состоянию

In some cases, it is important for us to get the TOP 5 objects, and their sorting is not important to us. This algorithm is faster than full sort.

4. Multi-level sorting

Often you need to sort sequentially across several levels. For example, in a table we first sort by age, if the ages are the same, we sort by name.

IN pandas it is as convenient as possible to do this. We specify a set of columns in sort order [Возраст, Имя]then specify the sorting method[ascending, descending].

If we work with list of tuples and we want to apply the function sorted for multi-level sorting, then do this:

# моделируем таблицу, как list of tuples
# 4 строки, 3 столбца (индекс везде с нуля)
l = [(1, 2, 3), 
     (1, 5, 8), 
     (9, 3, 7), 
     (1, 9, 66)]

# сортировка будет проводиться по лексикографическому подходу (многоуровневая)
sorted(
 l, 
 key=lambda el: (el[1], el[2]),  # сортируем сначала по столбцу с индексом 1, затем 2
 reversed=False,  # все сортировки по возрастанию!
)

# особенность: reversed настройка указывается сразу для всех уровней
# а если нам надо колонку 2 сортировать по убыванию?
# можно поставить знак минус (если тип данных числовой)
sorted(l, key=lambda el: (el[1], -el[2]), reversed=False)

# либо можно опираться на стабильность сортировки
# в этом случае делаем её в обратном порядке, двумя последовательными операциями
sorted(l, key=lambda el: el[2], reversed=True)
sorted(l, key=lambda el: el[1], reversed=False)

# в более сложных случаях я бы посмотрел в сторону pandas (если знак минус не спасает)
# либо можно использовать compare function (позволяет реализовать более сложные логики сортировки)
# compare function позволяет реализовать более сложные логики сортировки

IN Excel It is often more convenient to rely on collation stability to sort tables. IN Excel all sortings are stable. As shown in the example above, you can sort in reverse order, resulting in a multi-level sort. Although in Excel There is a built-in separate multi-level sorting tool, which is quite convenient.

5. Connecting sorted lists (optimization)

For example, we have 2 or more sorted python listsand you need to get one common one, which will also be sorted.

The straightforward solution is to simply combine the sheets and sort them again. But we don't take advantage of the fact that the sheets have already been sorted. It's not effective.

An efficient algorithm does something like this:

  1. Let's say both sheets are sorted in ascending order

  2. Create two pointers, place them at the beginning of each sheet

  3. Comparing pointers (objects from sheets with pointers on them)

  4. Adding a smaller element to a new sheet

  5. Move the pointer to the right, because this object has already been used

  6. We repeat

IN python there is also a module bisectwhich contains a number of useful functions in this context.

6. Sorting stability

Let's sort the numbers in ascending order [3, 1, 2]. It turns out [1, 2, 3]. It's simple here.

Okay, let's sort the numbers [3, 1, 1]. It turns out [1, 1, 3]. But here you need to understand that two numbers “one” are different objects. Therefore, the result, generally speaking, is not unambiguous. Units can be placed in two different ways, although this is not visible in this case – [1, 1, 3], [1, 1, 3].

A question may arise. If you can't see the difference, then why know it? The answer is simple: this is important because… Often we sort one list based on another. For example, in Excel table, if we sort on one column, we expect the entire table to be sorted. Here the difference will already be visible!

A small demonstration of sorting stability:

l = [
    (1, 'B'),
    (2, 'A'),
    (1, 'C'),
]

# ключ показывает, что хотим сортировать по элементу с индексом 0
sorted(l, key=lambda el: el[0])  

# -> [
#     (1, 'B'), # "B" идет до "С", как и в исходном списке. И это гарантируется!
#     (1, 'C'), 
#     (2, 'A')
# ]

You will notice that it now becomes important in what order the elements “B” and “C” will be. We see that they have maintained the order relative to the original list. And this is a fairly fundamental point.

If a sorting algorithm guarantees that the order of identical elements will not change, it is stable. Intuitively, everyone expects stable sorting (although rarely does anyone say so). Therefore, in high-level programming environments (excel, power query, power pivot etc.) the sorting is usually stable by default and cannot be changed. Already in python everything is not so simple, options appear.

For example, to guarantee sorting stability in pandas dataframeyou need to explicitly ask for it df.sort_values('some column', kind='stable'). Otherwise applies quick sortwhich is not stable!

sort_values ​​in pandas – NOT stable by default!

One of the algorithms that I wrote in pandas required that the sorting in front of it be stable, and out of habit I thought that this was the default behavior. I had to figure it out.

Built in python function sorted guarantees stability.

If sorting does not guarantee the preservation of order, it is unstable. Such sorting algorithms are usually faster. By the way, unstable sorts do not guarantee anything at all, i.e. order may or may not be maintained.

There are many sorting algorithms. For example, “Quick sort” is unstable, “Merge sort” is stable.

In general, the stability of sorting allows you to do interesting things:

# возьмем тестовый набор данных
l = [(1, 2, 3), 
     (2, 5, 8), 
     (2, 3, 7), 
     (2, 3, 66)]

# стабильная сортировка в данном случае ничего не меняет
sorted(l, key=lambda el: el[0])
# [(1, 2, 3), 
#  (2, 5, 8), 
#  (2, 3, 7), 
#  (2, 3, 66)]

# изменяем порядок на противоположный, только для одинаковых элементов
sorted(reversed(l), key=lambda el: el[0])
# [(1, 2, 3), 
#  (2, 3, 66), 
#  (2, 3, 7), 
#  (2, 5, 8)]

# перемешиваем только одинаковые элементы
import random
sorted(random.sample(l, k=len(l)), key=lambda el: el[0])
# [(1, 2, 3), 
#  (2, 3, 7), 
#  (2, 3, 66), 
#  (2, 5, 8)]

7. Key vs compare function

All sorting above, one way or another, came down to sorting integers. But the task of arranging objects cannot always be reduced to this.

Quite often in practice there are cases when there are no obvious clues. But there is a function that takes two objects (X, Y) and returns -1 (X меньше), 0 (X и Y равны), 1 (X больше). This function is called cmp (comparator/compare function).

Rarely in python, but the following code occurs:

from functools import cmp_to_key
# раньше я думал, что cmp_to_key какую-то магию делает
# т.е. казалось, что эта функция берет cmp и возвращает key function
# на самом деле cmp_to_key оборачивает объекты в искуственный класс
# а в методы сравнения помещает буквально нашу функцию - compare function
# т.е. это скорее просто технический момент. Чудес нет((

def my_cmp(X, Y):
    ...  # тут пишем нашу функцию сравнения

my_key = cmp_to_key(my_cmp)

sorted([...], key = my_key)  # сортировка будет сделана исходя из my_cmp логике
max([...], key = my_key)  # ключи актуальны и для min/max, конечно

Sorting via comparison function is a more complex case. Let's formulate the conditions under which this is generally possible:

  1. Any two objects from the list can be given to the function and get the result -1, 0, 1.

  2. Transitivity is satisfied (i.e. if the object A>B and B>C => A>C). An important condition that may often not be met.

    For example, three objects from a popular game: камень, ножницы, бумага – cannot be sorted by their “strength”. Because transitivity does not hold. See code below:

from functools import cmp_to_key
l = ['Paper', 'Rock', 'Scissors']  # три наши объекта

def compare_function(X, Y):  # принимаем два объекта, левый и правый
    # print(X, Y)
    if X == 'Rock':
        if Y == 'Scissors': return 1  # левый больше и т.д.
        elif Y == 'Paper': return -1  # правый больше
    elif X == 'Paper':
        if Y == 'Rock': return 1
        elif Y == 'Scissors': return -1
    elif X == 'Scissors':
        if Y == 'Paper': return 1
        elif Y == 'Rock': return -1

sorted(l, key=cmp_to_key(compare_function))  # запускаем
# ['Scissors', 'Rock', 'Paper']  # сортируется! удивляемся =)

# на самом деле python работает на основании допущения о транзитивности
# вообще говоря, он сделает только два сравнения, остальные будут выведены логически
# в этом можно убедиться поставив print в самом начале функции
# функция будет вызвана ровно 2 раза:
# Rock Paper
# Scissors Rock

# важно подчеркнуть, что Python НЕ проверяет, что транзитивность выполняется!
# Python верит вам на слово, что транзитивность выполняется :)
# Exception, raise Error НЕ будет! 
# код молча вернет первый попавшийся ответ
# И неважно, что он логически неверный

8. Total ordering

You may have seen something like this in python from functools import total_ordering.

This allows you to decorate the class and write only two comparison methods in it, and the rest will be inferred logically.

from functools import total_ordering 
  
@total_ordering
class num:  # игрушечный класс "Число"
    def __init__(self, value): 
        self.value = value 
    def __lt__(self, other):  # метод левое число больше
        return self.value < other.value 
    def __eq__(self, other):  # метод левое число равно правому
        return self.value == other.value 
          
print(num(2) < num(3))  # конечно True

It's interesting that the title total_ordering It was not taken by chance (I used to think that this was like a “total mega sorting, just in general!”). In fact, “Total order” is a set of properties of a binary relation on the set (wiki), which is needed to infer the remaining comparisons from the two data. Consider these properties:

  1. reflexive – for our purposes not very interesting

  2. antisymmetric – for any pair of objects, X and Y, exactly one of three statements is true: X is greater, Y is greater, or they are equal

  3. transitive – an important property, we met it in the section above (A>B and B>C => A>C)

  4. connected – any two objects X and Y can be compared (there are no pairs that are not compared)

The most dangerous property is transitivity. This is precisely what is often not done.

9. No transitivity – trouble

Usually, of course, there is transitivity. But in a number of special cases, it is not there. And this is not a theory, this occurs in practice:

  1. We have 1000 texts, we need to sort them from the most positive. We want to do it well and quickly. We use LLM (llama, qwen, etc.). LLM sees 2 texts at the same time and says who is more positive.

  2. The same sorting of texts by positivity using cross-encoders (bert-like neural network).

  3. We were given 1000 photos of kittens, we need to sort them somehow. It's difficult to come up with a key (although it's possible, of course). Rather, we will do pairwise comparisons (side-by-side). Logical contradictions will arise.

Moreover, it is clear that objects can be more or less sorted, but there will be errors that somehow need to be corrected. In this sense, you can sort the unsortable)

Option 1. Objects are compared sequentially. After each comparison, a transitive closure is made (transitive closure). This allows you to immediately indicate all comparisons that follow from already found comparisons. This is clearly seen in the example above – the game “rock, paper, scissors”. Python made two comparisons, the third comparison followed from them, which means it no longer needs to be done explicitly. Of course, this approach is non-deterministic (depending on the order of comparisons), but it’s better than nothing.

Option 2. You can also make a complete directed comparison graph. There will be cycles in it (a contradiction, since transitivity sometimes does not hold). The goal is to remove cycles by removing (=flipping) a minimum number of edges. This can help graph algorithm feedback_arc_set from the library igraph (unfortunately, in networkx I didn't find anything like that).

Option 3. Just take the Hamiltonian path of a complete directed graph of such comparisons. Even if there are contradictions, there will still be a path. This property tournament count (tournament graph).

The test for transitivity itself can be done like this:

Checking for transitivity (itertools, networkx)
import itertools as it
import networkx as nx

l = [3, 1, 2, 9, 5, 5, 5]  # тестовый список

# Шаг 1 (обычно тут проблем нет)
# Надо проверить, что отношения эквивалентности выполняется без противоречий
# Т.е. что равные объекты образуют полный подграф (complete graph)
# мы используем тот факт, что комплимент полного графа - пустой
# иначе ошибка, т.е. отношение эквивалентности нарушено
H = nx.Graph()
for X, Y in it.combinations(l, 2):
    if X == Y: H.add_edge(X, Y)  # добавляем ребра, если объекты равны
for cc in nx.connected_components(H):  # отдельно смотрим каждую компоненту связности
    assert nx.is_empty(nx.complement(H.subgraph(cc)))

# Шаг 2
# теперь мы можем убрать дубликаты
# строим направленный граф
l = list(set(l))
G = nx.DiGraph()  # создаем пустой направленный граф
for X, Y in it.combinations(l, 2):
    if X > Y: G.add_edge(X, Y)
    elif X < Y: G.add_edge(Y, X)
nx.draw_networkx(G)  # см. ниже скрин

# убираем все транзитивные зависимости
# должен остаться DAG, где самый длинный путь - это все наши объекты
G = nx.transitive_reduction(G)
assert nx.is_directed_acyclic_graph(G)
assert nx.dag_longest_path_length(G) == len(l) - 1
nx.draw_networkx(G)  # см. ниже скрин
before removing transitive dependencies

before removing transitive dependencies

after removing transitive dependencies

after removing transitive dependencies

10. Topological sorting

If we have DAG (directed acyclic graph), then the task may arise to arrange all its vertices into a flat list (~sort).

The result should be a list with certain properties. If we take two elements from the list X, Y (and X is to the left of Y), then either there is no directed path between them, or there is a path from X to Y, and not vice versa.

Topological sorting can be found in a variety of unusual places:

  1. When Excel It calculates formulas in cells, it does it sequentially. He needs to arrange all the cells into a flat list. Giant topological sorting helps him with this. Cells are the vertices of the graph, formula connections are the edges of the graph.

  2. When we write code Power Queryinside the structure let ... in The order of execution of lines of code is determined not from top to bottom, but by topological sorting (yes, this is unusual for code).

11. Sorting by meaning

In practice, there may be problems of sorting texts by meaning (usually short ones, for example, titles, headings, abstracts). By meaning we mean semantic vectors from Bert. Those. we must arrange the texts in some logical order from beginning to end.

Alternatively, you can do clustering (K-means, HDBSCAN etc.). But then you still need to sort the clusters themselves somehow. Plus, there is no obvious order within the cluster, it doesn’t work out neatly.

A simpler and more direct way is to reduce the dimension of vectors to 1-D (dimensionality reduction: PCA, T-SNE, TruncatedSVD) after which it will become automatically sorted. This method is a bit like reducing the dimension of vectors to 2-D for visualization purposes (scatter plot).

12. Combining sorts into one common one (averaging sorts)

Let's say there are many objects and different sortings are done on them (according to different logics). But we need to do a general final sorting.

Case Study: We Do RAG system, we select relevant text chunks for the user’s request. For each chunk there are two vectors: bert And tf-idf. That. all chunks will be sorted twice for one user request. There may be more sortings, of course. How to select chunks if there is more than one sorting?

The very first thought is to add up all the place indices for each chunk. And sort by this amount.

The second idea is to add not the place indices, but the numerical similarity estimates themselves (in the case of vectors, the cosine query-chunk distances). But here it is necessary to achieve comparability of these estimates. How to do this is not always obvious.

Another option is Reciprocal Rank Fusion. A slightly modified formula is used (the sum of the inverse place indices plus a correction factor). Interestingly, the authors in the article recommend k=60. Essentially, this coefficient regulates the strength of two ideas:

  • it is important to take first places, in other cases last places are possible

  • it is important to always take good places, even if not first

import numpy as np
from numpy.random import default_rng; rng = default_rng(seed=42)

arr = np.arange(10).reshape(1, -1)
arr = np.repeat(arr, repeats=5, axis=0)
arr = rng.permuted(arr, axis=1)
arr.shape  # 5 соревнований, 10 участников в каждом
# верхняя левая пятерка означает, 
# что в соревновании 0 участник 0 занял 5-ое место
arr  
# array([[5, 6, 0, 7, 3, 2, 4, 9, 1, 8],
#        [4, 8, 2, 6, 5, 9, 7, 3, 0, 1],
#        [0, 9, 8, 4, 7, 2, 3, 5, 6, 1],
#        [6, 1, 2, 7, 9, 5, 8, 4, 0, 3],
#        [3, 2, 6, 5, 1, 4, 8, 9, 0, 7]])

# базовый способ
arr.sum(axis=0).argsort()
# array([8, 0, 2, 9, 5, 4, 1, 3, 6, 7], dtype=int64)
# самый лучший по итогу - участник с индексом 8

k = 200  # обычно 60 ставят
# k позволяет балансировать между двумя условными подходами:
# занял первое место - остальное не важно
# лучше занять два раза второй место

(arr / (1 + k)).sum(axis=0).argsort()  # результат немного другой
# array([8, 2, 0, 9, 5, 4, 1, 3, 6, 7], dtype=int64)

13. Assessing the quality of predicted sorting (RAG)

When we build RAG we need to evaluate the quality of the search for relevant parts of the text (chunks).

For example, we have a reference sorting of the first 10 documents relevant to a specific query. We can compare them with our search results.

We calculate the comparison using the MAP@K formula (see website evidentlyai). It allows you to calculate a sorting quality metric taking into account the number of objects in the correct places and taking into account the “importance” of positions (it is more important to guess the first position than the second, etc.).

import numpy as np
from numpy.random import default_rng; rng = default_rng(seed=42)

arr = np.arange(10).reshape(1, -1)
arr = np.repeat(arr, repeats=5, axis=0)
arr = rng.permuted(arr, axis=1)
arr  # дано пять сортировок (предсказания модели). Каждая сортировка 10 объектов
# array([[5, 6, 0, 7, 3, 2, 4, 9, 1, 8],
#        [4, 8, 2, 6, 5, 9, 7, 3, 0, 1],
#        [0, 9, 8, 4, 7, 2, 3, 5, 6, 1],
#        [6, 1, 2, 7, 9, 5, 8, 4, 0, 3],
#        [3, 2, 6, 5, 1, 4, 8, 9, 0, 7]])
# верхняя левая пятерка означает, что первая сортировка считает главным 
# документ с индексом 5

ground_true = rng.permuted(arr, axis=1)
ground_true  # допустим, это корректные результаты, корректные пять сортировок
# array([[5, 2, 1, 6, 0, 9, 8, 4, 3, 7],
#        [6, 3, 5, 1, 8, 0, 7, 2, 4, 9],
#        [0, 3, 2, 4, 7, 6, 1, 8, 9, 5],
#        [4, 8, 7, 0, 1, 6, 5, 2, 3, 9],
#        [3, 6, 1, 8, 7, 2, 5, 0, 9, 4]])

# попробуем вычислить качество предсказаний модели
temp = arr == ground_true
temp2 = np.repeat(np.arange(1, 10+1).reshape(1, -1), repeats=5, axis=0)
scores = np.mean(temp * temp.cumsum(axis=1) / temp2, axis=1)

scores  # отдельная оценка по каждой из пяти сортировок
# array([0.1       , 0.01428571, 0.21      , 0.        , 0.1       ])
# лучшая сортировка с индексом 2

14. LLM writes sorting in VBA (accounting for data types)

IN VBA (MS Excel) there is no sorting (or rather, there is, if you use ArrayListbut this external library). You can also write the range onto a sheet, sort it, and then read it back. But it's an incredible crutch.

You have to write it yourself VBA. This is very frustrating( I asked gpt o1-preview to do this for me.

The promt was like this:

We write VBA code to process table data.
Table are modeled as collection of dictionaries in VBA. Nested dictionaries have key as column names. Dictionaries values ​​are table row items.
You task is to write VBA function which takes table as an argument and execute multilevel sorting (on 2 or more columns, ascending or descending). Columns may have different data types. Column may have mixed data type. Columns always have scalar types in them (no arrays, collections etc.)
Function should return sorted table in the form of collection of dictionaries.
Sorting must be stable.

GPT's response
'Certainly! Below is a comprehensive VBA solution that addresses your requirements for sorting a table represented
'as a `Collection` of `Dictionaries`. The function `MultiSortTable` allows multi-level sorting on specified columns
'with ascending or descending orders while ensuring stability.
'
'### Overview
'
'1. **Data Structure**:
'    - **Table**: A `Collection` where each item is a `Scripting.Dictionary` representing a row.
'Each dictionary has key-value pairs where keys are column names and values are the corresponding cell data.
'    - **Sort Criteria**: An array of sort specifications, where each specification is a two-element array containing
'the column name and the sort order (`"asc"` or `"desc"`).
'
'2. **Approach**:
'    - **Copy to Array**: Convert the `Collection` to a VBA array for efficient indexing.
'    - **Merge Sort**: Implement a stable merge sort tailored for multi-level sorting with mixed data types.
'    - **Comparison Logic**: Define a robust comparison method that handles different data types and respects the sort order.
'    - **Reconstruct Collection**: Create a new sorted `Collection` from the sorted array.
'
'### Implementation

'Option Explicit

' Function to perform multi-level stable sort on a table represented as a Collection of Dictionaries

Function MultiSortTable(table As Collection, sortCriteria As Variant) As Collection
    ' sortCriteria is expected to be a 1-based array where each element is a 2-element array:
    ' sortCriteria(i)(0) = column name (String)
    ' sortCriteria(i)(1) = sort order ("asc" or "desc")
    
    Dim rowCount As Long
    rowCount = table.Count
    If rowCount <= 1 Then
        ' No sorting needed
        Set MultiSortTable = table
        Exit Function
    End If
    
    ' Copy Collection to Array for sorting
    Dim rows() As Variant
    ReDim rows(1 To rowCount)
    Dim i As Long
    For i = 1 To rowCount
        Set rows(i) = table(i)
    Next i
    
    ' Perform Merge Sort
    MergeSort rows, LBound(rows), UBound(rows), sortCriteria
    
    ' Copy sorted array back to Collection
    Dim sortedTable As New Collection
    For i = 1 To rowCount
        sortedTable.Add rows(i)
    Next i
    
    Set MultiSortTable = sortedTable
End Function

' Merge Sort Implementation
Private Sub MergeSort(arr() As Variant, left As Long, right As Long, sortCriteria As Variant)
    If left < right Then
        Dim mid As Long
        mid = (left + right) \ 2
        MergeSort arr, left, mid, sortCriteria
        MergeSort arr, mid + 1, right, sortCriteria
        Merge arr, left, mid, right, sortCriteria
    End If
End Sub

' Merge procedure for Merge Sort
Private Sub Merge(arr() As Variant, left As Long, mid As Long, right As Long, sortCriteria As Variant)
    Dim n1 As Long, n2 As Long
    n1 = mid - left + 1
    n2 = right - mid
    
    ' Create temporary arrays
    Dim L() As Variant, R() As Variant
    ReDim L(1 To n1)
    ReDim R(1 To n2)
    
    Dim i As Long
    For i = 1 To n1
        Set L(i) = arr(left + i - 1)
    Next i
    For i = 1 To n2
        Set R(i) = arr(mid + i)
    Next i
    
    ' Merge the temporary arrays back into arr
    Dim j As Long, k As Long
    i = 1
    j = 1
    k = left
    
    Do While i <= n1 And j <= n2
        If CompareRows(L(i), R(j), sortCriteria) <= 0 Then
            Set arr(k) = L(i)
            i = i + 1
        Else
            Set arr(k) = R(j)
            j = j + 1
        End If
        k = k + 1
    Loop
    
    ' Copy any remaining elements of L
    Do While i <= n1
        Set arr(k) = L(i)
        i = i + 1
        k = k + 1
    Loop
    
    ' Copy any remaining elements of R
    Do While j <= n2
        Set arr(k) = R(j)
        j = j + 1
        k = k + 1
    Loop
End Sub

' Function to compare two rows based on sort criteria
' Returns:
'   -1 if row1 < row2
'    0 if row1 = row2
'    1 if row1 > row2
Private Function CompareRows(row1, row2, sortCriteria As Variant) As Long
    Dim i As Long
    For i = LBound(sortCriteria) To UBound(sortCriteria)
        Dim colName As String
        Dim sortOrder As String
        colName = sortCriteria(i)(0)
        sortOrder = LCase(sortCriteria(i)(1))
        
        Dim val1 As Variant, val2 As Variant
        val1 = row1(colName)
        val2 = row2(colName)
        
        Dim cmp As Long
        cmp = CompareValues(val1, val2)
        
        If cmp <> 0 Then
            If sortOrder = "asc" Then
                CompareRows = cmp
            ElseIf sortOrder = "desc" Then
                CompareRows = -cmp
            Else
                Err.Raise vbObjectError + 1, "CompareRows", "Invalid sort order: " & sortOrder
            End If
            Exit Function
        End If
        ' If equal, continue to next sort key
    Next i
    ' All sort keys equal
    CompareRows = 0
End Function

' Function to compare two values of potentially different types
' Returns:
'   -1 if val1 < val2
'    0 if val1 = val2
'    1 if val1 > val2
Private Function CompareValues(val1 As Variant, val2 As Variant) As Long
    ' Handle cases where either value is missing or Null
    If IsMissing(val1) Or IsNull(val1) Then
        If IsMissing(val2) Or IsNull(val2) Then
            CompareValues = 0
            Exit Function
        Else
            CompareValues = -1
            Exit Function
        End If
    ElseIf IsMissing(val2) Or IsNull(val2) Then
        CompareValues = 1
        Exit Function
    End If
    
    ' Determine data types
    Dim type1 As Long, type2 As Long
    type1 = VarType(val1)
    type2 = VarType(val2)
    
    ' If same type, compare accordingly
    If type1 = type2 Then
        Select Case type1
            Case vbInteger, vbLong, vbSingle, vbDouble, vbCurrency, vbByte, vbDecimal
                If val1 < val2 Then
                    CompareValues = -1
                ElseIf val1 > val2 Then
                    CompareValues = 1
                Else
                    CompareValues = 0
                End If
            Case vbDate
                If val1 < val2 Then
                    CompareValues = -1
                ElseIf val1 > val2 Then
                    CompareValues = 1
                Else
                    CompareValues = 0
                End If
            Case vbString, vbFixedString, vbWideString
                CompareValues = StrComp(CStr(val1), CStr(val2), vbTextCompare)
            Case vbBoolean
                If val1 = val2 Then
                    CompareValues = 0
                ElseIf val1 = False Then
                    CompareValues = -1
                Else
                    CompareValues = 1
                End If
            Case Else
                ' For other types, convert to string and compare
                CompareValues = StrComp(CStr(val1), CStr(val2), vbTextCompare)
        End Select
    Else
        ' Different types: define a type precedence
        ' Define precedence: Number < Date < String < Boolean < Others
        Dim precedence1 As Long, precedence2 As Long
        precedence1 = TypePrecedence(type1)
        precedence2 = TypePrecedence(type2)
        
        If precedence1 < precedence2 Then
            CompareValues = -1
        ElseIf precedence1 > precedence2 Then
            CompareValues = 1
        Else
            ' Same precedence, attempt to compare as strings
            CompareValues = StrComp(CStr(val1), CStr(val2), vbTextCompare)
        End If
    End If
End Function

' Function to define type precedence for mixed type comparison
Private Function TypePrecedence(vbType As Long) As Long
    Select Case vbType
        Case vbInteger, vbLong, vbSingle, vbDouble, vbCurrency, vbByte, vbDecimal
            TypePrecedence = 1
        Case vbDate
            TypePrecedence = 2
        Case vbString, vbFixedString, vbWideString
            TypePrecedence = 3
        Case vbBoolean
            TypePrecedence = 4
        Case Else
            TypePrecedence = 5
    End Select
End Function

'### How It Works
'1. Function MultiSortTable:
'    - Input:
'        - table: The table to sort, represented as a Collection of Dictionaries.
'        - sortCriteria: An array where each element is a two-element array specifying the column name and sort order.
'    - Process:
'        - Converts the Collection to an array for easier manipulation.
'        - Applies a stable merge sort to the array based on the provided sort criteria.
'        - Reconstructs and returns a new sorted Collection.
'
'2. Merge Sort (MergeSort and Merge):
'    - A recursive implementation that divides the array and merges them back in sorted order.
'    - Utilizes the CompareRows function to determine the order between two rows based on multiple sort levels.
'
'3. Comparison Logic (CompareRows and CompareValues):
'    - CompareRows: Iterates through each sort criterion and compares the corresponding values in two rows.
' If a difference is found, it returns the comparison result adjusted for ascending or descending order.
'    - CompareValues: Handles the comparison between two individual cell values, considering their data types. It ensures that different types are compared in a consistent and predictable manner by defining a type precedence.
'    - TypePrecedence: Assigns a precedence value to different data types to facilitate mixed-type comparisons.
'
'### Usage Example
'
'Here 's how you can use the MultiSortTable function in your VBA project:

'VBA

Sub ExampleUsage()
    Dim table As New Collection
    
    ' Example: Creating sample data
    Set row = CreateObject("Scripting.Dictionary")
    row.Add "Name", "Alice"
    row.Add "Age", 30
    row.Add "JoinDate", #1/15/2020#
    table.Add row
    
    Set row = CreateObject("Scripting.Dictionary")
    row.Add "Name", "Bob"
    row.Add "Age", 25
    row.Add "JoinDate", #3/22/2021#
    table.Add row
    
    Set row = CreateObject("Scripting.Dictionary")
    row.Add "Name", "Charlie"
    row.Add "Age", 30
    row.Add "JoinDate", #2/10/2019#
    table.Add row
    
    ' Define sort criteria: first by Age ascending, then by JoinDate descending
    Dim sortCriteria(1 To 2) As Variant
    sortCriteria(1) = Array("Age", "asc")
    sortCriteria(2) = Array("JoinDate", "desc")
    
    ' Perform multi-level sort
    Dim sortedTable As Collection
    Set sortedTable = MultiSortTable(table, sortCriteria)
    
    ' Display sorted results
    For Each sortedRow In sortedTable
        Debug.Print sortedRow("Name") & ", " & sortedRow("Age") & ", " & sortedRow("JoinDate")
    Next sortedRow
End Sub

'Expected Output:
'Bob , 25, 3 / 22 / 2021
'Alice , 30, 1 / 15 / 2020
'Charlie , 30, 2 / 10 / 2019

'In this example:
'- The table is first sorted by Age in ascending order.
'- For rows with the same Age, it further sorts them by JoinDate in descending order.
'- The sorted output maintains stability, meaning that the original order is preserved where criteria are equal.
'
'### Notes
'- Error Handling: The function raises an error if an invalid sort order is provided.
'Ensure that sort orders are specified correctly as "asc" or "desc".
'- Performance: While this implementation is suitable for small to moderately sized datasets,
'performance might degrade with very large tables due to the nature of VBA and the merge sort implementation.
'- Data Types: The comparison logic handles common data types.
'If your data includes more complex types, you may need to extend the CompareValues function accordingly.
'
'Feel free to integrate and modify this code to better fit the specific nuances of your project!

A few interesting points:

  1. GPT uses the approach compare function (comparison objects are table rows). This is a universal approach. Using key here will be almost impossible, because column data types are not known in advance

  2. The code was 99% working (Scripting.Dictionary was declared incorrectly)

  3. The code took into account multi-level sorting and ascending/descending parameters

  4. The sorting, as I asked, was stable (Merge sort algorithm)

  5. GPT tried to work out the situation of different data types. It didn't turn out very well, but overall it was logical

The issue of comparing objects with different data types is a complex one. Let me remind you that the main VBA data types are:

  • string – strings of any length

  • single – fractional units of single precision

  • double – unit precision fractions

  • date – dates

  • integer – whole

  • long – long integer

  • and a number of others

GPT allowed objects to be compared only if they were strictly the same data type (e.g. double only with double). But, for example, integer and long also seem to be comparable without anomalies, it was possible to compare them. It was also possible to compare different precisions of fractional numbers and fractional-integer ones, but there may be some peculiarities here. Python I could do this too True > 0.5. If I were writing, I would try to take this into account, because… intuitively a person will expect this functionality.

For example, if in pandas series will int And float objects, they will be compared exactly like numbers.

If the data types do not match, GPT simply established some, pre-known order of the data types (All number types < Date < String < Boolean < Others).

And if the data types do not match, but are at the same priority level (the order of the data types themselves), then everything is converted to strings (str) is forced and tries to compare. Those. at two sites 21 Single And 3 Double we will get a completely illogical result (since "3" > "21").

Therefore, when sorting, it is important to pay attention to the types of data that may end up in your list and how we will compare them.

15. Linearization (put objects in a row)

Often in different areas the task arises to iterate/traverse all objects. It's not exactly sorting, but it's similar.

Areas where this can be found:

  1. Graph Traversals (DFS, BFS)

  2. Tree traversals (pre-order, in-order, post-order)

  3. MRO (method resolution order) V python determines the order of method calls in classes with multiple inheritance (algorithm С3)

16. List sorting measure

Let's take a list A = [1, 2, 3, 4, 5]. It's sorted. Contains 5 elements. Number of permutations 5! = 120.

Now let's take two other lists B = [2, 1, 3, 4, 5] And C = [5, 4, 3, 2, 1]. It's obvious that B more orderly compared to A. But how can this be measured numerically?

Remembering that sorting is essentially a permutation, we can find out the index of the permutation. This is possible because All 120 permutations are numbered. Although there are different approaches to numbering the 120 permutations, we use the classic one.

import more_itertools as mit  # мега крутая библиотека, всем рекомендую!

l = [1, 2, 3, 4, 5]

import math; math.factorial(5)
# 120 - всего 120 (индексы от 0 до 119)

mit.permutation_index(l, [1, 2, 3, 4, 5])
# 0 - понятно, что сам список имеет индекс 0 (самый первый)

mit.permutation_index(l, [2, 1, 3, 4, 5])
# 24 - чуть-чуть перемешали, недалеко от 0

mit.permutation_index(l, [5, 4, 3, 2, 1])
# 119 - перемешали сильно, по сути максимально (хотя этот термин и не определен строго)
# максимально далеко от 0

In this sense, the sorting index can serve as a measure of the distance between two permutations, in our case, between an ordered state and a jumbled state. More details about the combinatorial properties of permutations can be found in the documentation sympy.

17. Sets (ordered set, sorted set)

Python has a great data structure – the set (set). There is also a multi-set in the form collections.Counter.

The only thing I've always been missing is the presence of ordered sets and operations between them (which guarantee that the ordering/sorting is preserved). Yes, you can implement it yourself. But I want something ready-made, proven. Here's what I usually use in such cases:

# выручает pandas со своими индексами
# там реализованы базовые difference, intersection, union
# и они сохраняют порядок! ровно то, что нужно
import pandas as pd
pd.Index([1, 2, 3, 5, 6]).difference(pd.Index([2, 3]))
# Index([1, 5, 6], dtype="int64")  сохранение порядка НЕ случайно. Это гарантируется

# есть и другие способы, в целом выдают ожидаемый результат
# pip install ordered-set
# pip install boltons
# pip install collections-extended

18. Two final thoughts

  1. Mathematics is everywhere (we touched on several of its sections above)

  2. Sometimes you have to delve into the details (sorting had to be “written” in VBA). Although before I often heard the thesis that this would not be needed in real work

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *