An algorithm for building fractional indices

In this article, I will walk through the process of developing and optimizing an algorithm for building fractional indices using simple logical reasoning. Along the way, we'll delve into the details of the algorithm and its possible applications, touch on optimizing index size in edge cases, and look at how to modify the algorithm to support simultaneous use by many clients.

Problem statement

The goal is to keep records sorted while disturbing the existing sequence as little as possible. Consider a scenario in which a collection of rows is ordered by an index field, and we want to move an existing row or insert a new one without touching the other records.

Why might we need this?

This problem arises in a variety of applications, especially in the context of cloud databases. In environments where every row update incurs a financial cost, minimizing the number of modified rows becomes an obvious goal. Renumbering all rows after each reordering introduces significant overhead, so a more efficient approach is worth looking for.

The problem is even more acute in peer-to-peer systems, such as collaborative text editors. Here, renumbering all rows can cause conflicts during synchronization with other nodes, which in turn requires a conflict-resolution strategy. By touching only the rows directly affected by user actions, we reduce the chance of conflicts and improve system stability.

The question arises: is it possible to achieve such sorting with minimal disruption to the existing sequence?

The idea behind the algorithm

The concept of fractional indexing is, of course, a simple and intuitive solution to our problem. Imagine writing down a list on paper and realizing that you missed one item. Instead of renumbering the entire sequence, you can simply write the missing item between the lines and give it a fractional index, such as 1.5.

This approach is the basis of fractional indexing, but its direct application is hampered by the limits of floating-point representation. Floating-point numbers have finite precision, so we cannot halve intervals indefinitely; we would quickly hit those limits.

To generalize this concept, we introduce unreachable bounds for the index range. Assuming the rows are sorted in ascending index order, we take zero as the unreachable lower bound and one as the unreachable upper bound.

We will calculate each new index as the arithmetic mean of its two nearest neighbors. For example, to insert a row into an empty list, we use both unreachable bounds, and the first index is (0 + 1)/2 = 0.5.
Similarly, when inserting a row before the first one, the new index is the midpoint between the unreachable lower bound and the index of that row: (0 + 0.5)/2 = 0.25.
Inserting between two existing rows takes the average of their indices, which in this case gives (0.25 + 0.5)/2 = 0.375.
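
As a minimal sketch, this rule can be expressed with ordinary floats (enough for a demonstration, though subject to the precision limits mentioned above; the function name is mine):

```python
def between(lo: float = 0.0, hi: float = 1.0) -> float:
    """Midpoint insertion; 0.0 and 1.0 act as the unreachable bounds."""
    return (lo + hi) / 2

first = between()               # (0 + 1) / 2    = 0.5   — insert into an empty list
second = between(0.0, first)    # (0 + 0.5) / 2  = 0.25  — insert before the first row
third = between(second, first)  # (0.25 + 0.5)/2 = 0.375 — insert between the two
```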

Taking a closer look at the resulting indices, we notice that the leading "0." is common to all of them and can be dropped. Moreover, if you think of the index tails as strings or byte arrays, you will find that they are ordered lexicographically. This observation lets us go beyond ordinary numeric indices and use extended alphabets, such as base64 or even arbitrary bytes, provided that our application or database supports lexicographic sorting of such arrays.

Inserting between two indices

How do we calculate a new index between two existing ones? As an example, consider the byte arrays P_1 = [0A\ 58] and P_2 = [7B\ CD\ F2]. Our approach treats them as rational numbers, for which trailing zeros do not affect the value: 0.1 and 0.100 are the same number. This allows us to align the lengths of the indices by appending zeros where necessary.

Aligning the array lengths essentially multiplies our rational numbers by a common base so that they become integers. Treating the length-aligned indices as big integers, we can then take their arithmetic mean:

P_n = \frac{P_1+P_2}{2} = \frac{P_1}{2} + \frac{P_2}{2} = (P_1 \gg 1) + (P_2 \gg 1)

As the formula shows, it is enough to implement two simple operations on byte arrays: addition and a right shift by one bit. Both are easy to implement for arbitrary byte sequences: process them byte by byte, carrying the overflow into the next byte. It is also important to note that there is no need to keep the entire number. As soon as we reach a byte that differs from the corresponding bytes of both P_1 and P_2, all subsequent bytes become insignificant and can be discarded.

Applying this method to our example produces the new index:

P_n = ([0A\ 58\ 00] \gg 1) + ([7B\ CD\ F2] \gg 1) = [43\ 12\ F9] \approx [43]
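
A compact Python sketch of the whole procedure (for brevity it uses arbitrary-precision integers instead of hand-rolled byte-wise addition and shifting, but the truncation rule is exactly the one described above):

```python
def midpoint(p1: bytes, p2: bytes) -> bytes:
    """Return an index between p1 and p2 (assumes p1 < p2 lexicographically)."""
    n = max(len(p1), len(p2)) + 1   # one spare byte so halving loses nothing
    a = p1.ljust(n, b"\x00")        # trailing zeros do not change the value
    b = p2.ljust(n, b"\x00")
    mid = (int.from_bytes(a, "big") + int.from_bytes(b, "big")) >> 1
    m = mid.to_bytes(n, "big")
    out = bytearray()
    for i in range(n):
        out.append(m[i])
        # Once a byte differs from both neighbours, the rest is insignificant.
        if m[i] != a[i] and m[i] != b[i]:
            break
    return bytes(out)

print(midpoint(bytes.fromhex("0a58"), bytes.fromhex("7bcdf2")).hex())  # "43"
```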

Estimating memory usage

The worst case for the algorithm occurs when new rows are continually inserted at the same position. Each insertion narrows the range between the neighboring indices, so the index grows longer. But how fast does it grow?

After implementing the algorithm with byte arrays and performing 10,000 insertions at the beginning of the list, I found that the maximum index size reached 1250 bytes, that is, 10,000 bits. Thus, each insertion increases the length of the index by only one bit.
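
This is easy to reproduce with the midpoint sketch above, using the empty array as the unreachable lower bound:

```python
index = b"\x80"                   # first index, the middle of the range
for _ in range(10_000):
    index = midpoint(b"", index)  # keep inserting before the first row
print(len(index))                 # ~1250 bytes, i.e. about one bit per insertion
```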

This is an excellent result: one bit is the minimum amount of information, and it is hard to do better. In fact, almost all descriptions of the algorithm that I found on the Internet stop here. However, the edge cases deserve separate attention, especially insertions at the very beginning or end of the list. In those situations one of the boundaries is open, which creates an opportunity for optimization.

For example, imagine a peer-to-peer text editor, something like Notepad, except that the lines are ordered by fractional indices. Every time we add a new line, some index grows by one bit. For insertions in the middle there is nothing we can do about it. But when writing text, appending lines at the very end is the more likely and natural operation. So, by optimizing the addition of a new index at the end of the list, we can reduce the overhead of storing it.

Inserting at the end of a list

Let's consider a simple case in which the last index of our list is P_1 = [80\ AB] and we want to generate the next index P_n. Using the previous algorithm, we would get P_n = [C0], the midpoint between P_1 and the unreachable upper bound. On closer inspection, however, this is too big a step: it burns through the remaining range much faster than necessary. Instead, we can simply increase the first byte by one, obtaining [81].

Given that the first index added lands exactly in the middle of the range, this observation gives us roughly 127 insertions at the end (or beginning) of the list per first byte of the index, which corresponds to roughly 0.06 bits per insertion.

Moreover, thanks to the trailing-zeros property of rational numbers, we can treat all missing bytes as zero, which allows an additional 255 insertions for each subsequent byte of the index. This corresponds to approximately 0.03 bits per insertion.

It is important to note that with this modification the index grows by one byte only when a byte reaches its boundary value (FF for insertion at the end, 00 for insertion at the beginning). In other words, the more slowly we approach these boundary values, the less often a new byte has to be added.

Thus, we can increase efficiency dramatically by using byte pairs for the increments. This approach achieves an impressive 0.0001 bits or so per new index. The key point is to find the first byte that is less than FF and use it together with the byte after it as a 16-bit counter for the increment.
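
A sketch of this optimized append (the byte-pair variant; the function name and details are my illustration rather than a canonical implementation):

```python
def key_after(last: bytes) -> bytes:
    """Next index after `last` for appending at the end of the list."""
    for i, byte in enumerate(last):
        if byte < 0xFF:
            # Treat bytes i and i+1 as one big-endian 16-bit counter;
            # missing bytes count as zero, trailing bytes are dropped.
            pair = int.from_bytes(last[i:i + 2].ljust(2, b"\x00"), "big")
            return last[:i] + (pair + 1).to_bytes(2, "big")
    # Every byte is already FF: the index has to grow by one byte.
    return last + b"\x01"

print(key_after(bytes.fromhex("80ab")).hex())  # "80ac" instead of "c0"
```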

It turns out that edge cases of insertion can be handled much more efficiently compared to the basic algorithm. However, this optimization comes at the cost of a couple of leading index bytes.

For our Notepad example, this means that indices grow by one bit when inserting into the middle of the text, while adding a line at the end or the beginning becomes practically free.

Another example where this approach works well is an application with a list of tasks or priorities. Imagine a list of three tasks that you constantly rearrange in random order. Without the optimization, each move costs you one bit of index. But since there are only three tasks, the probability that one of them is moved to the very beginning or end is quite high, and this optimized case automatically truncates the index back to its original byte size.

Concurrent changes

Another topic that I would like to touch upon is the simultaneous editing of a list by several users. What happens if two independent clients try to insert a row in the same place at the same time?

Let's look at the simplest scenario with a list of two rows: P_1 = [05\ 12] and P_2 = [07\ 0A]. Suppose two independent clients try to insert a new row between P_1 and P_2.

According to our algorithm, given the same input, both clients will produce the same inserted index: P_a = P_b = [06]. Two problems arise here. First, the row order becomes undefined: since the indices are identical, there is no way to tell which row should come first. Second, and more importantly, identical indices leave us no way to insert anything between them.

To solve this problem, we need to make the generated indices unique. Here we can exploit a useful property of our indices: as long as the existing values are unique, appending any suffix to any index does not change the relative order of the rows. We can therefore introduce a small unique suffix for each client, ensuring that the generated values are unique.

Such unique suffixes can be generated either every time a record is created or once per application launch. The suffix length can be chosen to balance the probability of a collision against the extra memory required.
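
For example, with the midpoint sketch from earlier and two client suffixes (the concrete byte values here are made up for illustration):

```python
base = midpoint(bytes.fromhex("0512"), bytes.fromhex("070a"))
print(base.hex())            # "06" on both clients
idx_a = base + b"\x51\x3a"   # client A appends its unique suffix
idx_b = base + b"\x62\x07"   # client B appends a different one
# Both indices still sort between P_1 and P_2, and are now distinct:
assert bytes.fromhex("0512") < idx_a < idx_b < bytes.fromhex("070a")
```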

For example, in one of my projects I implemented the generation of a unique 6-byte suffix using the following rules:

Bit number | 0 | 1 | 2…22         | 23…47
Meaning    | 0 | 1 | Random value | Current time in seconds

  • The first two bits are the constants 0 and 1. This rules out degenerate suffixes consisting only of zeros or only of ones, which, as we saw earlier, can significantly inflate the index length at the beginning or end of the list.

  • The next 21 bits are random. Placing the random part before the timestamp slightly reduces the expected index length, because when a row is inserted between two existing ones we cut off the tail as soon as we hit the first differing byte.

  • The last 25 bits are a truncated Unix timestamp. It wraps around roughly once a year, but it significantly reduces the likelihood of duplicate suffixes, since I generate the suffix once per application launch. A sketch of this layout follows below.
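
Under those rules, suffix generation might look like this (my reconstruction of the layout, not the project's actual code):

```python
import os
import time

def unique_suffix() -> bytes:
    """48-bit suffix: constant bits 01, then 21 random bits,
    then a Unix timestamp truncated to 25 bits (~1-year cycle)."""
    rand21 = int.from_bytes(os.urandom(3), "big") & ((1 << 21) - 1)
    ts25 = int(time.time()) & ((1 << 25) - 1)
    value = (0b01 << 46) | (rand21 << 25) | ts25
    return value.to_bytes(6, "big")

SUFFIX = unique_suffix()  # generated once per application launch
```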

Conclusion

Thus, we have developed an algorithm that sorts records efficiently with minimal changes to the list. We optimized insertions at the beginning and end of the list, ensuring a near-minimal index size for those operations. We also considered how to guarantee index uniqueness to handle simultaneous editing by multiple clients.
