Sparse Data Structures

Thin skies over Diana and Bojack Horseman

I once wrote a post about various interesting data structures. Among them was the so-called. sparse set. There we described it in general terms, omitting some details (with which the article was later supplemented). But besides sparse set, there are other sparse data structures! Let’s take a look at them today 🙂

Sparse Array (Sparse Matrix)

When we talk about sparse arrays and matrices, we most often mean not storing elements with some value (for example 0 or None/null). For example, let’s look at some vector (in the mathematical sense):

$A = \begin{pmatrix} 1 & 0 & 0 & 4 & 0 & 2 & 0 & 0 \end{pmatrix}$

Or to the matrix:

$B = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 4 & -2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{pmatrix}$

Like in vector Aand in the block matrix IN, most elements are zero. We can save money in such cases by storing only non-zero values. The most understandable and simple solutions are to transform it into an associative array:

$f: i -> value, for\ array\\ f: (i, j) -> value, for\ matrix” src=”https://habrastorage.org/getpro/habr/upload_files/0d1/084/5b8/0d10845b874af9daba4fa0744bbf8ccc.svg” width=”697″ height=”59″/></p><p>Or transform into several arrays: in the case of a sparse array – one for the coordinate and one for the value; and in the case of a sparse matrix, two for coordinates and one for value. For example, for the matrix above, such a transformation might look like this:</p><p><img data-lazyloaded=$

Or transform it into a list (single/doubly linked), where the node contains all the coordinates and values at once. It doesn’t matter. The concept is the same everywhere.

Such approaches help save time and memory when calculating with some types of matrices: for example, in the case of block/sometimes triangular/diagonal. There are ready-made packages for python that implement this logic (scipy.sparse).

There are certainly no catchy ideas here. But it gets more interesting!

Sparse List

Sparse list is sparse in another sense. In which? I won’t tell you yet.

Let’s look at a doubly linked list. We can do modifying operations (insertion, deletion) in O(1) and non-modifying (bypass in any direction) for O(n). And for all this we need only one node of the list (no need to look for a neighboring one, as in the case of a singly linked list. On the other hand, we are forced to store two pointers for each node (in modern realities this is most often as many as 16 bytes!). Is it possible somehow not significantly degrade the speed of operation, while reducing memory consumption?

Yes! What if we made a sesquilinked list? Well, that is, its connectivity is something between one and two. We can achieve this by having an average singly linked list, in which some nodes will be doubly linked.

That is, we have a certain subset of nodes with two pointers, one of which points to the previous doubly connected node, and the other to the next node in the list. Because the size of a simply connected “sublist” is finite and fixed in advance, any operation still takes O(1), just like a doubly linked list (of course, with a greatly increased constant). A thoughtful reader may notice that you can make insertions into one simply connected sublist and all our asymptotics will break! But no one is stopping us from maintaining the size of these simply connected sublists no more than a certain size (in picture 3, if we count the doubly connected nodes at the ends). That is, if you have already inserted a singly connected node after a singly connected one, then make it doubly connected and correct the links of the doubly connected neighbors.

Now, of course, you need a little more squatting in the implementation of each of the operations, but still you do not need to start from the very beginning of your list to find the previous node for some known one. You just need to go forward and find the first doubly connected node from which you can jump back.

Actually, due to the fact that your doubly connected nodes are located at some distance from each other, it is called sparse.

How to choose this “sparseness coefficient”? There is no good answer. Empirically, you can guess that writing such a structure is much simpler (and the constant is smaller) if the nodes simply alternate. It will be especially cheaper to traverse the list in reverse order if doubly connected nodes are more common.

How can you save memory? From a coding perspective, you might want to use something like union/std::variant, but such types will have the size of the largest of the objects. That is, in the end we still have a solution comparable to a doubly linked list.

Unfortunately, it’s not pretty. In any case, you need to be able to work with nodes of various sizes (type-erasure in some form can help here). Plus, you need to place the pointer to the previous node last in the structure, so that it is more convenient to take the pointer to the next node/value from the list. But the task of determining what type a node is is not difficult. Because pointers are memory aligned, you always have some zero least significant bits. Just at the pointer to the next element, you can take one of these bits and set it to 1 if the node also has a reverse pointer (it is important not to forget to set the bit to zero before using this pointer to the next node). And problems also begin when you want to move a node from one list to another: since it can change its type, you will need to free/allocate memory again. If you wish, you can not store values at all in doubly connected nodes, but the problems almost never go away + it’s not clear how much we’re saving anything here at all.

Here link for implementation, if it’s very interesting to poke around.

Sparse Deque (sparque)

Sparse deque recently posted one of the reddit users. We’ll go over some basics to understand how it works.

First, we need to understand what a B-tree is (you can read it here). A B-tree, roughly speaking, is an m-ary search tree. That is, each node can store several values and, accordingly, several children. This allows you to do standard operations for binary search trees, but with a larger logarithm. Data locality also helps to traverse elements faster. Of course, although the depth of the tree becomes smaller, we should not forget about the emerging constant associated with a larger number of comparisons. It looks something like this:

Next we have B+-trees. These are B-trees that imitate an associative array. They store values only in the leaves, and in the remaining vertices only copies of the keys are stored (it turns out something like an associative array):

The leaves of B+ trees typically store a pointer to the next leaf, allowing you to traverse all elements of your tree in linear time without going up or down the tree.

And the last piece is counted B-tree – B-tree, in which for each child we store the size of its subtree:

Here sparque is a counted B+ tree. With the caveat that if in a canonical B+ tree we have both keys and values, here we store only the values (in the leaves). In the remaining vertices we only have the sizes of the child subtrees and the actual pointers to these subtrees themselves. So at the bottom level (leaves), our data structure looks like a list of blocks, each of which stores several values. Why not deque! True, each block in a sheet is something like an array into which you can effectively insert from both ends, so now some operations become faster.

Here they stand like this:

access by index O(log_m(n)) – since now we, using the dimensions of the subtrees, need to go down from the root of the tree to its leaf. It’s longer than standard std::deque with him O(1)but is in some way a tradeoff to speed up other operations;
the author states that insertion/deletion at the beginning/end is O(1). It sounds doubtful, because, in addition to insertion, you also need to recalculate all the sizes of subtrees from leaf to root. What are they does in push_back in function update_counts_plus. But maybe I didn’t understand something ¯\_(ツ)_/¯
insert deletion in a random place – O(m)where m is the number of values in the sheet block;
iteration – O(n).

Memory O(n).

Why the author decided to call this structure sparse is not known for certain.

Here There are benchmarks from the author in comparison with other containers. A here – implementation.

Sparse Set

Let’s remember what a sparse set is.

Sparse set allows you to perform all the same operations as any other set you are familiar with. But from a point of view it is still closer to std::vector<bool>/std::bitset. First we need two arrays and a size:

template <int R>
struct SparseSet {
	int size_ = 0;
	int dense_[R];
	int sparse_[R];
};

dense_ this is an array of values. sparse_ stores the index of the value in dense_, i.e. If dense_[i] = vThat sparse_[v] = i (instead of an index, you can store a pointer). Then the addition looks like this:

void Add(int val) {
	dense_[size_] = val;
	sparse_[val] = size_;
	++size_;
}

For a better understanding, let’s look at an example. Let’s say we inserted the following values into our set: 3, 2, 5, 1, 0.

Wherein size_ equals 5. We have nothing in empty cells.

Now you can check for availability in your network:

bool Contains(int val) {
	return sparse_[val] < size_ && dense_[sparse_[val]] == val;
}

If we want to check for the presence of a two, then our expression will look like this:

sparse_[2] < 5 && dense_[sparse_[2]] == 2

What is equivalent

1 < 5 && 2 == 2

Two in a set!

For values that are not in our set, we will check the garbage cells (if there was nothing in them before) and it is unlikely that our condition will pass (there is still garbage there).

If we want to remove some element, then we need to do a small squat. Let’s assume that we want to remove 0. It’s simple. We just need to reduce the size and this zero will be erased the next time we insert it. What if we want to remove an element that is not at the end dense_? Then we need to change the corresponding element in dense_ with the latter and support the invariant in sparse_. For an example with removing twos:

Now you can safely reduce size_. In code it will look like this:

 bool Remove(int i) {
   if (dense_[size_ - 1] != i) {
     int last_element = dense_[size_ - 1];
     dense_[sparse_[i]] = last_element;
     sparse_[last_element] = sparse_[i];
   }
   --size_;
 }

And of course, if we want to clear the entire data structure, we only need to do one single action:

void Сlear() { size_ = 0; }

All operations (from creation to deletion) for O(1)!

On top of that, you also have elements lying nearby in memory, which allows you to use data locality when iterating through the set.

Of course, such a structure has limitations on the maximum value it can store. Also, if you have complex types, you will still need to call destructors when deleting. But these are minor things.

The folly has quite primitive implementation. They fix the size, which is why the point of customization is lost. But maybe someone will convince them that this is not necessary?)

Of course, you can make arrays not static, but dynamic. And you can even do it not only for numeric types – using hashing! True, not just anyone. Because in fact, we know the entire possible set of values for numeric types, then a similar property should be used when hashing for arbitrary objects so that there are no collisions. This can help perfect hashing.

Sparse Map

OK. Sparse Map is a prank. There’s nothing interesting here. Just instead of the value in dense_ now we keep a couple (key, value)A sparse_ only works with keys.

If you are interested in looking at benchmarks, you can take a look here.