Flexible indexing of elements in a container in C++ and what does Boost.MultiIndex have to do with it


Motivation

Suppose you are a C++ programmer and you need to create a reference. Well, to be more precise, consider one of the stages when you need to map one set to another. In the process of finding a solution, you learn about hash tables, trees, and make the most naive implementation. Then, with the complication of a standard example, you start asking questions:

  1. How to make a class field a key?

  2. What if there are more than 1 rules for indexing (how to map one plurality to another)?

First, consider a class with information about a person, on which we will consider the following examples:

struct Person
{
    Person(std::string name, std::string surname, uint8_t years)
        : name(std::move(name))
        , surname(std::move(surname))
        , years(years)
    {
    }

    std::string name;
    std::string surname;
    uint8_t years;
};

Solution within the standard library

First, let’s look at a naive implementation:

std::vector<Person> persons
{
		Person("Jena", "Kisly", 60),
		Person("Vadya", "Molodec", 40), 
		Person("Onnia", "Solnce", 40)
};

auto jenaIter = std::find_if(persons.begin(), persons.end(),
						[](Person const& person) {return person.name == "Jena";});

The problem with this solution is obvious – too long a search in the container. It is possible to achieve, on average, a constant search time, instead of a linear one (for now, we will omit the nuance that there can be more than 1 people with the same name). To do this, let the name field from now on be the key in the hash table:

std::unordered_map<std::string/*name*/, Person> persons
{
		{ "Jena", Person{"Jena", "Kisly", 60} },
		{ "Vadya", Person{"Vadya", "Molodec", 40} },
		{ "Onnia", Person{"Onnia", "Solnce", 40} }
};

auto jenaIter = persons.find("Jena");

But this solution also has a number of disadvantages:

  1. Duplication of the same value (name) in memory.
    Bad, at least because we are duplicating lines that can take up a lot of space, and, within the scope of the subject area, there can be many such lines. Consequently, the scale of duplication, in the future, will become colossal.

  2. Synchronization problems, follows from item 1.
    If the programmer is told to add the name change feature, then he will need to remember that after changing the Person:: name field, the value of the corresponding key must also be updated. You can, of course, write a special body kit for this, but is it necessary?

  3. The complexity of changing the key follows from paragraph 2.
    To achieve synchronization, you will need to change the key. Generally, make it to C++17 with its unordered_map<T>::extract beautifully it will not turn out, because the key is constant. This can only be done through unordered_map<T>::erase+ unordered_map<T>::insertthe combination of which, it seems, should not lead to an overhead like resizing and / or rehashing.

  4. The solution will not survive if we want to be indexed by some other field (and we will definitely want to). For example, by last name.

After analyzing the list of problems, we can say that some of them can be solved quite simply, if you do not violate DRY. We can DRYit by removing the field Person::namelet the name value be stored only in the container key:

std::unordered_map<std::string/*name*/, Person> persons
{
		{ "Jena", Person{"Kisly", 60} },
		{ "Vadya", Person{"Molodec", 41} },
		{ "Onnia", Person{"Solnce", 40} }
};

auto jenaIter = persons.find("Jena");

The disadvantages of the solution are obvious:

  1. Violation of the principle of least surprise.
    Such a solution looks at least strange, which means that a newly arrived programmer (two weeks on vacation makes us all like that) will need more time to understand the code.

  2. Highly related to the specifics of a particular container. This will make it difficult to switch to another container, if necessary.

  3. The complexity of changing the key has not disappeared anywhere.

  4. The problem of indexing by other fields has not disappeared anywhere.

Then a solution comes to mind through std::unordered_set. For it, you will need to implement a hash function and a comparator by name:

template<typename ClassWithNameField>
class NameHash
{
public:
    size_t operator()(ClassWithNameField const& person) const
    {
        return std::hash<decltype(person.name)>{}(person.name);
    }
};

template<typename ClassWithNameField>
class NamesEqual
{
public:
    bool operator()(ClassWithNameField const& lhs, ClassWithNameField const& rhs) const
    {
        return lhs.name == rhs.name;
    }
};

void main()
{
    std::unordered_set<Person, NameHash<Person>, NamesEqual<Person>> persons
    {
        Person{"Jena", "Kisly", 60},
        Person{"Vadya", "Molodec", 41},
        Person{"Onnia", "Solnce", 40}
    };

    auto jenaIter = persons.find(Person{"Jena", "blah", 0});
}

But this solution is not ideal either. We have to do a lot of things:

  1. Create a dummy Person object to search by name.

    1. In particular, you have to know which of the Person fields in the search is fictitious, although this is a matter of the hash function and the comparator.

  2. Write a separate hash function and comparator from the declaration of the container, which spreads the logic over the code, thereby making it difficult to understand.

  3. There can be more than 1 people with the same name, so set doesn’t fit by definition.

  4. The problem of indexing by other fields has not disappeared anywhere.

Solution with std::set::is_transperent

Looking ahead a little, few people know about it, but starting from C++14, the problems of the previous solution:

  1. [1] and [1.1]related to what you need to know about the semantics of the key.

  2. [4]The associated with indexing on other fields.

can be completely eliminated if you have an ordered container (approx. std::set) and you are using C++14 or higher.

To do this, you need to define an alias in the comparator is_transparent(what type of aliasing is not important). This will allow you to use overloads find( count, upper_bound etc.), which allow you to compare anything with an element in the container.

In our example, it looks like this:

class PersonsComparator
{
public:
    using is_transparent = void;
		
  	// сравнение по имени, если для поиска передали Person
    bool operator()(Person const& lhs, Person const& rhs) const
    {
        return lhs.name < rhs.name;
    }
	
  	// сравнение по годам, если для поиска передали uint8_t
    bool operator()(uint8_t years, Person const& person) const
    {
        return years < person.years;
    }
    bool operator()(Person const& person, uint8_t years) const
    {
        return person.years < years;
    }

  	// сравнение по фамилии, если для поиска передали std::string
    bool operator()(std::string surname, Person const& person) const
    {
        return surname < person.surname;
    }
    bool operator()(Person const& person, std::string surname) const
    {
        return person.surname < surname;
    }
};

void main()
{
    std::set<Person, PersonsComparator> persons
    {
        Person{"Jena", "Kisly", 60},
        Person{"Vadya", "Molodec", 41},
        Person{"Onnia", "Solnce", 40}
    };

    auto jenaIter = persons.find(Person{"Jena", "Kisly", 60});
    auto vadyaIter = persons.find(41);
    auto onniaIter = persons.find("Solnce");
}

In the general case, this solution brilliantly solves only the problem [1] (dummy object). But the problem [1.1] only slightly modified: now we need to know the semantics of the passed magic number, which only the comparator should know about. But apparently it’s as design ¯\_(ツ)_/¯

As for the problem [4](indexing on other fields), since in such a solution the choice of type affects the field that is being searched, and this relationship is logical, i.e. the compiler will not be able to validate it in any way on its own, then it is likely that they will stumble upon a curious situation when find from the previous example, someone wants to pass a first name instead of a last name. In this case, a person will find an error only at runtime, which is much later than we would like. In this example, of course, it will be almost harmless, but the probability of stumbling upon this problem grows with the number of fields of the same type.

Bottom line: STL-based solution

I have not described all possible solutions, but one way or another, all of them will have similar significant problems (not necessarily all):

  • Too much to know about key semantics.

  • Spread comparison logic across different classes, separating it from the container definition.

  • To put up with the fact that being indexed by several fields at once is somewhat problematic.

The solution based on the standard library is not flexible enough. In such a situation, it’s time to either cut the self-propelled gun, or turn to ready-made solutions. One such solution is Boost.MultiIndex.

Boost.MultiIndex based solution

To begin with, I will make a reservation that this solution involves some compromises, primarily related to the declaration of the container, which is incomprehensible at first glance, tied to templates. But, from my point of view, this is the way to boost.

Returning to the Person example, the solution is based on multi_index_container will look like this:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/hashed_index.hpp>

class Person
{
		// name, surname, years
};

using Persons = boost::multi_index::multi_index_container<
        Person,
        boost::multi_index::indexed_by<
            boost::multi_index::hashed_unique<
                boost::multi_index::tag<struct PersonsByName>,
                boost::multi_index::member<Person, decltype(Person::name), &Person::name>
            >
        >
    >;  

void main()
{
    Persons persons
    {
        Person{"Jena", "Kisly", 60},
        Person{"Vadya", "Molodec", 41},
        Person{"Onnia", "Solnce", 40}
    };

    auto const& personsByName = persons.get<PersonsByName>();
    auto jenaIter = personsByName.find("Jena");
}

Yes, it looks more complicated than with unordered_map. But now our possibilities are almost limitless. For example, now we can easily add the ability to be indexed by last name:

using Persons = boost::multi_index::multi_index_container<
        Person,
        boost::multi_index::indexed_by<
            boost::multi_index::hashed_unique<
                boost::multi_index::tag<struct PersonsByName>,
                boost::multi_index::member<Person, decltype(Person::name), &Person::name>
            >,
            boost::multi_index::hashed_unique<
                boost::multi_index::tag<struct PersonsBySurname>,
                boost::multi_index::member<Person, decltype(Person::surname), &Person::surname>
            >
        >
    >;  

void main()
{
    Persons persons
    {
        Person{"Jena", "Kisly", 60},
        Person{"Vadya", "Molodec", 41},
        Person{"Onnia", "Solnce", 40}
    };

    auto const& personsByName = persons.get<PersonsByName>();
    auto jenaIter = personsByName.find("Jena");

    auto const& personsBySurname = persons.get<PersonsBySurname>();
    auto vadyaIter = personsByName.find("Molodec");
}

In fact, we have added just nothing code (picture below), but solved almost all the significant problems that the previous solutions had.

Among the alternative solutions, only a couple come to mind unordered_map, where in one key will be name, and in the other surname. But such a system of two maps is very inconvenient and sooner or later will lead to errors related to the synchronization of container elements.
Or else you could use unordered_set together with is_transperentas I described before, but this option also has significant drawbacks that I wrote about.

Another plus boost::multi_index::multi_index_container is that it allows us to create and use composite keys quite simply:

#include <boost/multi_index/composite_key.hpp>

class Person
{
  	// ...
};

using Persons = boost::multi_index::multi_index_container<
        Person,
        boost::multi_index::indexed_by<
            boost::multi_index::hashed_non_unique<
                boost::multi_index::tag<struct PersonsByNameAndSurname>,
                boost::multi_index::composite_key<
                    Person,
                    boost::multi_index::member<Person, decltype(Person::name), &Person::name>,
                    boost::multi_index::member<Person, decltype(Person::surname), &Person::surname>
                >
            >
        >
    >;  

void main()
{
    Persons persons
    {
        Person{"Jena", "Kisly", 60},
        Person{"Jena", "Kisly", 10},
        Person{"Vadya", "Molodec", 41},
        Person{"Onnia", "Solnce", 40}
    };

    auto const& personsByName = persons.get<PersonsByNameAndSurname>();
    auto jenaIter = personsByName.find(boost::make_tuple("Jena","Kisly"));
}

Also in this example, we took into account that there are people with the same first and last name using _non_unique index. Now, to find all people with the same first and last name, it is enough to call the method equal_range:

Persons persons
{
		Person{"Jena", "Kisly", 60},
		Person{"Jena", "Kisly", 62},
		Person{"Vadya", "Molodec", 41},
		Person{"Onnia", "Solnce", 40}
};

auto const& personsByName = persons.get<PersonsByNameAndSurname>();
auto jenasItersPair = personsByName.equal_range(boost::make_tuple("Jena","Kisly"));

// Вывод в зависимости от хэш-функции:
// Jena Kisly 62
// Jena Kisly 60
std::for_each(jenasItersPair.first, jenasItersPair.second,
              [](Person const& person)
              {
                std::cout << person.name 
                  << " " << person.surname
                  << " " << (size_t)person.years << std::endl; 
              });

The problem that changing the value of a key is rather ugly (before C++17) is also now resolvable. You need to use the method modify_key:

auto& personsByName = persons.get<PersonsByName>();
auto jenaIter = personsByName.find("Jena");
personsByName.modify_key(jenaIter, [](std::string& name){ name="Jonny"; });

This is not the whole strength of this container. There are other kinds of indexes that make the solution based on this container more flexible. In short, the instructions for choosing them are approximately as follows:

1.a If the elements are in order (through comparison using the <>== operator) – you need ordered_index.
1.b If you need to be indexed by hash – you need hashed_index.

2.a If the element is unique by key, you need _unique index
Syntax: [ordered|hashed]_unique.
If ordered, then it will turn out as in std::setif hashed then as in std::unordered_set.
2.b Element by key is not unique – needed _non_unique index
Syntax: [ordered | hashed]_non_unique.
If ordered, then it will turn out as in std::multisetif hashed then as in std::unordered_multiset.

3. We need random access to the element by index – use random_access index (it will turn out as in std::vector/array).

Conclusion

Boost.MultiIndex is a powerful tool that, for the price of fairly acceptable trade-offs, gives you great opportunities for indexing on a dataset. Thank you for attention!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *