Tree structures in SQL into one table

The task was to implement the storage and operation of a directory of folders in PostgreSQL. In the process of studying the topic, I came across a large number of materials that solved the problem, but did it without respect did not look concise, violated the transparency of the operations performed, caused blockages, required greater client involvement in the specifics of the work, etc. Therefore, I set a goal: to implement the storage of tree structures (in general form) without using triggers, locks, additional tables (views) and external tools in PostgreSQL (version 16).

Dictionary:
node – a record in the database that is treated as a tree node
parent – a node that is referenced by other nodes
child – a node that links to another node
root – a node that does not have a parent (is not a child)

  • Initialization

    You can implement a tree in a relational database by referring to itself in the table. In this case, each entry will be a separate node that has at most one parent and can have any number of children. A minimal table would look like this:

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
      
    foreign key (parent_id)
        references node (id)
    );

    The type of id and the method of its generation are chosen arbitrarily and do not affect the logic. You can change them at your discretion.

  • Problem #1: Getting the entire tree

    The question is, how to get the entire tree from the database? Do you really have to get the root by id, and get each next level with a separate request? Fortunately, no. To avoid this, add an identifier for each tree (line 5):

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
    tree_id   bigint not null,
      
    foreign key (parent_id)
        references node (id)
    );

    Now the entire tree, no matter what multi-level structure it has, will be obtained in one request. This tree_id can be any unique id: a user, an organization, or any other entity to which the tree belongs. You can even use the root id. However, such a simplification brings with it two problems, which we will dwell on in more detail.

  • Problem #2: Multiple Roots

    In the current scheme, it turns out that each tree_id can have more than one hierarchy of nodes, which will be independent of each other, i.e. have no common root. How to avoid this? We need to make sure that each tree has only one node with parent_id == null. It turns out that every time you save a new node with parent_id == null you need to check the presence of such an entry in the database? Again, no. A unique index will help. For each root (when parent_id == null) there must be a unique tree_id (line 10):

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
    tree_id   bigint not null,
      
    foreign key (parent_id)
        references node (id)
    );
    create unique index on node(tree_id) where parent_id is null;

    Such unique indexes with a condition can guarantee a maximum one-time storage of a certain value for an attribute within one parent entity (in our case, a maximum one-time null value in parent_id within tree_id). We’ll see later how to turn maximum one-time storage into exactly one-time storage.

  • Problem #3: Mixing trees

    In the current diagram, nodes in one tree can refer to nodes in another. Is it really necessary to check the tree_id of the parent with each insertion? And again, no. This is where a composite foreign key comes in handy. Those. now not only the parent_id attribute must be equal to the id of the parent, but also the tree_id of the child must be equal to the tree_id of the parent. To do this, the pair id and tree_id must be unique. In fact, this is already the case (id is unique within the table), but to create a composite foreign key you need to explicitly create a unique index on the pair id, tree_id (lines 7,9,10):

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
    tree_id   bigint not null,
      
    unique (id, tree_id),
    
    foreign key (parent_id, tree_id)
        references node (id, tree_id)
    );
    create unique index on node (tree_id) where parent_id is null;
  • Problem #4: Deleting a node

    When we try to delete a node that has children, we get a foreign key violation error. Do you really have to search for all the daughters and delete everything at once based on the id array? Not necessary. Let's configure the foreign key for cascade deletion (line 11):

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
    tree_id   bigint not null,
      
    unique (id, tree_id),
    
    foreign key (parent_id, tree_id)
        references node (id, tree_id)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;
  • Problem #5: Loops

    The current scheme allows the formation of cyclic connections, which violates the tree structure, can lead to memory leaks, breaks business logic, etc. In a minimal version it will look like this:

    Here the entry refers to itself. This option is solved by prohibiting equality between id and parent_id within one record (line 7):

    create table node
    (
    id        bigint primary key generated always as identity,
    parent_id bigint,
    tree_id   bigint not null,
      
    constraint node_cycle_check check (parent_id != id),
    
    unique (id, tree_id),
    
    foreign key (parent_id, tree_id)
        references node (id, tree_id)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;

    But the situation can be more complicated, for example:

    Is it really necessary for the trigger to go through the entire chain of parents in search of the resulting cycle? No, let's try to cheat again. You can check for the presence of a cycle directly within one record. But to do this, you need to additionally store an array of ids of all parents up to the root and check whether it contains the current id. By the way, in this case the check for parent_id ≠ id already looks redundant (lines 6, 8):

    create table node
    (
    id         bigint primary key generated always as identity,
    parent_id  bigint,
    tree_id    bigint not null,
    parent_ids bigint[],
      
    constraint node_cycle_check check (not (parent_ids @> array[id])),
    
    unique (id, tree_id),
    
    foreign key (parent_id, tree_id)
        references node (id, tree_id)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;

    I anticipate many questions for this step, but please be patient, all the answers will follow. The (@>) operator tests whether an array contains another array. Note that parent_ids is nullable. The attribute will only accept null for the root. It would be possible to make it mandatory, and set the root to an empty array (this will work now). However, looking ahead, I’ll say that it’s better to stick with nullable.

  • Problem #6: Consistency of parent_ids

    Now you can pass anything to parent_ids. How can we ensure that this attribute reflects the actual state of the database? Will you still have to run through your parents as a trigger? No. This is where foreign key expansion comes in handy again. It is necessary to make sure that the attribute is not set arbitrarily, but depends on the parent record. But what should I refer him to? Obviously, the parent_ids of the parent will always differ from the parent_ids of the child by one element. Then we’ll cheat again and create another attribute, which will consist of parent_ids and the current id. And parent_ids will refer to it. To do this, create an attribute, add it to the index and expand the foreign key (lines 7, 11, 13, 14):

    create table node
    (
    id                 bigint primary key generated always as identity,
    parent_id          bigint,
    tree_id            bigint   not null,
    parent_ids         bigint[],
    parent_ids_with_it bigint[] not null,
      
    constraint node_cycle_check check (not (parent_ids @> array [id])),
    
    unique (id, tree_id, parent_ids_with_it),
    
    foreign key (parent_id, tree_id, parent_ids)
        references node (id, tree_id, parent_ids_with_it)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;

    It is worth noting that this attribute is not nullable, because the minimum size of this array == 1 (for the root).

  • Problem #7: Consistency of parent_ids_with_it

    Now parent_ids_with_it is set arbitrarily. What to do with it? Maybe we can go over the parents here? In no case. It will be generated automatically based on parent_ids. We most often encounter “generated always as” when generating id, but it will come in handy here too. To do this, you need to add the current id to the parent_ids array (line 7):

    create table node
    (
    id                 bigint primary key generated always as identity,
    parent_id          bigint,
    tree_id            bigint   not null,
    parent_ids         bigint[],
    parent_ids_with_it bigint[] not null generated always as (parent_ids || id) stored,
    
    constraint node_cycle_check check (not (parent_ids @> array [id])),
    
    unique (id, tree_id, parent_ids_with_it),
    
    foreign key (parent_id, tree_id, parent_ids)
        references node (id, tree_id, parent_ids_with_it)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;

    The generation function (parent_ids || id) adds id to the end of the array. Therefore, this attribute cannot be set manually, it will always be calculated automatically. In the case where parent_ids == null (for the root), the operator will return an array with one element (current id).

    It is worth paying attention to stored. This parameter determines whether the attribute will be stored or virtual (that is, calculated each time it is accessed). For our purposes, the second option would be suitable so as not to take up extra disk space, but PostgreSQL still (up to version 16 inclusive) does not support virtual attributes, so the column is explicitly defined as saved.

  • Problem #8: Parent_ids are fundamentally correct

    Now, when saving a new record, you need to set parent_ids, which will be equal to parent_ids_with_it of the parent. All this works great for daughters. What to do with the root? After all, its parent_id == null, which means its foreign key does not work (does not look for a reference). Thus, you can pass any array to parent_ids, which will then be reflected on all children. Those. you need to make sure that both parent_id and parent_ids are null for the root (line 10):

    create table node
    (
    id                 bigint primary key generated always as identity,
    parent_id          bigint,
    tree_id            bigint   not null,
    parent_ids         bigint[],
    parent_ids_with_it bigint[] not null generated always as (parent_ids || id) stored,
    
    constraint node_cycle_check check (not (parent_ids @> array [id])),
    constraint node_root_check check ((parent_ids is null) = (parent_id is null)),
    
    unique (id, tree_id, parent_ids_with_it),
    
    foreign key (parent_id, tree_id, parent_ids)
        references node (id, tree_id, parent_ids_with_it)
        on delete cascade
    );
    create unique index on node (tree_id) where parent_id is null;
  • Problem #9: Relevance of parent_ids

    How to keep parent_ids up to date? For example, we moved a node (i.e. we changed its parent_id and parent_ids). How to update this attribute for all his daughters? This is where the cascading change will do the work for us. This foreign key setting allows the key itself to automatically change following a change in the attribute(s) the key refers to (line 17):

    create table node
    (
    id                 bigint primary key generated always as identity,
    parent_id          bigint,
    tree_id            bigint   not null,
    parent_ids         bigint[],
    parent_ids_with_it bigint[] not null generated always as (parent_ids || id) stored,
    
    constraint node_cycle_check check (not (parent_ids @> array [id])),
    constraint node_root_check check ((parent_ids is null) = (parent_id is null)),
    
    unique (id, tree_id, parent_ids_with_it),
    
    foreign key (parent_id, tree_id, parent_ids)
        references node (id, tree_id, parent_ids_with_it)
        on delete cascade
        on update cascade
    
    );
    create unique index on node (tree_id) where parent_id is null;

    Now it is enough to change the parent of a certain node (you can even move it to another tree) and all changes will be applied to all its daughters.

    Be careful! The cascade update setting can only be enabled when there is a check for cycles. What will happen otherwise? Let's say there are 2 records: a root and its daughter:

    id

    parent_id

    tree_id

    parent_ids

    parent_ids_with_it

    1

    null

    1

    null

    {1}

    2

    1

    1

    {1}

    {1,2}

    We refer the root to our daughter:

    update node set parent_id = 2, parent_ids = [1,2] where id = 1;

    At this moment, the database will recalculate parent_ids_with_it at the root, and instead of {1} it will be {1,2,1}. Then the daughter’s parent_ids will be updated to {1,2,1}, and parent_ids_with_it will be recalculated to {1,2,1,2}. Which will force the parent_ids at the root to be updated to {1,2,1,2} and the parent_ids_with_it to be recalculated to {1,2,1,2,1}… As you can see, the circle is closed, and the database makes these updates endlessly. The request hangs until it is forced to abort.

  • Problem #10: Exactly one root per tree

    Let's return to the question of the number of roots for one tree. Previously, it was possible to achieve a maximum of one-time root storage. But how can we guarantee exactly one root? In an unexpected way, this has already been done. The fact is that a tree (i.e. within the tree_id) may not have a root (i.e. there may be no node with parent_id==null) only in two cases: the root refers to a node of another tree or the root refers to one of their daughters (including themselves). Both of these scenarios are no longer possible in the current scheme. We can state that if there is a tree in the database, it has exactly 1 root.

  • Optimization

    So, we have a ready-made option for setting up a table for storing trees. You can take it and use it. However, there are a couple more optimizations.

    1. If you look closely at the diagram, you will notice that after the appearance of parent_ids, parent_id itself is redundant. Now it does not carry any information that would not be in parent_ids, and besides, it inflates the index, foreign key and requires additional checking for null for the root. But is it possible to get rid of it painlessly? Quite. Now the composite index will contain 2 attributes: tree_id and parent_ids_with_it. Is it possible to guarantee the uniqueness of this pair? Yes. The last array element in the second attribute will always correspond to the id of the current entry. Since id is unique, the last element of the array will never be repeated. This means the array is unique. This means the couple is unique. After removing parent_id, the schema looks like this:

      create table node
      (
      id                 bigint primary key generated always as identity,
      tree_id            bigint   not null,
      parent_ids         bigint[],
      parent_ids_with_it bigint[] not null generated always as (parent_ids || id) stored,
      
      constraint node_cycle_check check (not (parent_ids @> array [id])),
      
      unique (tree_id, parent_ids_with_it),
      
      foreign key (tree_id, parent_ids)
          references node (tree_id, parent_ids_with_it)
          on delete cascade
          on update cascade
      
      );
      create unique index on node (tree_id) where parent_ids is null;

      Deleted:
      – parent_id attribute
      – check node_root_check
      – parent_id in index and foreign key
      – the condition in the unique index on tree_id has been changed – now the check for null occurs immediately against the parent_ids array.

      Returning to the question of whether to make parent_ids nullable for the root or set an empty array – in this option, parent_ids must be null for the root, and not an empty array, because otherwise the DB will look for a parent with an empty array in parent_ids_with_it.

    2. But that's not all. In some scenarios, you can simplify the design even further. The fact is that the tree_id attribute was introduced for two things: to “attach” a tree to another entity (user, organization, etc.) and to query the entire tree with one request using tree_id. In the first case, we still need tree_id. But let’s imagine that we have the task of storing trees in an impersonal form and returning them by root id. Then you can do without tree_id:

      create table node
      (
      id                 bigint primary key generated always as identity,
      parent_ids         bigint[],
      parent_ids_with_it bigint[] not null generated always as (parent_ids || id) stored,
      
      constraint node_cycle_check check (not (parent_ids @> array [id])),
      
      unique (parent_ids_with_it),
      
      foreign key (parent_ids)
          references node (parent_ids_with_it)
          on delete cascade
          on update cascade
      );

      Deleted:
      – tree_id attribute
      – tree_id in index and foreign key
      – unique index on tree_id

      How can you get the entire tree in this case? Through the first element in the array parent_ids_with_it. It is he who is always equal to the root of his tree. A request for a tree by root id will look like this:

      select * from node where parent_ids_with_it[1] = :id;

      Please note that element numbering starts from 1, not 0.

      The most beautiful implementation. However, there are a number of tasks in which the presence of an external identifier is critical, but more on that later.

  • Nice bonuses:

    Having parent_ids_with_it allows you to do additional checks and queries that would not be possible without it.

    1. Maximum nesting:

      You can limit the maximum nesting in a tree by checking the size of the parent_ids_with_it array:

      constraint node_depth_check check (cardinality(parent_ids_with_it) <= 100)
    2. Getting the full path from root to node (vertical search):

      To get all nodes from the root to a given node, one nested query is sufficient:

      select * from node where id in (select parent_ids_with_it from node where id = :node_id);
    3. Getting all nodes at a certain level (horizontal search):

      You can get all nodes at a certain level (even with different parents) by checking the size of parent_ids_with_it:

      select * from node where tree_id = :tree_id and cardinality(parent_ids_with_it) = :depth;
      select * from node where parent_ids_with_it[1] = :root_id and cardinality(parent_ids_with_it) = :depth;
    4. Getting a subtree:

      Let's assume that the trees in our system are massive, and querying the entire structure at once is not an option. You can make queries by logical chunks, i.e. find only that part of the tree that starts from a specific node:

      select * from node where parent_ids_with_it @> array [:node_id];
    5. Getting a depth-limited subtree:

      But what if the trees are so large that a request for a subtree no longer helps? The requested depth must be limited. To do this, use the already familiar check for the size of parent_ids_with_it:

      1. For the root:

        select * from node where tree_id = :tree_id and cardinality(parent_ids_with_it) <= :depth;
        select * from node where parent_ids_with_it[1] = :root_id and cardinality(parent_ids_with_it) <= :depth;
      2. For an arbitrary node:

        select * from node where parent_ids_with_it @> array [:node_id] and cardinality(parent_ids_with_it) <= cardinality((select parent_ids_with_it from node where id = :node_id)) + :depth - 1;
  • Similar Posts

    Leave a Reply

    Your email address will not be published. Required fields are marked *