Store Trees As Materialized Paths

Currently used approaches for trees in databases like adjacency lists or the nested set model are too complex. Storing the full path with every row simplifies tree querying and management.

You can use the lesser-known materialized path model as a more straightforward approach to store trees than the more complex known nested set and adjacency list models. Every row merely stores the materialized path within the tree to itself, making queries for tree searching relatively easy. With PostgreSQL, you'll get a wide range of querying and manipulation functionality from the label tree extension. While for MySQL, you'll have to fallback to text-searching functionalities.

Usage

MySQL

CREATE TABLE tree (id bigint, path varchar(255), ...);
INSERT INTO tree (path) VALUES ('Food');
INSERT INTO tree (path) VALUES ('Food.Fruit');
INSERT INTO tree (path) VALUES ('Food.Fruit.Cherry');
INSERT INTO tree (path) VALUES ('Food.Fruit.Banana');
INSERT INTO tree (path) VALUES ('Food.Meat');
INSERT INTO tree (path) VALUES ('Food.Meat.Beaf');
INSERT INTO tree (path) VALUES ('Food.Meat.Pork');

-- All children (Food.Fruit.Banana, Food.Fruit.Cherry):
SELECT * FROM tree WHERE path like 'Food.Fruit.%';

-- All parents (Food, Food.Fruit):
SELECT * FROM tree WHERE path IN('Food', 'Food.Fruit');

PostgreSQL

CREATE EXTENSION ltree;
CREATE TABLE tree (id bigint, path ltree, ...);
INSERT INTO tree (path) VALUES ('Food');
INSERT INTO tree (path) VALUES ('Food.Fruit');
INSERT INTO tree (path) VALUES ('Food.Fruit.Cherry');
INSERT INTO tree (path) VALUES ('Food.Fruit.Banana');
INSERT INTO tree (path) VALUES ('Food.Meat');
INSERT INTO tree (path) VALUES ('Food.Meat.Beaf');
INSERT INTO tree (path) VALUES ('Food.Meat.Pork');

-- All children (Food.Fruit.Banana, Food.Fruit.Cherry):
SELECT * FROM tree WHERE path ~ 'Food.Fruit.*{1,}';

-- All parents (Food, Food.Fruit):
SELECT * FROM tree WHERE path @> subpath('Food.Fruit.Banana', 0, -1);

Detailed Explanation

Many apps have to store trees in some form in their database, like deeply nested comments, project hierarchies and so on. You have to choose between using adjacency lists or the nested set mode.

With the adjacency list approach, the table has a parent_id column with each row pointing to its parent. Building and modifying the tree is effortless: If you want to move a complete sub-tree to a different parent, you only have to update the parent_id of a single row. Due to that simplicity, loops within a tree are created quickly, which shouldn't happen. You have to be very careful to avoid such problems. While building the tree is easy, searching for all ancestors or children of a row is complicated: You can either execute one query for every new tree level you find (n+1 queries) or use recursive CTEs to do everything in a simple operation. They are hard to write and you have to incorporate cycle detection to prevent infinite long-running queries.

The nested set model solves the former problems of infinite loops and complex querying: Instead of a parent_id relationship, each row stores a left and right number to indicate the rows within the tree that are children. Finding all children or parents is efficient and easy by comparing the left-right ranges of rows. But the simplicity is achieved at the cost of one major disadvantage: Adding or removing rows within that tree is much more complicated. Any change within the tree results in updating countless rows to ensure the nested model semantics. You often have to rely on libraries to do that tree management for you because some optimizations can reduce the overhead but are quite complicated to implement.

The much simpler materialized path model solves most of the former problems: Tree building and management are simplified by each row storing its entire path from the root node. With this nice trick, adding and removing new rows into the tree is simple and done without rewriting many rows. Loops are impossible too and you can now also get the depth of a row within a tree. Querying for parents and children is much easier because the materialized path prevents complexities like recursive CTEs.

Storing trees in a database is not natively supported. Hence, approaches such as the nested model had to be invented for an efficient tree search. The materialized path model is accordingly not widely supported either. PostgreSQL has native backing for it with index-supported tree queries and auxiliary functions for path inspections and modifications. With every other database, you have to rely on the support for strings and their indexing. As shown in the MySQL example, trailing wildcard searches and lookups of specific tree paths also work great.

Additional Resources

Author Tobias Petry

Tobias Petry

SQL for Devs Founder


Are you interested to learn more?

Be notified on future content. Never spam.