POST DIRECTORY
Category software development

Self-referential Tables In SQL Are Tricky

One of the applications I work on exposes flattened reporting data as a SQL view. The systems that consume this data only support fetching it via direct database access, so I had to implement it as a read only view. The view itself is fairly complex, combining relational and summary data from 15 different tables.

The most challenging aspect of writing this was working with a hierarchical, self-referential table. Each row of reporting data in the view belonged to a node in this hierarchy, and the view itself needed to include data from the associated root node.

To make this a little more concrete, imagine you had an arbitrarily deep nested category tree with products in each category. How would you write a SQL query over all products where each row included the product’s name and its top-level category name? I’ll refer to this simplified example for the rest of the post.

Here’s the schema for categories and products with the usual indexes and foreign keys.

-- schema.sql for postgresql
CREATE TABLE categories (
  id        serial primary key,
  name      character varying NOT NULL,
  parent_id bigint
);
CREATE INDEX ON categories(parent_id);

CREATE TABLE products (
  id          serial primary key,
  name        character varying NOT NULL,
  category_id bigint NOT NULL
);
CREATE INDEX ON products(category_id);

ALTER TABLE ONLY categories
    ADD CONSTRAINT category_parent_fk FOREIGN KEY (parent_id) REFERENCES categories(id);
ALTER TABLE ONLY products
    ADD CONSTRAINT product_category_fk FOREIGN KEY (category_id) REFERENCES categories(id);

Since the category hierarchy can be arbitrarily deep, no fixed set of SQL query or join operations can solve what is a recursive problem. You need looping or recursion to find your way to the root category. (This is not strictly true if you are willing to accept some de-normalization of the self-referential tables. I will discuss one of these techniques and its limitations next.)

Querying Hierarchical Data

Working with hierarchical data in SQL can be hard. I played with several methods for implementing this part of the view: using a “materialized path”, looking at the “nested sets model”, and implementing a stored procedure to do the looping and recursion I needed.

In the end these all required adding moving parts to the system just to support one reporting view. I’ll briefly touch on a simplified version of “materialized path” to illustrate this point, but the “nested sets model” and stored procedures have similar drawbacks.

Recursive Common Table Expressions (CTEs) turned out to be the perfect solution. They add powerful recursion capabilities to the SQL toolkit and are well supported by most relational databases. As applied to my problem they performed exactly the recursion I needed and added no additional moving parts.

Can We Keep It Simple?

The first obvious solution was to de-normalize the self-referential category table by adding a root_id column containing the primary key for the associated root category. This would allow the SQL query to simply join the product’s category directly to its root category.

The downside to this is the data de-normalization, we are duplicating data implicit in the hierarchy into an explicit column. The root_id column would need to be written to each new row on insert (after traversing the parent categories to find the root first). And moving a category around in the hierarchy could require updating its root_id (and traversing all of the children in its sub-tree to update their root_ids).

This implementation is really a super-simplified form of the “materialized path” technique. The full version has some nice features like being able to efficiently select whole sub-trees in the hierarchy or being able to quickly select a category’s full set of parents. There are also forms that include efficient category sibling ordering for free. Again the downside is the de-normalization involved, but if you can manage it “materialized path” can be a useful technique.

For my purposes even this simplified version of “materialized path” seemed like overkill. I would have to modify the schema to add a root_id column, write scripts to update existing categories, and implement code to keep the root_id column in sync. If possible I wanted to avoid adding all of these moving parts just to support one reporting view.

What Is A Recursive CTE?

A Common Table Expression is a named result set that can be used anywhere a normal table could in the SELECT statement that follows it. In addition, there can be more than one CTE defined for use in the SELECT. In its non-recursive usage a CTE behaves like a named sub-select. The syntax looks like this:

-- Example with two CTEs:
-- CTE named 'one' with column 'number_one' and having one row with value 1
-- CTE named 'two' with column 'number_two' and having one row with value 2
-- SELECT one result row containing data from the row in 'one' and the row in 'two'
WITH one AS (
       SELECT 1 AS number_one
     ),
     two AS (
       SELECT 2 AS number_two
     )
SELECT * FROM one, two;
 number_one | number_two
------------+------------
          1 |          2
(1 row)

Far more interesting and powerful is the recursive CTE. It has the same features as the non-recursive version, but specifies two SELECT statements separated by UNION or UNION ALL.

When the recursive CTE query runs, the first SELECT generates one or more seed rows which are added to the result set. For each of these seed rows the second SELECT is run and its rows are added to the result set. This continues recursively so the second SELECT is run against all the rows from the last iteration and the new rows are added to the result set. The recursion ends when no more rows are returned from the second SELECT.

If the two SELECTs are separated by UNION ALL then all the generated rows are included in the final result set. If they are separated by UNION then any duplicate rows are eliminated from the final result set.

Here is a very simple example that counts to 10:

-- Example with recursive CTE
-- Begin with one row with column named 'n' and value 1
-- Use the previous row and add a new row with column named 'n' and value 2
-- Use the previous row and add a new row with column named 'n' and value 3
-- recursion continues...
-- When n = 10 no rows are returned from the second SELECT and recursion ends
--
-- All the generated rows will be in the CTE named 'counter' for use in the SELECT
WITH RECURSIVE counter AS (
  SELECT 1 as n

  UNION ALL

  SELECT n + 1 FROM counter WHERE n < 10
)
SELECT * from counter;
 n
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

Recursive CTEs are supported by all the major relational databases. In the examples we will use PostgreSQL, which requires the RECURSIVE keyword in recursive CTE definitions but it is optional for other databases.

Recursive CTE Examples

Here is some sample data in CSV format for the remaining examples. This defines a simple hierarchy with two root categories, each with its own sub-tree of categories.

id,name,parent_id
1,Root A,
2,Root B,
3,Child A1,1
4,Child A2,1
5,Child B1,2
6,Child B2,2
7,Grandchild A1a,3
8,Grandchild A1b,3

It looks something like this:

Root A --> Child A1 --> Grandchild A1a
       |            -> Grandchild A1b
       -> Child A2

Root B --> Child B1
       -> Child B2

Find All Categories In A Sub-Tree

The first example will let us query all of the categories in the sub-tree staring with the Child A1 category. I’ll also add a column to represent the relative depth of the sub-tree members. As is typical in real recursive CTE examples, the second SELECT refers to the previously generated results using the name of the CTE itself, in this case sub_tree.

WITH RECURSIVE sub_tree AS (
  SELECT id, name, 1 AS relative_depth
  FROM categories
  WHERE name = 'Child A1'

  UNION ALL

  SELECT cat.id, cat.name, st.relative_depth + 1
  FROM categories cat, sub_tree st
  WHERE cat.parent_id = st.id
)
SELECT * FROM sub_tree;
 id |      name      | relative_depth
----+----------------+----------------
  3 | Child A1       |              1
  7 | Grandchild A1a |              2
  8 | Grandchild A1b |              2
(3 rows)

Find A Category’s Parents

Let’s try traversing up through the parents of a category instead. In this case we’ll add a negative relative depth for the parents. Let’s start at the leaf Grandchild A1b this time.

WITH RECURSIVE parents AS (
  SELECT id, name, parent_id, 0 AS relative_depth
  FROM categories
  WHERE name = 'Grandchild A1b'

  UNION ALL

  SELECT cat.id, cat.name, cat.parent_id, p.relative_depth - 1
  FROM categories cat, parents p
  WHERE cat.id = p.parent_id
)
SELECT id, name, relative_depth FROM parents;
 id |      name      | relative_depth
----+----------------+----------------
  8 | Grandchild A1b |              0
  3 | Child A1       |             -1
  1 | Root A         |             -2
(3 rows)

The Original Product and Root Category Query

Finally we can get back to our original example. For every product I want to show its root category’s name. We will add a couple example products, one in Grandchild A1b and one in Child B1 as in this CSV formatted list.

id,name,category_id
1,Product in A1b,8
2,Product in B1,5

Our CTE will query the root categories (parent_id IS NULL) and then recursively query the children passing the root category name along to be re-used on every child in its sub-tree. Here is just the recursive CTE in action, as you can see it correctly finds the root name for each category in the categories table.

WITH RECURSIVE categories_with_roots AS (
  SELECT id, parent_id, name, name as root_name
  FROM categories
  WHERE parent_id IS NULL

  UNION ALL

  SELECT cat.id, cat.parent_id, cat.name, cwr.root_name
  FROM categories cat, categories_with_roots cwr
  WHERE cat.parent_id = cwr.id
)
SELECT name, root_name FROM categories_with_roots;
      name      | root_name
----------------+-----------
 Root A         | Root A
 Root B         | Root B
 Child A1       | Root A
 Child A2       | Root A
 Child B1       | Root B
 Child B2       | Root B
 Grandchild A1a | Root A
 Grandchild A1b | Root A

Finally, here is the full query of all products and their root category names. The main SELECT just pulls categories_with_roots into the FROM clause and matches them up with products in the WHERE clause.

WITH RECURSIVE categories_with_roots AS (
  SELECT id, parent_id, name, name as root_name
  FROM categories
  WHERE parent_id IS NULL

  UNION ALL

  SELECT cat.id, cat.parent_id, cat.name, cwr.root_name
  FROM categories cat, categories_with_roots cwr
  WHERE cat.parent_id = cwr.id
)
SELECT p.name AS product_name, cwr.root_name
FROM products p, categories_with_roots cwr
WHERE p.category_id = cwr.id;
  product_name  | root_name
----------------+-----------
 Product in B1  | Root B
 Product in A1b | Root A
(2 rows)

You could also use a JOIN to bring the categories_with_roots CTE into the query if you prefer.

WITH RECURSIVE categories_with_roots AS (
  SELECT id, parent_id, name, name as root_name
  FROM categories
  WHERE parent_id IS NULL

  UNION ALL

  SELECT cat.id, cat.parent_id, cat.name, cwr.root_name
  FROM categories cat, categories_with_roots cwr
  WHERE cat.parent_id = cwr.id
)
SELECT products.name AS product_name, categories_with_roots.root_name
FROM products
JOIN categories_with_roots ON categories_with_roots.id = products.category_id;
  product_name  | root_name
----------------+-----------
 Product in B1  | Root B
 Product in A1b | Root A
(2 rows)

How Does This Keep Things Simple?

We now have a simple way to traverse arbitrarily deep hierarchical data very flexibly without introducing the overhead of keeping de-normalized data in a correct state. We didn’t even have to define stored procedures to do the recursive traversal we needed.

The recursion is just another part of the normal SQL query, perfect for this use case. In the future whenever I have to access self-referential, tree-like data I’ll reach for recursive CTEs first.