PostgreSQL unique constraint, but NULL conflicts with everything

Posted on 2023-01-24 by Laurenz Albe

creative ideas for an unusual unique constraint — © Laurenz Albe 2022

I have been faced with a request for an unusual unique constraint that puzzled me for a while. Since the solution I came up with is a nice show-case for range data types, I’ll share it with you. Also, it allows me to rant some about NULL, which is a temptation I find hard to resist.

The problem: an unusual unique constraint

We have a table like this:

CREATE TABLE tab (
   a integer NOT NULL,
   b integer
);

Now we want to make sure that the combination of a and b is unique in the following way:

If there is a row where b is NULL, there must not be another row with the same value for a.
If there is no row where b is NULL for a certain a, the table can contain several rows with the same a, as long as the values for b are different.

Let’s illustrate these rules with some sample rows:

-- OK
INSERT INTO tab VALUES (1, NULL);
-- conflicts with the above because of the first rule
INSERT INTO tab VALUES (1, 2);
-- OK, "a" is different
INSERT INTO tab VALUES (5, 2);
-- OK because of the second rule
INSERT INTO tab VALUES (5, 3);
-- conflicts with the above two rows
INSERT INTO tab VALUES (5, NULL);

Ruling out simple solutions

My first reaction was: this won’t be possible with a unique constraint, but perhaps we can create a unique index that enforces these rules. However, I could not find such a solution (perhaps you are smarter than I am and find one).

The next idea is to implement this constraint with a trigger. But such triggers always have a race condition unless you run with SERIALIZABLE isolation level, which doesn’t make this idea appealing either.

The role of NULL in this problem

After thinking about the problem for a while, I realized that it is the unusual role of NULL that makes this problem hard.

The SQL standard defines the null value as a “special value that is used to indicate the absence of any data value”. However, it goes on to state that “the null value is neither equal to any other value nor not equal to any other value — it is unknown whether or not it is equal to any given value”. This sounds slightly contradictory: a value that is not there cannot be equal to a known value, can it? This confusion is at the heart of SQL NULL. If my spouse is NULL, does that mean that I am unmarried? Does it mean that it is unknown whether I am married or not? Does it mean that I am married, but it is unknown to whom? The confusing semantics of NULL in SQL are a symptom of the uncertainty exhibited in the definition and the above example.

I’d say that it is all right to model both missing and unknown values with NULL. But many people abuse NULL for other reasons, for example to represent infinity. This is bad style and will lead to unintuitive and complicated queries. It is also unnecessary, since in PostgreSQL Infinity is a valid value for most numeric and date/time data types.

It seems that similar abuse is at the heart of our problem. Here, NULL is a value that conflicts with everything else. So we could say that NULL means “every possible value” here. This was the key to the solution for me: what would be a better way to represent “every possible value”?

Range data types

As of version 9.2, PostgreSQL has had range data types that represent intervals: all values between a lower and an upper bound. There are also unbounded ranges. Ranges can include the bounds or not: this is represented by a bracket (`[`) or a parenthesis (`(`).

Some examples for ranges:

SELECT '[2022-09-15 00:00:00,2022-09-16 00:00:00)'::tsrange;

                    tsrange                    
═══════════════════════════════════════════════
 ["2022-09-15 00:00:00","2022-09-16 00:00:00")

SELECT '[-10,10]'::int4range;

 int4range 
═══════════
 [-10,11)

SELECT '[0,)'::numrange;

 numrange 
══════════
 [0,)

The second example is not a bug: with integers, a range that goes up to and includes 10 is the same as a range that goes up to 11 while excluding the upper bound. The third example is a range that is unbounded at the upper end.

There are many useful operators for ranges, like the containment operator <@ and the “overlaps” operator &&. PostgreSQL v14 introduced multiranges, which makes working with ranges even easier.

Thinking of our problem, I would say that instead of NULL, we had better use a range that is unbounded on both ends, so that it contains (and conflicts with) all numbers.

The solution: an exclusion constraint instead of a unique constraint

Another PostgreSQL extension to standard SQL are exclusion constraints. Exclusion constraints are an extension of unique constraints: while a unique constraint prevents two rows with equal values for a column, exclusion constraints extend that from equality to other operators. Exclusion constraints are implemented using GiST indexes, which in turn are an extension of B-tree indexes.

In our case, a solution would be to use the “overlaps” operator &&, if we turn regular integers into single-element ranges. But there is a difficulty: we also have to test column a for equality, much like we would in a two-column unique constraint. Now PostgreSQL core does not contain operator classes for “normal” operators like = on the scalar data types like integer. This is because you would normally always use a B-tree index in such a case. But we can create the btree_gist extension to supply the missing operator classes:

CREATE EXTENSION IF NOT EXISTS btree_gist;

Now our constraint looks like this:

ALTER TABLE tab ADD CONSTRAINT null_unique
EXCLUDE USING gist (
   a WITH =,
   int4range(b, b, '[]') WITH &&
);

Let me explain further. If b is not NULL, the int4range function will construct an integer range that only contains b. If b is NULL, the resulting range will be unbounded on both ends. So it is exactly what we need to test for overlap!

Testing the solution for “unique constraint” conflicts

Let’s verify that the constraint works as expected:

-- OK
INSERT INTO tab VALUES (1, NULL);

-- conflicts with the above because of the first rule
INSERT INTO tab VALUES (1, 2);
ERROR:  conflicting key value violates exclusion constraint "null_unique"
DETAIL:  Key (a, int4range(b, b, '[]'::text))=(1, [2,3)) conflicts with existing key (a, int4range(b, b, '[]'::text))=(1, (,)).

-- OK, "a" is different
INSERT INTO tab VALUES (5, 2);

-- OK because of the second rule
INSERT INTO tab VALUES (5, 3);

-- conflicts with the above two rows
INSERT INTO tab VALUES (5, NULL);
ERROR:  conflicting key value violates exclusion constraint "null_unique"
DETAIL:  Key (a, int4range(b, b, '[]'::text))=(5, (,)) conflicts with existing key (a, int4range(b, b, '[]'::text))=(5, [2,3)).

Conclusion

The original problem was hard because it used NULL for something that did not match its semantics. We had to substitute something else for the NULL values before we could apply an exclusion constraint. I think that there are three lessons to be learned here:

Don’t use NULL to represent anything but missing or unknown values.
If a unique constraint with a B-tree index won’t do, consider the more general exclusion constraint with a GiST index.
Range data types are cool.

A unique constraint where NULL conflicts with everything