JSON field constraints

After giving my presentation at ConFoo this year, I had some discussions with a few people about the ability to put constraints on JSON data, and whether any of the advanced PostgreSQL constraints work for that. Or in short, can we get the benefits from both SQL and NoSQL at the same time?

My general response to questions like this when it comes to PostgreSQL is "if you think there's a chance it works, it probably does", and it turns out that applies in this case as well.

For things like UNIQUE keys and CHECK constraints it's fairly trivial, but for exclusion constraints there are some special constructs that need to be considered.

Other than the technical side of things, it's of course also a question of "should we do this". The more constraints that are added to the JSON data, the less "schemaless" it is. On the other hand, there are other databases that have schemaless/dynamic schemas as their main selling point, yet still require per-key indexes and constraints (unlike PostgreSQL, where JSONB is actually schemaless even when indexed).

Anyway, back on topic. Keys and constraints on JSON data.

In PostgreSQL, keys and constraints can be defined both on regular columns and directly on any expression, as long as that expression is immutable (meaning that the output depends only on the input, and not on any outside state). This functionality works just as well with JSONB.
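As a quick illustration on a plain relational table (a hypothetical table t, separate from the example below), this is the same mechanism that powers classic functional indexes:

postgres=# CREATE TABLE t (col text);
CREATE TABLE

postgres=# CREATE UNIQUE INDEX t_lower_idx ON t (lower(col));
CREATE INDEX

lower() is immutable, so it qualifies - and exactly the same rule applies to the JSONB extraction expressions we'll use below.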

So let's start with a standard JSONB table:

postgres=# CREATE TABLE jsontable (j jsonb NOT NULL);
CREATE TABLE

postgres=# CREATE INDEX j_idx ON jsontable USING gin(j jsonb_path_ops);
CREATE INDEX

Of course, declaring a table like this - a single table with nothing but a JSONB field - is very seldom a good idea in reality. You probably know more about your data than that, so there will be other fields in the table besides the JSONB one. But this table will suffice for our example.

A standard gin index using jsonb_path_ops is how we get fully schemaless indexing in JSONB with maximum performance. We're not actually going to use this index in the examples below, but in real deployments it's likely one of the main reasons to use JSONB in the first place.
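For example, a containment query like the following (purely illustrative - the planner may well skip the index on a table this small) is the kind of lookup such an index accelerates:

postgres=# SELECT j FROM jsontable WHERE j @> '{"title": "test"}';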

To illustrate the constraints, let's add some data representing some sort of bookings. Yes, this would be much better represented as relational, but for the sake of example we'll use JSON with a semi-fixed schema. We'll also use a uuid in the JSON data as some sort of key, as this is fairly common in these scenarios.

postgres=# INSERT INTO jsontable (j) VALUES ($${
  "uuid": "4e9cf085-09a5-4b4f-bc99-bde2d2d51f41",
  "start": "2015-03-08 10:00",
  "end": "2015-03-08 11:00",
  "title": "test"
}$$);
INSERT 0 1

The first thing we might want to consider in a scenario like this is the ability to make sure our uuid field is unique. UUIDs are supposed to be globally unique, but experience suggests that's not always the case, so we want a constraint to ensure we don't end up with duplicates. Doing this is really simple:

postgres=# CREATE UNIQUE INDEX j_uuid_idx ON jsontable(((j->>'uuid')::uuid));
CREATE INDEX

This creates a unique btree index on the extracted value of the uuid field. The index can be used both for key-based lookups and for enforcing the uniqueness of the key. The j->>'uuid' syntax extracts the value of the uuid key as a text string, and the ::uuid syntax then casts it to the built-in uuid datatype.

It is also possible to create the constraint directly on the text value, but doing it on the uuid datatype is much more efficient, since a uuid is treated internally as a 128-bit integer. This gives both a smaller index and faster access, as comparisons don't have to consider things like locales.
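With that in place, a key lookup such as this one (illustrative) can use the index, as long as the query repeats the indexed expression exactly, cast included:

postgres=# SELECT j FROM jsontable WHERE (j->>'uuid')::uuid = '4e9cf085-09a5-4b4f-bc99-bde2d2d51f41';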

Once this constraint has been added, it's no longer possible to insert JSON values with duplicate uuids:

postgres=# INSERT INTO jsontable (j) VALUES ($${"uuid": "4e9cf085-09a5-4b4f-bc99-bde2d2d51f41", "start": "2015-03-08 11:00", "end": "2015-03-08 12:00", "title": "test2"}$$);
ERROR:  duplicate key value violates unique constraint "j_uuid_idx"
DETAIL:  Key (((j ->> 'uuid'::text)::uuid))=(4e9cf085-09a5-4b4f-bc99-bde2d2d51f41) already exists.

There is one remaining problem with this constraint - it does not enforce that the attribute actually exists. We can insert documents in this table that simply do not have the uuid key at all. This is because the extraction operator ->> will return NULL in this case, which does not cause unique violations (because NULL does not equal NULL). If this is a problem, which it usually is, then we can create a CHECK constraint to enforce this:

postgres=# ALTER TABLE jsontable ADD CONSTRAINT uuid_must_exist CHECK (j ? 'uuid');
ALTER TABLE

With this constraint in place, it is no longer possible to insert documents lacking the uuid value, and the previously created unique index enforces that the value that's included is actually unique. The cast to uuid ensures that the format of the value is correct. So with this set of indexes and constraints, we have replicated the functionality of a traditional column specified as uuid NOT NULL UNIQUE.
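For example, trying to insert a document without the key now fails with something like this (illustrative output):

postgres=# INSERT INTO jsontable (j) VALUES ($${"title": "no uuid"}$$);
ERROR:  new row for relation "jsontable" violates check constraint "uuid_must_exist"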

So what about other, more advanced constraints? In particular, what about exclusion constraints? This of course explains why I chose a table of bookings as the example - it is probably the most common example for exclusion constraints. The basic idea is: can we define a constraint so that we cannot have any overlapping bookings in the system? For normal relational tables this is trivial - create a normal exclusion constraint, and you're done.

For JSON data we can almost apply the same method as above, which is to create a constraint based on the extraction expression. There is just one problem - we can only create constraints on expressions that are immutable, and casting text to timestamp is not immutable.

The reason a cast to timestamp is not immutable is that we can cast values that depend on outside state. For example:

postgres=# SELECT 'today'::timestamp;
      timestamp      
---------------------
 2016-03-08 00:00:00
(1 row)

which is a good example of something that is not immutable, as it will change value every day. And values that are in an index can never change (unless the value is explicitly UPDATEd, of course).
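Compare that to a fixed literal, which produces the same result no matter when it is evaluated:

postgres=# SELECT '2016-03-08 10:00'::timestamp;
      timestamp      
---------------------
 2016-03-08 10:00:00
(1 row)

The problem is that the cast has to be declared immutable for all possible inputs, not just the well-behaved ones.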

If we know that our data never contains values like this, then the cast is immutable for our actual usage. In that case, we can create a thin wrapper function that declares it as such:

postgres=# CREATE FUNCTION immutable_tstamp(t text)
  RETURNS timestamp
  LANGUAGE sql
  IMMUTABLE
  AS $$SELECT t::timestamptz AT TIME ZONE 'UTC'$$;
CREATE FUNCTION

This function performs the cast to timestamp with the timezone hard-coded to UTC, and declares itself immutable. It is probably a good idea to also add a CHECK constraint ensuring the data doesn't actually contain values that would make the cast mutable; a rough version is sketched below, but hardening it is left as an exercise for the reader.
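As one possible sketch (the constraint name and regular expression are my own, and a production version would want to be stricter), a check that only accepts fixed-format literals might look like this:

postgres=# ALTER TABLE jsontable
  ADD CONSTRAINT tstamp_format CHECK (
   j->>'start' ~ '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}' AND
   j->>'end' ~ '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}'
);
ALTER TABLE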

If we combine this function with the tsrange() function, we can create an expression-based exclusion constraint that ensures there are no overlapping values across the documents. We do this by extracting the start and end times individually using the just-created immutable cast function, then calling tsrange to construct an actual timestamp range from them, and handing that range to the exclusion constraint with the overlaps operator (&&):

postgres=# ALTER TABLE jsontable
  ADD CONSTRAINT overlapping_times
  EXCLUDE USING gist(
   tsrange(
    immutable_tstamp(j->>'start'),
    immutable_tstamp(j->>'end')
   ) WITH &&
);
ALTER TABLE

And if we now try to insert a value with an overlapping time range, it will be refused:

postgres=# insert into jsontable (j) values ($${
  "uuid": "4e9cf085-09a5-4b4f-bc99-bde2d2d51f43",
  "start": "2015-03-08 10:30",
  "end": "2015-03-08 11:30",
  "title": "test"
}$$);
ERROR:  conflicting key value violates exclusion constraint "overlapping_times"
DETAIL:  Key (tsrange(immutable_tstamp(j ->> 'start'::text), immutable_tstamp(j ->> 'end'::text)))=(["2015-03-08 10:30:00","2015-03-08 11:30:00")) conflicts with existing key (tsrange(immutable_tstamp(j ->> 'start'::text), immutable_tstamp(j ->> 'end'::text)))=(["2015-03-08 11:00:00","2015-03-08 12:00:00")).

With the function and this constraint, we have replicated the functionality of a traditional exclusion constraint specified as EXCLUDE USING gist(r WITH &&) if we had our range in the column r.
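For comparison, that traditional relational version would look something like this (a sketch, with a hypothetical bookings table):

postgres=# CREATE TABLE bookings (
  r tsrange,
  EXCLUDE USING gist (r WITH &&)
);
CREATE TABLE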

So to answer the original question "can we get the benefits from both SQL and NoSQL at the same time", the answer is obviously YES. As long as you use a database which actually has both capabilities, namely PostgreSQL.

It is worth noting that with data like in this example, where the schema is actually known, the system would be much faster and more efficient if those fields were stored in relational form. As always, "the right tool for the job" applies, so JSONB should only be used when the schema is actually at least semi-dynamic. But being able to declare constraints on parts of those schemas can still be extremely valuable, even if it's not quite as speedy as doing it on relational data. The whole point of dynamic schema is flexibility, after all.


Comments

Very interesting. Have your cake and eat it too. Now on to doing some experimenting. One editing note, pretty sure this:

"The reason a cast to timestamp is not immutable, is that we can cast values that are dependent on outside values. For example:

postgres=# SELECT 'today'::timestamp;
      timestamp      
---------------------
 2016-03-08 00:00:00
(1 row)

which is a good example of immutable as it will change value every day. "

should be:

".... is a good example of mutable ..."

Posted on Mar 9, 2016 at 17:55 by Adrian Klaver.

Good point, I've updated the text. Thanks!

Posted on Mar 9, 2016 at 19:36 by Magnus.

