JSON in PostgreSQL: how to use it right

JSON, the lastest step in database technology — © Laurenz Albe 2021

Table of Contents

The comprehensive JSON support in PostgreSQL is one of its best-loved features. Many people – particularly those with a stronger background in Javascript programming than in relational databases – use it extensively. However, my experience is that the vast majority of people don't use it correctly. That causes problems and unhappiness in the long run.

In this article, I will try to point out good and bad uses of JSON in PostgreSQL, and provide you with guidelines that you can follow.

A bad example

This data model exemplifies everything that you can do wrong:

/* this table is fine */
CREATE TABLE people (
   id bigint PRIMARY KEY,
   name text NOT NULL
);

INSERT INTO people VALUES
   (1, 'laurenz'),
   (2, 'julian'),
   (3, 'ants'),
   (4, 'kaarel');

/* this table is ill-defined */
CREATE TABLE rooms (
   id bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO rooms VALUES
   (1,
    '{ 'name': 'Room 1',
       'reservations': [
           { 'who': 1, 'from': '2021-06-01 09:00:00', 'to': '2021-06-01 10:00:00' },
           { 'who': 3, 'from': '2021-06-01 10:00:00', 'to': '2021-06-01 11:30:00' },
           { 'who': 2, 'from': '2021-06-01 13:15:00', 'to': '2021-06-01 14:45:00' },
           { 'who': 1, 'from': '2021-06-01 15:00:00', 'to': '2021-06-01 16:00:00' }
        ]
     }'),
   (2,
    '{ 'name': 'Room 2',
       'reservations': [
           { 'who': 2, 'from': '2021-06-01 09:30:00', 'to': '2021-06-01 10:30:00' }
        ]
     }');

/* this table is fine */

CREATE TABLE people (

id bigint PRIMARY KEY,

name text NOT NULL

);

INSERT INTO people VALUES

(1, 'laurenz'),

(2, 'julian'),

(3, 'ants'),

(4, 'kaarel');

/* this table is ill-defined */

CREATE TABLE rooms (

id bigint PRIMARY KEY,

data jsonb

);

INSERT INTO rooms VALUES

(1,

'{ 'name': 'Room 1',

'reservations': [

{ 'who': 1, 'from': '2021-06-01 09:00:00', 'to': '2021-06-01 10:00:00' },

{ 'who': 3, 'from': '2021-06-01 10:00:00', 'to': '2021-06-01 11:30:00' },

{ 'who': 2, 'from': '2021-06-01 13:15:00', 'to': '2021-06-01 14:45:00' },

{ 'who': 1, 'from': '2021-06-01 15:00:00', 'to': '2021-06-01 16:00:00' }

]

}'),

(2,

'{ 'name': 'Room 2',

'reservations': [

{ 'who': 2, 'from': '2021-06-01 09:30:00', 'to': '2021-06-01 10:30:00' }

]

}');

First mistake: model regular columns as JSON attributes

There is no reason not to have the room name as a regular column. After all, every room will have a name, and we may want to enforce constraints like uniqueness on the room name.

Second mistake: model tabular data as JSON array

The room reservations are perfectly regular tabular data that define a many-to-many relationship between the rooms and the people. It would have been simple to model the same data with a junction table:

/* no primary key - we'll get to that later */
CREATE TABLE reservations (
   people_id bigint REFERENCES people NOT NULL,
   room_id   bigint REFERENCES rooms  NOT NULL,
   reserved  tsrange                  NOT NULL
);

INSERT INTO reservations VALUES
   (1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),
   (3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),
   (2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),
   (1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),
   (2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');

/* no primary key - we'll get to that later */

CREATE TABLE reservations (

people_id bigint REFERENCES people NOT NULL,

room_id bigint REFERENCES rooms NOT NULL,

reserved tsrange NOT NULL

);

INSERT INTO reservations VALUES

(1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),

(3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),

(2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),

(1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),

(2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');

Many people seem to think that storing few large rows in a table is more efficient than storing many small rows. There is some truth to that, since every row has some overhead, and PostgreSQL compresses large data. But if you want to retrieve only parts of the data, or want to modify them, many small rows are much more efficient - as we will see below.

Third mistake: store foreign keys in JSON

The "who" attribute stores a foreign key reference to people. That is not a good idea, because it is impossible for the database to enforce such a constraint: I could just as well have inserted a reference to a non-existing person. With the junction table from above, defining a foreign key is trivial.

Moreover, you often want to join on foreign keys. With JSON, that would require a cross join with the unnested JSON array:

SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS res(j)
   JOIN people
      ON res.j ->> 'who' = people.id::text;

SELECT rooms.data ->> 'name',

people.name

FROM rooms

CROSS JOIN LATERAL jsonb_array_elements(

rooms.data -> 'reservations'

) AS res(j)

JOIN people

ON res.j ->> 'who' = people.id::text;

With the junction table, that would be

SELECT rooms.name,
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;

SELECT rooms.name,

people.name

FROM rooms

JOIN reservations AS r ON r.room_id = rooms.id

JOIN people ON r.people_id = people.id;

You can probably guess which of these two queries will be more efficient.

Fourth mistake: modify JSON data

If you want to add a new reservation, you have to execute a statement like

UPDATE rooms
SET data = jsonb_set(
              data,
              '{reservations,100000}',
              '{'who': 3, 'from': '2021-06-01 11:00:00', 'to': '2021-06-01 12:00:00'}',
              TRUE
           )
WHERE id = 2;

UPDATE rooms

SET data = jsonb_set(

data,

'{reservations,100000}',

'{'who': 3, 'from': '2021-06-01 11:00:00', 'to': '2021-06-01 12:00:00'}',

TRUE

)

WHERE id = 2;

This will fetch the complete JSON object, construct a new JSON from it and store that new object in the table. The whole JSON object has to be read and written, which is more I/O than you would want - particularly if the JSON object is large and stored out of line.

Compare how simple the same exercise would be with the junction table:

INSERT INTO reservations VALUES
   (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');

1 2	INSERT INTO reservations VALUES (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');

This statement will only write a small amount of data.

Deleting a reservation is just as complicated and expensive, and is left as an exercise to the reader.

Fifth mistake: trying to enforce constraints on JSON

So far, our data model offers no protection against overlapping reservations, which would be good to enforce in the database.

With JSON, we are pretty much out of luck here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.

With the junction table, the exercise is simple; all we have to do is to add an exclusion constraint that checks for overlaps with the && operator:

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (
   reserved WITH &&,
   room_id WITH =
);

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (

reserved WITH &&,

room_id WITH =

);

The extension is required to create a GiST index on a bigint column.

Sixth mistake: complicated searches in JSON

Simple searches for equality can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index. But complicated searches are a pain.

Imagine we want to search for all rooms that are occupied at 2021-06-01 15:30:00. With JSON, that would look somewhat like

SELECT id
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS elem(j)
WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'
  AND CAST(elem.j ->> 'to'   AS timestamp) >  TIMESTAMP '2021-06-01 15:30:00';

SELECT id

FROM rooms

CROSS JOIN LATERAL jsonb_array_elements(

rooms.data -> 'reservations'

) AS elem(j)

WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'

AND CAST(elem.j ->> 'to' AS timestamp) > TIMESTAMP '2021-06-01 15:30:00';

With our junction table, the query becomes

SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';

SELECT room_id

FROM reservations

WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';

That query can use the GiST index from the exclusion constraint we created above.

If all the above is wrong, should we use JSON in PostgreSQL at all?

Don't get me wrong: JSON support in PostgreSQL is a wonderful thing. It is just that many people don't understand how to use it right. For example, the majority of questions about PostgreSQL and JSON asked on Stackoverflow are about problems that arise from the use of JSON where it had better been avoided.

Follow these guidelines when you consider using JSON in PostgreSQL:

Don't use JSON for data that can easily be stored in database tables.
Avoid large JSON objects if you want to modify individual attributes.
Don't use JSON if you want to use attributes in complicated WHERE conditions.
Avoid JSON if you want to join on some of the attributes.
Don't use JSON if you want constraints on the data.
Don't store references to other tables in JSON attributes.

Often it may be a good idea to store some attributes as regular table columns and others in a JSON. The less you need to process the data inside the database, the better it is to store them as JSON.

A good example

To show an example of how JSON in the database can be used with benefit, let us consider a shop that sells all kinds of mixed goods. There are some properties that all or most of the goods will have, like price, weight, manufacturer, number available or package size. Other attributes may be rare, like the type of power plug, or ambiguous, like the pitch in screws or tuning forks.

Rather than defining a table with hundreds of columns for all the possible attributes, most of which will be NULL, we model the most frequent attributes with normal table columns and use a JSON for the rest:

CREATE TABLE wares (
   id           bigint           PRIMARY KEY,
   name         text             UNIQUE NOT NULL,
   price        numeric(15,2)    NOT NULL,
   weight_kg    double precision NOT NULL,
   available    integer          NOT NULL
      CHECK (available >= 0),
   manufacturer text,
   package_size integer          NOT NULL
      CHECK (package_size >= 0),
   attributes   jsonb
);

/* to search in 'attributes' */
CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on 'name' */
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);

CREATE TABLE wares (

id bigint PRIMARY KEY,

name text UNIQUE NOT NULL,

price numeric(15,2) NOT NULL,

weight_kg double precision NOT NULL,

available integer NOT NULL

CHECK (available >= 0),

manufacturer text,

package_size integer NOT NULL

CHECK (package_size >= 0),

attributes jsonb

);

/* to search in 'attributes' */

CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on 'name' */

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);

This will allow efficient queries like

SELECT name, price, available
FROM wares
WHERE name LIKE '%tuning fork%'
  AND attributes @> '{'pitch': 'a'}';

SELECT name, price, available

FROM wares

WHERE name LIKE '%tuning fork%'

AND attributes @> '{'pitch': 'a'}';

Conclusion

There is a lot you can do wrong when using JSON in PostgreSQL, especially if you are not acquainted with relational databases. However, it can be a powerful tool - when used properly.

0 0 votes

Article Rating

12 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Sammy Wij

1 year ago

Thanks!

Chris

2 years ago

Nice post. Thanks.

Diego Bernal

2 years ago

This is great, thank you very much!

yhuelf

10 months ago

This is a great article, thank you very much!

Minor fix : you should probably remove the primary key on reservations (room_id, people_id). It doesn't make sense since a same person can book the same room at another date.

laurenz

10 months ago

Reply to yhuelf

You are right.
I have fixed it to use an exclusion constraint.

Joshua

3 months ago

Good article, but I'd actually argue that even the good example doesn't need a jsonb field. For example:

* The "attributes" column could be its own "Attributes" table that stores all attributes as rows (including the ones that already exist for the "wares" table, so those would need to be removed from the wares table).

* Next, an associative/junction table called "WaresAttributes" (or something more descriptive) can store all of the many-to-many relationships between the "wares" and "Attributes" tables. Just the IDs of the wares and attributes is enough.

* Then one can use JOINs for the SELECT queries when needed.

The Dev in me understands wanting to use jsonb fields and be done with all of the hassle. However, the DBA in me knows that there is usually a better way as shown above (in my opinion).

laurenz

3 months ago

Reply to Joshua

You are proposing a kind of entity-attribute-value design. That is usually a bad idea, as I tried to describe in this article.

laurenz

10 months ago

Ah, yes. I didn't read that far. Thanks again!

yhuelf

10 months ago

The one in "Fifth mistake" is the right one 😉
(without people_id)

Anyway, not very important.

laurenz

3 months ago

You are right that yours is a much less harmful case of EAV, and it might work well enough for your use case.
It probably won't outperform the JSON solution, but without testing it, I cannot be certain.

Joshua

3 months ago

Reply to laurenz

Just to clarify some more a bit, my suggestion follows the standard principle of using an "associative entity/table" to model many-to-many relationships. This is a part of relational modeling/theory that has existed far before the "EAV" design pattern and certainly before JSON's creation (and especially before PostgreSQL integrated it). So labeling it as a "much less harmful EAV" is not accurate, in my opinion. Probably more accurate to say that "EAV" is a less efficient way of using the standard associative entity/table design of the relational model.

Regarding performance, I'd say the burden of proof is to be fully put on the jsonb types since it's the new kid on the block competing with an established standard. I don't disregard JSON's flexibility, just saying that many large businesses did just fine before jsonb --even with the scenario of having many attributes. So if you ever do a benchmark between those, I'd be one of the first to read it!

Joshua

3 months ago

Thanks for responding! Though after reading that article, I don't think my suggestion is the same as what you put in the examples. Even though they're both based on entity-relationship modeling, those examples are not as efficient. For example:

* Your examples have multiple attribute tables per type such as "attstring" and "attint". However, my suggestion has only one table ("Attributes") to store all of the attributes for the "wares" table. Any variation in types can be taken care of by using multiple columns. Admittedly, this does require more null checking, but hasn't been a problem in practice. Other solutions I have seen are to just store everything as text and convert to whatever is necessary.

* Your examples require the object ID to be inserted into every row of every attribute table. However, my suggestion has one associative table ("WaresAttributes") that stores the mappings between the wares and Attributes tables --meaning the Attributes table itself doesn't need to be aware of any wares entries.

* The number of SQL statements (along with the complexity) in your examples will be unnecessary, which you stated in the article, as it depends on how many type of attribute tables exist and the fact that you have to put the object ID in each INSERT. My suggestion would only ever deal with three tables max and generally just one table.

* As far as performance goes, my suggestion above is what we use at work (a large company) with many products and attributes having not a single performance issue with PostgreSQL.

At the end of the day, I appreciate your articles, and think both methods (your suggestion of using jsonb as well as what I suggested) can achieve the same thing. However, I'd be curious to see any benchmarks about jsonb fields being faster than what I suggested above rather than what is shown in the other article. Because if they're faster, and I mean faster enough to make a difference, then I may be convinced!

Stay tuned with our

JSON in PostgreSQL: how to use it right

A bad example

First mistake: model regular columns as JSON attributes

Second mistake: model tabular data as JSON array

Third mistake: store foreign keys in JSON

Fourth mistake: modify JSON data

Fifth mistake: trying to enforce constraints on JSON

Sixth mistake: complicated searches in JSON

If all the above is wrong, should we use JSON in PostgreSQL at all?

A good example

Conclusion

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts