CYBERTEC Logo

JSON in PostgreSQL: how to use it right

09.2021 / Category: / Tags: | |
JSON, the lastest step in database technology
© Laurenz Albe 2021

 

The comprehensive JSON support in PostgreSQL is one of its best-loved features. Many people – particularly those with a stronger background in Javascript programming than in relational databases – use it extensively. However, my experience is that the vast majority of people don't use it correctly. That causes problems and unhappiness in the long run.

In this article, I will try to point out good and bad uses of JSON in PostgreSQL, and provide you with guidelines that you can follow.

A bad example

This data model exemplifies everything that you can do wrong:

First mistake: model regular columns as JSON attributes

There is no reason not to have the room name as a regular column. After all, every room will have a name, and we may want to enforce constraints like uniqueness on the room name.

Second mistake: model tabular data as JSON array

The room reservations are perfectly regular tabular data that define a many-to-many relationship between the rooms and the people. It would have been simple to model the same data with a junction table:

Many people seem to think that storing few large rows in a table is more efficient than storing many small rows. There is some truth to that, since every row has some overhead, and PostgreSQL compresses large data. But if you want to retrieve only parts of the data, or want to modify them, many small rows are much more efficient - as we will see below.

Third mistake: store foreign keys in JSON

The "who" attribute stores a foreign key reference to people. That is not a good idea, because it is impossible for the database to enforce such a constraint: I could just as well have inserted a reference to a non-existing person. With the junction table from above, defining a foreign key is trivial.

Moreover, you often want to join on foreign keys. With JSON, that would require a cross join with the unnested JSON array:

With the junction table, that would be

You can probably guess which of these two queries will be more efficient.

Fourth mistake: modify JSON data

If you want to add a new reservation, you have to execute a statement like

This will fetch the complete JSON object, construct a new JSON from it and store that new object in the table. The whole JSON object has to be read and written, which is more I/O than you would want - particularly if the JSON object is large and stored out of line.

Compare how simple the same exercise would be with the junction table:

This statement will only write a small amount of data.

Deleting a reservation is just as complicated and expensive, and is left as an exercise to the reader.

Fifth mistake: trying to enforce constraints on JSON

So far, our data model offers no protection against overlapping reservations, which would be good to enforce in the database.

With JSON, we are pretty much out of luck here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.

With the junction table, the exercise is simple; all we have to do is to add an exclusion constraint that checks for overlaps with the && operator:

The extension is required to create a GiST index on a bigint column.

Sixth mistake: complicated searches in JSON

Simple searches for equality can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index. But complicated searches are a pain.

Imagine we want to search for all rooms that are occupied at 2021-06-01 15:30:00. With JSON, that would look somewhat like

With our junction table, the query becomes

That query can use the GiST index from the exclusion constraint we created above.

If all the above is wrong, should we use JSON in PostgreSQL at all?

Don't get me wrong: JSON support in PostgreSQL is a wonderful thing. It is just that many people don't understand how to use it right. For example, the majority of questions about PostgreSQL and JSON asked on Stackoverflow are about problems that arise from the use of JSON where it had better been avoided.

Follow these guidelines when you consider using JSON in PostgreSQL:

  • Don't use JSON for data that can easily be stored in database tables.
  • Avoid large JSON objects if you want to modify individual attributes.
  • Don't use JSON if you want to use attributes in complicated WHERE conditions.
  • Avoid JSON if you want to join on some of the attributes.
  • Don't use JSON if you want constraints on the data.
  • Don't store references to other tables in JSON attributes.

Often it may be a good idea to store some attributes as regular table columns and others in a JSON. The less you need to process the data inside the database, the better it is to store them as JSON.

A good example

To show an example of how JSON in the database can be used with benefit, let us consider a shop that sells all kinds of mixed goods. There are some properties that all or most of the goods will have, like price, weight, manufacturer, number available or package size. Other attributes may be rare, like the type of power plug, or ambiguous, like the pitch in screws or tuning forks.

Rather than defining a table with hundreds of columns for all the possible attributes, most of which will be NULL, we model the most frequent attributes with normal table columns and use a JSON for the rest:

This will allow efficient queries like

Conclusion

There is a lot you can do wrong when using JSON in PostgreSQL, especially if you are not acquainted with relational databases. However, it can be a powerful tool - when used properly.

0 0 votes
Article Rating
Subscribe
Notify of
guest
12 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Sammy Wij
Sammy Wij
1 year ago

Thanks!

Chris
Chris
2 years ago

Nice post. Thanks.

Diego Bernal
Diego Bernal
2 years ago

This is great, thank you very much!

yhuelf
yhuelf
10 months ago

This is a great article, thank you very much!

Minor fix : you should probably remove the primary key on reservations (room_id, people_id). It doesn't make sense since a same person can book the same room at another date.

laurenz
laurenz
10 months ago
Reply to  yhuelf

You are right.
I have fixed it to use an exclusion constraint.

Joshua
Joshua
3 months ago

Good article, but I'd actually argue that even the good example doesn't need a jsonb field. For example:

* The "attributes" column could be its own "Attributes" table that stores all attributes as rows (including the ones that already exist for the "wares" table, so those would need to be removed from the wares table).

* Next, an associative/junction table called "WaresAttributes" (or something more descriptive) can store all of the many-to-many relationships between the "wares" and "Attributes" tables. Just the IDs of the wares and attributes is enough.

* Then one can use JOINs for the SELECT queries when needed.

The Dev in me understands wanting to use jsonb fields and be done with all of the hassle. However, the DBA in me knows that there is usually a better way as shown above (in my opinion).

laurenz
laurenz
3 months ago
Reply to  Joshua

You are proposing a kind of entity-attribute-value design. That is usually a bad idea, as I tried to describe in this article.

laurenz
laurenz
10 months ago

Ah, yes. I didn't read that far. Thanks again!

yhuelf
yhuelf
10 months ago

The one in "Fifth mistake" is the right one 😉
(without people_id)

Anyway, not very important.

laurenz
laurenz
3 months ago

You are right that yours is a much less harmful case of EAV, and it might work well enough for your use case.
It probably won't outperform the JSON solution, but without testing it, I cannot be certain.

Joshua
Joshua
3 months ago
Reply to  laurenz

Just to clarify some more a bit, my suggestion follows the standard principle of using an "associative entity/table" to model many-to-many relationships. This is a part of relational modeling/theory that has existed far before the "EAV" design pattern and certainly before JSON's creation (and especially before PostgreSQL integrated it). So labeling it as a "much less harmful EAV" is not accurate, in my opinion. Probably more accurate to say that "EAV" is a less efficient way of using the standard associative entity/table design of the relational model.

Regarding performance, I'd say the burden of proof is to be fully put on the jsonb types since it's the new kid on the block competing with an established standard. I don't disregard JSON's flexibility, just saying that many large businesses did just fine before jsonb --even with the scenario of having many attributes. So if you ever do a benchmark between those, I'd be one of the first to read it!

Joshua
Joshua
3 months ago

Thanks for responding! Though after reading that article, I don't think my suggestion is the same as what you put in the examples. Even though they're both based on entity-relationship modeling, those examples are not as efficient. For example:

* Your examples have multiple attribute tables per type such as "attstring" and "attint". However, my suggestion has only one table ("Attributes") to store all of the attributes for the "wares" table. Any variation in types can be taken care of by using multiple columns. Admittedly, this does require more null checking, but hasn't been a problem in practice. Other solutions I have seen are to just store everything as text and convert to whatever is necessary.

* Your examples require the object ID to be inserted into every row of every attribute table. However, my suggestion has one associative table ("WaresAttributes") that stores the mappings between the wares and Attributes tables --meaning the Attributes table itself doesn't need to be aware of any wares entries.

* The number of SQL statements (along with the complexity) in your examples will be unnecessary, which you stated in the article, as it depends on how many type of attribute tables exist and the fact that you have to put the object ID in each INSERT. My suggestion would only ever deal with three tables max and generally just one table.

* As far as performance goes, my suggestion above is what we use at work (a large company) with many products and attributes having not a single performance issue with PostgreSQL.

At the end of the day, I appreciate your articles, and think both methods (your suggestion of using jsonb as well as what I suggested) can achieve the same thing. However, I'd be curious to see any benchmarks about jsonb fields being faster than what I suggested above rather than what is shown in the other article. Because if they're faster, and I mean faster enough to make a difference, then I may be convinced!

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    12
    0
    Would love your thoughts, please comment.x
    ()
    x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram