Postgres, the Good Parts: Arrays

21 August 2016

PostgreSQL, also known as Postgres, is "the world's most advanced open source database". But what makes it so advanced? In this post, we'll be covering what may be my favorite feature: arrays.

The Basics

Arrays allow you store lists of values in a single column. SQL, being strongly typed, allows for only heterogeneous arrays. Any non-array concrete type can be used as the type of an array. Arrays of text? Of course. integer? Yup. xml? Uhh... if you're that crazy.

Here's how you create a table with an array column in postgres:

(postgres@[local]:5432 13:55:54) [test]> create table array_test (id serial primary key, vals text[]);
CREATE TABLE
Time: 42.664 ms

we don't need a primary key, but it will help with the examples.

Inserting array values:

(postgres@[local]:5432 13:57:05) [test]> insert into array_test (vals) values ('{foo, bar, "lol wut"}');
INSERT 0 1
Time: 3.938 ms

This is the quoted string format. You can also use the array format:

(postgres@[local]:5432 13:58:35) [test]> insert into array_test (vals) values (array['que', 'hrm', 'with spaces']);
INSERT 0 1
Time: 2.029 ms

And this is what they look like:

 id |          vals
----+-------------------------
  1 | {foo,bar,"lol wut"}
  2 | {que,hrm,"with spaces"}

Updates can set the entire column value using the formats above. You can also set individual values:

(postgres@[local]:5432 14:03:47) [test]> begin; update array_test set vals[1] = 'fail' where id = 2;
BEGIN
Time: 0.199 ms
UPDATE 1
Time: 3.332 ms
(postgres@[local]:5432 14:04:09) [test]> commit;
COMMIT
Time: 3.733 ms
(postgres@[local]:5432 14:04:38) [test]> select * from array_test where id = 2;
 id |           vals
----+--------------------------
  2 | {fail,hrm,"with spaces"}

Note that arrays in postgres (and SQL) begin with index 1, not index 0.

Uninteresting so far

I agree. But let's play with some of the array functions that postgres provides.

One of the most useful is unnest:

(postgres@[local]:5432 14:04:46) [test]> select unnest(vals) from array_test;
   unnest
-------------
 foo
 bar
 lol wut
 fail
 hrm
 with spaces

With unnest, you can take array values and make them queryable. You can then perform other operations, such as group by to get, for instance, counts of individual items:

(postgres@[local]:5432 14:16:43) [test]> select val, count(*) from (select unnest(vals) val from array_test) as s group by 1;
     val     | count
-------------+-------
 foo         |     1
 fail        |     1
 hrm         |     1
 with spaces |     1
 bar         |     1
 lol wut     |     1

unnest is probably the single most useful function. The function cardinality is next and returns the number of elements in the array:

(postgres@[local]:5432 14:24:55) [test]> select *, cardinality(vals) from array_test;
 id |           vals           | cardinality
----+--------------------------+-------------
  1 | {foo,bar,"lol wut"}      |           3
  2 | {fail,hrm,"with spaces"} |           3

Operators add some more flavor

Operators can be used for logical tests or array modifications. For most of my uses of arrays, operators dominate in their manipulation.

The &&, @>, and <@ operators are commonly used logical operators. These allow you to test for the overlap of two arrays and the containment of one array by another.

postgres@[local]:5432 14:24:57) [test]> select * from array_test where vals && '{fail}';
 id |           vals
----+--------------------------
  2 | {fail,hrm,"with spaces"}
(1 row)

(postgres@[local]:5432 14:34:00) [test]> select * from array_test where vals @> '{fail,hrm}';
 id |           vals
----+--------------------------
  2 | {fail,hrm,"with spaces"}
(1 row)

(postgres@[local]:5432 14:34:34) [test]> select * from array_test where vals @> '{fail,hrm,nope}';
 id | vals
----+------
(0 rows)

(postgres@[local]:5432 14:34:40) [test]> select * from array_test where vals <@ '{fail,hrm}';
 id | vals
----+------
(0 rows)

(postgres@[local]:5432 14:35:41) [test]> select * from array_test where vals >@ '{fail,hrm,"with spaces","some other value"}';
 id |           vals
----+--------------------------
  2 | {fail,hrm,"with spaces"}
(1 row)

The most useful array manipulation operator is ||. This is used to join together arrays or to append values to arrays:

(postgres@[local]:5432 14:40:32) [test]> select vals || 'hi'::text from array_test where id = 1;
        ?column?
------------------------
 {foo,bar,"lol wut",hi}
(1 row)

Time: 1.218 ms
(postgres@[local]:5432 14:40:45) [test]> select vals || '{hi, bye}' from array_test where id = 1;
          ?column?
----------------------------
 {foo,bar,"lol wut",hi,bye}
(1 row)

Of course, these operators can be used in update or delete statements as well.

For a complete list of functions and operators, see the documentation.

`array_agg`

Several functions produce arrays as well. The most important of these is array_agg. This function allow you to aggregate values into an array with a query:

(postgres@[local]:5432 14:45:23) [test]> select array_agg(id) from array_test;
 array_agg
-----------
 {1,2}

Values of any type (as described above) can be aggregated:

(postgres@[local]:5432 14:45:40) [test]> create table timestamp_test (id serial primary key, val timestamp);
CREATE TABLE
Time: 13.396 ms
(postgres@[local]:5432 14:47:37) [test]> insert into timestamp_test (val) values ('2016-01-01'), ('2015-01-01'), ('1970-01-01');
INSERT 0 3
Time: 2.914 ms
(postgres@[local]:5432 14:48:03) [test]> select array_agg(val) from timestamp_test;
                              array_agg
---------------------------------------------------------------------
 {"2016-01-01 00:00:00","2015-01-01 00:00:00","1970-01-01 00:00:00"}
(1 row)

Time: 0.385 ms

Applications

But what can I actually use arrays for? There are a handful that I have experience with:

Lists
Tags
Materialized tree children

Lists. The obvious is lists. Lists of any type can be captured in this way. Say your product needs to save some list of user preferences, and you encode those preferences as strings. User 1 has email preferences of SPAM and HAM. You can just set the user's email preferences like:

(postgres@[local]:5432 14:48:23) [test]> update users set email_preferences = '{SPAM,HAM}' where id = 1;

If you had used a separate table for this, you'd need to check which values exist first (to prevent integrity issues) or delete the old values and recreate everything. For simple lists, using arrays can help simplify your code.

Tags. This is how I feel in love with arrays. User-defined tags can be applied, for example, to displayed posts. This isn't that powerful until you want to filter to posts have a certain tag. The old way, with a separate tags table:

select
    *
from
    posts a
where
    exists (
        select
            *
        from
            post_tags x
        where
            a.id = x.post_id
            and x.tag = 'foo'
    )
;

Which is quite a bit to type and can be tough to get your ORM to generate. It can also slow down as you get more and more tags, even with proper indexes.

Inline tags?

select * from posts where tags && '{foo}'

And you can simply add more tags or specify that multiple tags must match. As a fabulous bonus, you can create indexes on arrays in postgres! A query like the above over hundreds of millions of results is still quite performant. To create such an index:

(postgres@[local]:5432 15:07:54) [test]> create index idx_array_test on array_test using gin (vals);
CREATE INDEX

using specifies gin, which is a Generalized INverted index. This is a good choice in general. Specialized uses may benefit from using gist. You can learn more about these index types here.

Materialized tree children. SQL is not particularly well-suited for dealing with data in tree form. Querying is not particularly efficient in terms of query writing or performance. Arrays can help you here when you have tree-structured data that is read more often than updated (which is true of most data, in my experience). You can store the direct or total set of children on each node and maintain this using a trigger. You can then quickly find all children from a node by querying for the ids of the children contained in that node's children array.

But shouldn't you be normalizing?

No. Well. Yes. Er. Maybe? Choosing the right abstraction for your needs is important. Using an array can be much simpler. But as soon as you need to encapsulate multiple values you should probably create a second table.

Performance is also a consideration. In the example given above, tags provided an enormous speed improvement when I used them at work. But always measure.

Lastly, arrays may be easier to use with your programming language or ORM. While many ORMs also allow you to treat related tables as lists natively, this may not always be available.

Wishlist

A major barrier to adopting arrays and using them effectively is foreign keys. In the tree children example above, failure to maintain the list of children correctly may result in incorrect results or other errors. Too, if you want to enforce a specific list of tags (or relate other information to those tags) it's a completely manual process.

Conclusion

This is the first post in what will be a series about the many awesome features of Postgres. While not comprehensive I hope it gives you some ideas about how to use arrays (or not to use arrays) in your codebase.