Postgres, the Good Parts: Arrays
21 August 2016
PostgreSQL, also known as Postgres, is "the world's most advanced open source database". But what makes it so advanced? In this post, we'll be covering what may be my favorite feature: arrays.
The Basics
Arrays allow you store lists of values in a single column.
SQL, being strongly typed, allows for only heterogeneous
arrays. Any non-array concrete type can be used as the type
of an array. Arrays of text
? Of course.
integer
? Yup. xml
? Uhh... if
you're that crazy.
Here's how you create a table with an array column in postgres:
(postgres@[local]:5432 13:55:54) [test]> create table array_test (id serial primary key, vals text[]); CREATE TABLE Time: 42.664 mswe don't need a primary key, but it will help with the examples.
Inserting array values:
(postgres@[local]:5432 13:57:05) [test]> insert into array_test (vals) values ('{foo, bar, "lol wut"}'); INSERT 0 1 Time: 3.938 ms
This is the quoted string format. You can also use the
array
format:
(postgres@[local]:5432 13:58:35) [test]> insert into array_test (vals) values (array['que', 'hrm', 'with spaces']); INSERT 0 1 Time: 2.029 ms
And this is what they look like:
id | vals ----+------------------------- 1 | {foo,bar,"lol wut"} 2 | {que,hrm,"with spaces"}
Updates can set the entire column value using the formats above. You can also set individual values:
(postgres@[local]:5432 14:03:47) [test]> begin; update array_test set vals[1] = 'fail' where id = 2; BEGIN Time: 0.199 ms UPDATE 1 Time: 3.332 ms (postgres@[local]:5432 14:04:09) [test]> commit; COMMIT Time: 3.733 ms (postgres@[local]:5432 14:04:38) [test]> select * from array_test where id = 2; id | vals ----+-------------------------- 2 | {fail,hrm,"with spaces"}
Note that arrays in postgres (and SQL) begin with index 1, not index 0.
Uninteresting so far
I agree. But let's play with some of the array functions that postgres provides.
One of the most useful is unnest
:
(postgres@[local]:5432 14:04:46) [test]> select unnest(vals) from array_test; unnest ------------- foo bar lol wut fail hrm with spaces
With unnest
, you can take array values and
make them queryable. You can then perform other operations,
such as group by
to get, for instance, counts
of individual items:
(postgres@[local]:5432 14:16:43) [test]> select val, count(*) from (select unnest(vals) val from array_test) as s group by 1; val | count -------------+------- foo | 1 fail | 1 hrm | 1 with spaces | 1 bar | 1 lol wut | 1
unnest
is probably the single most useful
function. The function cardinality
is next and
returns the number of elements in the array:
(postgres@[local]:5432 14:24:55) [test]> select *, cardinality(vals) from array_test; id | vals | cardinality ----+--------------------------+------------- 1 | {foo,bar,"lol wut"} | 3 2 | {fail,hrm,"with spaces"} | 3
Operators add some more flavor
Operators can be used for logical tests or array modifications. For most of my uses of arrays, operators dominate in their manipulation.
The &&
, @>
, and
<@
operators are commonly used logical
operators. These allow you to test for the overlap of two
arrays and the containment of one array by another.
postgres@[local]:5432 14:24:57) [test]> select * from array_test where vals && '{fail}'; id | vals ----+-------------------------- 2 | {fail,hrm,"with spaces"} (1 row) (postgres@[local]:5432 14:34:00) [test]> select * from array_test where vals @> '{fail,hrm}'; id | vals ----+-------------------------- 2 | {fail,hrm,"with spaces"} (1 row) (postgres@[local]:5432 14:34:34) [test]> select * from array_test where vals @> '{fail,hrm,nope}'; id | vals ----+------ (0 rows) (postgres@[local]:5432 14:34:40) [test]> select * from array_test where vals <@ '{fail,hrm}'; id | vals ----+------ (0 rows) (postgres@[local]:5432 14:35:41) [test]> select * from array_test where vals >@ '{fail,hrm,"with spaces","some other value"}'; id | vals ----+-------------------------- 2 | {fail,hrm,"with spaces"} (1 row)
The most useful array manipulation operator is
||
. This is used to join together arrays or to
append values to arrays:
(postgres@[local]:5432 14:40:32) [test]> select vals || 'hi'::text from array_test where id = 1; ?column? ------------------------ {foo,bar,"lol wut",hi} (1 row) Time: 1.218 ms (postgres@[local]:5432 14:40:45) [test]> select vals || '{hi, bye}' from array_test where id = 1; ?column? ---------------------------- {foo,bar,"lol wut",hi,bye} (1 row)
Of course, these operators can be used in
update
or delete
statements as
well.
For a complete list of functions and operators, see the documentation.
array_agg
Several functions produce arrays as well. The most
important of these is array_agg
. This function
allow you to aggregate values into an array with a query:
(postgres@[local]:5432 14:45:23) [test]> select array_agg(id) from array_test; array_agg ----------- {1,2}
Values of any type (as described above) can be aggregated:
(postgres@[local]:5432 14:45:40) [test]> create table timestamp_test (id serial primary key, val timestamp); CREATE TABLE Time: 13.396 ms (postgres@[local]:5432 14:47:37) [test]> insert into timestamp_test (val) values ('2016-01-01'), ('2015-01-01'), ('1970-01-01'); INSERT 0 3 Time: 2.914 ms (postgres@[local]:5432 14:48:03) [test]> select array_agg(val) from timestamp_test; array_agg --------------------------------------------------------------------- {"2016-01-01 00:00:00","2015-01-01 00:00:00","1970-01-01 00:00:00"} (1 row) Time: 0.385 ms
Applications
But what can I actually use arrays for? There are a handful that I have experience with:
- Lists
- Tags
- Materialized tree children
Lists. The obvious is lists. Lists of any type can
be captured in this way. Say your product needs to save
some list of user preferences, and you encode those
preferences as strings. User 1 has email preferences of
SPAM
and HAM
. You can just set
the user's email preferences like:
(postgres@[local]:5432 14:48:23) [test]> update users set email_preferences = '{SPAM,HAM}' where id = 1;
If you had used a separate table for this, you'd need to check which values exist first (to prevent integrity issues) or delete the old values and recreate everything. For simple lists, using arrays can help simplify your code.
Tags. This is how I feel in love with arrays. User-defined tags can be applied, for example, to displayed posts. This isn't that powerful until you want to filter to posts have a certain tag. The old way, with a separate tags table:
select * from posts a where exists ( select * from post_tags x where a.id = x.post_id and x.tag = 'foo' ) ;
Which is quite a bit to type and can be tough to get your ORM to generate. It can also slow down as you get more and more tags, even with proper indexes.
Inline tags?
select * from posts where tags && '{foo}'
And you can simply add more tags or specify that multiple tags must match. As a fabulous bonus, you can create indexes on arrays in postgres! A query like the above over hundreds of millions of results is still quite performant. To create such an index:
(postgres@[local]:5432 15:07:54) [test]> create index idx_array_test on array_test using gin (vals); CREATE INDEX
using
specifies gin
, which is a
Generalized INverted index. This is a good choice in
general. Specialized uses may benefit from using
gist
. You can learn more about these index
types here.
Materialized tree children. SQL is not particularly well-suited for dealing with data in tree form. Querying is not particularly efficient in terms of query writing or performance. Arrays can help you here when you have tree-structured data that is read more often than updated (which is true of most data, in my experience). You can store the direct or total set of children on each node and maintain this using a trigger. You can then quickly find all children from a node by querying for the ids of the children contained in that node's children array.
But shouldn't you be normalizing?
No. Well. Yes. Er. Maybe? Choosing the right abstraction for your needs is important. Using an array can be much simpler. But as soon as you need to encapsulate multiple values you should probably create a second table.
Performance is also a consideration. In the example given above, tags provided an enormous speed improvement when I used them at work. But always measure.
Lastly, arrays may be easier to use with your programming language or ORM. While many ORMs also allow you to treat related tables as lists natively, this may not always be available.
Wishlist
A major barrier to adopting arrays and using them effectively is foreign keys. In the tree children example above, failure to maintain the list of children correctly may result in incorrect results or other errors. Too, if you want to enforce a specific list of tags (or relate other information to those tags) it's a completely manual process.
Conclusion
This is the first post in what will be a series about the many awesome features of Postgres. While not comprehensive I hope it gives you some ideas about how to use arrays (or not to use arrays) in your codebase.