LZ4 and ZSTD pg_dump compression

I wrote a “pg_dump compression specifications in PostgreSQL 16” post a while ago. Frankly speaking, I thought new compression methods would not land in PostgreSQL for another two or three years. Apparently demand was high enough that LZ4 and ZSTD have made their way into PostgreSQL 16!

The LZ4 patch was authored by Georgios Kokolatos, committed by Tomas Vondra, and reviewed by Michael Paquier, Rachel Heaton, Justin Pryzby, Shi Yu, and Tomas Vondra. The commit message reads:

Expand pg_dump's compression streaming and file APIs to support the lz4
algorithm. The newly added compress_lz4.{c,h} files cover all the
functionality of the aforementioned APIs. Minor changes were necessary
in various pg_backup_* files, where code for the 'lz4' file suffix has
been added, as well as pg_dump's compression option parsing.

Author: Georgios Kokolatos
Reviewed-by: Michael Paquier, Rachel Heaton, Justin Pryzby, Shi Yu, Tomas Vondra
Discussion: https://postgr.es/m/faUNEOpts9vunEaLnmxmG-DldLSg_ql137OC3JYDmgrOMHm1RvvWY2IdBkv_CRxm5spCCb_OmKNk2T03TMm0fBEWveFF9wA1WizPuAgB7Ss%3D%40protonmail.com

The ZSTD patch was authored by Justin Pryzby, committed by Tomas Vondra, and reviewed by Tomas Vondra, Jacob Champion, and Andreas Karlsson. The commit message reads:

Allow pg_dump to use the zstd compression, in addition to gzip/lz4. Bulk
of the new compression method is implemented in compress_zstd.{c,h},
covering the pg_dump compression APIs. The rest of the patch adds test
and makes various places aware of the new compression method.

The zstd library (which this patch relies on) supports multithreaded
compression since version 1.5. We however disallow that feature for now,
as it might interfere with parallel backups on platforms that rely on
threads (e.g. Windows). This can be improved / relaxed in the future.

This also fixes a minor issue in InitDiscoverCompressFileHandle(), which
was not updated to check if the file already has the .lz4 extension.

Adding zstd compression was originally proposed in 2020 (see the second
thread), but then was reworked to use the new compression API introduced
in e9960732a9. I've considered both threads when compiling the list of
reviewers.

Author: Justin Pryzby
Reviewed-by: Tomas Vondra, Jacob Champion, Andreas Karlsson
Discussion: https://postgr.es/m/20230224191840.GD1653@telsasoft.com
Discussion: https://postgr.es/m/20201221194924.GI30237@telsasoft.com

Let’s try it out!

~$ pg_dump --version
pg_dump (PostgreSQL) 16devel

~$ pgbench --initialize --scale=100
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 39.52 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 49.65 s (drop tables 0.00 s, create tables 0.08 s, client-side generate 39.96 s, vacuum 0.29 s, primary keys 9.32 s).


~$ psql --command="select pg_size_pretty(pg_database_size('postgres'))"
 pg_size_pretty 
----------------
 1503 MB
(1 row)

~$ time pg_dump --format=custom --compress=lz4:9 > dump.lz4

real    0m10.507s
user    0m9.901s
sys     0m0.436s

~$ time pg_dump --format=custom --compress=zstd:9 > dump.zstd

real    0m8.794s
user    0m8.393s
sys     0m0.364s

~$ time pg_dump --format=custom --compress=gzip:9 > dump.gz

real    0m14.245s
user    0m13.064s
sys     0m0.978s

~$ time pg_dump --format=custom --compress=lz4 > dump_default.lz4

real    0m6.809s
user    0m1.666s
sys     0m1.125s

~$ time pg_dump --format=custom --compress=zstd > dump_default.zstd

real    0m7.534s
user    0m2.428s
sys     0m0.892s

~$ time pg_dump --format=custom --compress=gzip > dump_default.gz

real    0m11.564s
user    0m10.661s
sys     0m0.525s

~$ time pg_dump --format=custom --compress=lz4:3 > dump_3.lz4

real    0m8.497s
user    0m7.856s
sys     0m0.507s

~$ time pg_dump --format=custom --compress=zstd:3 > dump_3.zstd

real    0m5.129s
user    0m2.228s
sys     0m0.726s

~$ time pg_dump --format=custom --compress=gzip:3 > dump_3.gz

real    0m4.468s
user    0m3.654s
sys     0m0.504s

~$ ls -l --block-size=M
total 250M
-rw-rw-r-- 1 postgres postgres 28M Apr 18 13:58 dump_3.gz
-rw-rw-r-- 1 postgres postgres 48M Apr 18 13:57 dump_3.lz4
-rw-rw-r-- 1 postgres postgres  8M Apr 18 13:58 dump_3.zstd
-rw-rw-r-- 1 postgres postgres 27M Apr 18 13:57 dump_default.gz
-rw-rw-r-- 1 postgres postgres 50M Apr 18 13:56 dump_default.lz4
-rw-rw-r-- 1 postgres postgres  8M Apr 18 13:57 dump_default.zstd
-rw-rw-r-- 1 postgres postgres 27M Apr 18 13:56 dump.gz
-rw-rw-r-- 1 postgres postgres 48M Apr 18 13:55 dump.lz4
-rw-rw-r-- 1 postgres postgres  8M Apr 18 13:56 dump.zstd
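
Restoring needs no extra flags: the custom-format archive records the compression method, so pg_restore handles gzip-, lz4- and zstd-compressed dumps the same way, provided it was built with the same compression libraries. A minimal sketch, assuming a freshly created target database called restored_db (the name is mine, not part of the test above):

~$ createdb restored_db
~$ pg_restore --dbname=restored_db dump.zstd
~$ pg_restore --list dump_default.lz4

The --list call just prints the archive’s table of contents without restoring anything, which is a quick way to confirm the dump is readable.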

Based on the output of the commands, we can conclude the following about the three compression methods:

  • gzip: a well-known and widely used algorithm that offers a good balance between compression ratio and speed. The gzip-compressed dumps come in at 27-28 MB.
  • lz4: a very fast algorithm that prioritizes compression and decompression speed over compression ratio. The lz4-compressed dumps are in the range of 48-50 MB, significantly larger than the gzip-compressed ones.
  • zstd: a relatively new algorithm that combines a high compression ratio with good compression and decompression speed. The zstd-compressed dumps are around 8 MB, the smallest of the three.

The big surprise to me is that at level 9, zstd takes the least amount of time to compress, followed by lz4 and gzip (at the default level, lz4 is slightly faster). This data set is probably not the best basis for measurements and comparisons; however, that’s a topic for another blog post. At the default compression level, zstd produces the smallest dump file, followed by gzip and lz4. At level 9, zstd still produces the smallest dump file, followed by gzip and lz4.

Based on these observations, if your priority is to reduce disk space usage, zstd is the recommended compression method. However, if your priority is to minimize compression time, zstd and lz4 both perform well. If compatibility with other utilities is a concern, gzip remains a viable option.
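
One practical caveat: pg_dump can only offer lz4 and zstd when the PostgreSQL binaries were built against those libraries (the autoconf options are --with-lz4 and --with-zstd; most distribution packages enable them). A rough check for a source build, assuming pg_config is in the PATH:

~$ pg_config --configure | grep --only-matching --extended-regexp 'with-(lz4|zstd)'

If a library is missing, pg_dump rejects the corresponding --compress value.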

Finally…

pg_dump’s -Z/--compress option in PostgreSQL 16 will support more than just an integer: it can specify both the compression method and the compression level. The default is still gzip at level 6, but the new kids on the block, lz4 and zstd, are already here!
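
For reference, the new value of -Z/--compress is a compression specification of the form method[:level]; a plain integer is still accepted for backward compatibility and selects gzip at that level (0 means no compression). A few illustrative invocations against the same database as above (the output file names are arbitrary):

~$ pg_dump --format=custom --compress=none --file=dump_plain.dump
~$ pg_dump --format=custom --compress=lz4 --file=dump_lz4_default.lz4
~$ pg_dump --format=custom --compress=zstd:9 --file=dump_level9.zstd
~$ pg_dump --format=custom --compress=5 --file=dump_gzip5.gz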

That said, pg_dump is often used to update or upgrade a database. If you want to understand the difference between an update and an upgrade, check out this blog post by Hans-Jürgen Schönig, or see our other related publications about updating and upgrading.

