On a lovely Saturday afternoon our NetOps team raised an alarm: the disk usage on a specific database server was growing fast, and at the current growth rate we had about 30 hours left until the disk would be full. Usage had just climbed over 80%, which is what triggered the alarm in the first place. Not good. Nothing serious yet, but better to investigate now than to wait until Monday.

Zabbix Graph: Space utilization

The System

This particular server has about 3.5TB of disk space for the root partition (/), about 28TB for /var/lib/postgresql - which is about 6% used, and about 47TB for /var/lib/postgresql/nvme - this is the disk which was getting full. The data directory for PostgreSQL is linked into /var/lib/postgresql/nvme, and while we still have plenty of disk space on /var/lib/postgresql and could move objects over, it’s better to investigate what is causing the higher than normal disk usage.
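
A quick way to see which of the three filesystems is actually filling up is to check all mount points at once (mount points as listed above):

postgres@db ~ $ df -h / /var/lib/postgresql /var/lib/postgresql/nvme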

This setup consists of 3 servers, all with the same storage capacity: one primary and two replicas, using physical replication. But only the primary is affected; the replicas are healthy, their disk space usage is normal, and they don't show any replication lag.
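
Checking replication lag on the primary is a one-liner against pg_stat_replication; something along these lines (assuming PostgreSQL 10 or newer for the function and column names):

postgres@db ~ $ psql -x -c "SELECT application_name, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"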

A first scan shows that pg_wal is 11TB in size. Something is causing the database to not recycle WAL files. By now the pg_wal directory holds about 1.3M files, and the number is still growing. This is the total count of entries under pg_wal, which includes the files under pg_wal/archive_status as well. Since the list of entries was too large for filename expansion, a simple find | wc -l was used (shown below).

postgres@db ~ $ du -sh db/data/pg_wal/

11T     db/data/pg_wal/
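
The file count mentioned above was obtained along these lines - find avoids the filename expansion a plain glob would need (path as in the du command above):

postgres@db ~ $ find db/data/pg_wal/ | wc -l        # roughly 1.3 million entries at this point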

Investigation

The first idea is always that one of the replicas is not working properly and the primary is holding on to WAL files for it. But that would raise a different alarm, and it would have been raised much earlier. And as stated above, both replicas show no replication lag.
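
If a replication slot for one of the replicas was holding WAL back, that would also be visible in pg_replication_slots - a quick query rules this out (assuming slots are in use at all in this setup):

postgres@db ~ $ psql -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"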

A peek into the database log shows this error message:

2023-01-28 15:19:17.577 UTC,,,4820,,63d2ab67.12d4,1363028,,2023-01-26 16:33:43 UTC,,0,WARNING,01000,"archiving write-ahead log file ""0000000200093E7A00000062"" failed too many times, will try again later",,,,,,,,,""
2023-01-28 15:19:17.677 UTC,,,4820,,63d2ab67.12d4,1363029,,2023-01-26 16:33:43 UTC,,0,LOG,00000,"archive command failed with exit code 41","The failed archive command was: /usr/bin/pgbackrest --stanza=db-large archive-push pg_wal/0000000200093E7A00000062",,,,,,,,""
2023-01-28 15:19:17.848 UTC,,,4820,,63d2ab67.12d4,1363030,,2023-01-26 16:33:43 UTC,,0,LOG,00000,"archive command failed with exit code 41","The failed archive command was: /usr/bin/pgbackrest --stanza=db-large archive-push pg_wal/0000000200093E7A00000062",,,,,,,,""
2023-01-28 15:19:18.044 UTC,,,4820,,63d2ab67.12d4,1363031,,2023-01-26 16:33:43 UTC,,0,LOG,00000,"archive command failed with exit code 41","The failed archive command was: /usr/bin/pgbackrest --stanza=db-large archive-push pg_wal/0000000200093E7A00000062",,,,,,,,""
2023-01-28 15:19:18.044 UTC,,,4820,,63d2ab67.12d4,1363032,,2023-01-26 16:33:43 UTC,,0,WARNING,01000,"archiving write-ahead log file ""0000000200093E7A00000062"" failed too many times, will try again later",,,,,,,,,""
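
The same failure is also visible from inside the database: pg_stat_archiver keeps a failure counter and the name of the last WAL segment that failed to archive, which is a quicker check than digging through the log:

postgres@db ~ $ psql -x -c "SELECT failed_count, last_failed_wal, last_failed_time, archived_count, last_archived_wal FROM pg_stat_archiver;"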

Another look at the logfiles themselves shows that this has been going on for a while longer:

-rw-r--r-- 1 postgres postgres  31977717 Jan 26 23:59 db/pg_logs/pg_2023-01-26_160000.csv
-rw-r--r-- 1 postgres postgres 195048756 Jan 27 23:59 db/pg_logs/pg_2023-01-27_000000.csv
-rw-r--r-- 1 postgres postgres 148402413 Jan 28 15:20 db/pg_logs/pg_2023-01-28_000000.csv

The logfile from Thursday looks normal, while the logfiles from Friday and Saturday are much larger. Scanning through the log reveals that the problems with pgBackRest started almost immediately after the database was restarted on Thursday. That's a new problem - we haven't seen that before. We later found that the database was restarted at 16:33:43 UTC, and the first error in the logfile appears at 16:34:17 UTC.

At this point we did not really know what was going on: pgBackRest only logs errors in our configuration, and the error in the database log is not useful. We then stopped pgBackRest and tried starting it manually - which resulted in the same error. The configuration in /etc/pgbackrest/pgbackrest.conf looked ok, however trying to access the pgBackRest logs resulted in a problem:

postgres@db ~ $ ls -ld /var/log/pgbackrest/*
ls: cannot access '/var/log/pgbackrest/db-archive-get-async.log': Permission denied
ls: cannot access '/var/log/pgbackrest/db-archive-get-async.log-20220511.gz': Permission denied
ls: cannot access '/var/log/pgbackrest/db-archive-get-async.log-20220512': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log-20220519.gz': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log-20220521.gz': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log-20220522.gz': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log-20220523.gz': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-archive-push-async.log-20220524': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-expire.log': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-expire.log-20220520': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-stanza-create.log': Permission denied
ls: cannot access '/var/log/pgbackrest/db-large-stanza-create.log-20220515': Permission denied
ls: cannot access '/var/log/pgbackrest/db-restore.log': Permission denied
ls: cannot access '/var/log/pgbackrest/db-restore.log-20220511': Permission denied
ls: cannot access '/var/log/pgbackrest/all-server.log': Permission denied

Could this be the problem?

postgres@db ~ $ ls -ld /var/log/pgbackrest/
drw-r----- 2 postgres postgres 4096 May 24  2022 /var/log/pgbackrest/

The -x bit for the owner permissions is missing, also known as S_IXUSR. On a directory this bit is required to search or traverse it: without it, the files inside the directory cannot be accessed, even if the files' own permissions would allow it.
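
The effect is easy to reproduce with a throwaway directory (hypothetical path, nothing to do with the incident):

postgres@db ~ $ mkdir /tmp/permdemo && touch /tmp/permdemo/file
postgres@db ~ $ chmod 0640 /tmp/permdemo            # file-style permissions on a directory, no x bit
postgres@db ~ $ cat /tmp/permdemo/file
cat: /tmp/permdemo/file: Permission denied
postgres@db ~ $ chmod 0750 /tmp/permdemo            # restore the owner x bit
postgres@db ~ $ cat /tmp/permdemo/file              # readable again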

pgBackRest cannot access this directory, throws a cryptic error message, and then stops working. Except it does not exit - it just stays around and consumes CPU.

Screenshot: pgBackRest processes running

The Bugfix - or so we thought

NetOps stopped pgBackRest again, fixed the permissions of the directory to 0750, and started it again. But suddenly the permissions were back to 0640 - how about that?

Since the only action taken between the two steps was using the OpenRC init script, something must be wrong there. And indeed, there is a line in the init script which forces permissions for the directory:

        checkpath -d -o postgres:postgres -m 0640 "/var/log/pgbackrest"
        checkpath -f -o postgres:postgres -m 0640 "/var/log/pgbackrest/all-server.log"

It's two lines: one for the directory, one for the logfile. The permissions for the logfile are correct, however the permissions for the directory are the same as for the file - and that is wrong. That is what is causing the problem. Probably whoever wrote this init script just copied the line and forgot to change the mode for the directory. And no review caught this. It was working before, since pgBackRest had never been restarted along with the database before. Only on this fateful last Thursday …

The actual bugfix

The line in the OpenRC init script was temporarily changed to fix the permissions of the directory. Then pgBackRest was restarted again, and this time it no longer throws error messages into the logfile. Progress!
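
The temporary change amounts to giving the directory its own mode - presumably 0750, matching what NetOps had already set by hand - while the logfile line stays as it is:

        checkpath -d -o postgres:postgres -m 0750 "/var/log/pgbackrest"
        checkpath -f -o postgres:postgres -m 0640 "/var/log/pgbackrest/all-server.log"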

It started to slowly pick up WAL files from the database, but given the number of WAL files in pg_wal, this would take a while to recover. In the end it took about a day - still much faster than it took to generate all that WAL. PostgreSQL 15 has an improvement for pgarch_readyXlog() which avoids scanning pg_wal/archive_status for every single file. This will help in the future, once we have upgraded this system.
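
The recovery progress is easy to watch by counting the .ready files the archiver still has to work through - the number should shrink over time (path assumed as in the du command earlier):

postgres@db ~ $ watch -n 60 'find db/data/pg_wal/archive_status/ -name "*.ready" | wc -l'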

We also discovered that even on a weekend this particular system still generates around 400 WAL files per minute.

Aftermath

We need better monitoring for WAL files on this system. The general monitoring for disk space caught this early enough, but getting an alert when we hit ~500 GB WAL is better than getting an alert when we hit 11 TB.
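
A minimal sketch of such a check, assuming a 500 GB threshold and the usual pg_wal location on this system (in practice this would be wired into the existing Zabbix monitoring rather than run as a standalone script):

#!/bin/sh
# Hypothetical WAL size check: warn once pg_wal grows past the threshold.
PGWAL="/var/lib/postgresql/db/data/pg_wal"     # path is an assumption, adjust to the real data directory
THRESHOLD_KB=$((500 * 1024 * 1024))            # 500 GB expressed in KB
USED_KB=$(du -sk "$PGWAL" | cut -f1)
if [ "$USED_KB" -gt "$THRESHOLD_KB" ]; then
    echo "WARNING: pg_wal uses ${USED_KB} KB, threshold is ${THRESHOLD_KB} KB"
    exit 1
fi
exit 0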

The bug in the OpenRC init script must be fixed, and deployed to all affected systems.

The current backup in pgBackRest must be verified, just in case. And it might be a good idea to generate a new full backup and rotate the current one out.
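
pgBackRest has commands for both tasks; roughly (stanza name as used above, the verify command needs a reasonably recent pgBackRest version):

postgres@db ~ $ pgbackrest --stanza=db-large verify
postgres@db ~ $ pgbackrest --stanza=db-large --type=full backup
postgres@db ~ $ pgbackrest --stanza=db-large info      # check the resulting backup set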