This is an attempt to build a fairly comprehensive list of metrics you should be monitoring on a PostgreSQL 9.4 database server. If you’ve found any particular metrics useful which aren’t listed here, please let me know.

Many thanks to the Government PaaS team at GDS, who I’ve been working with, for giving me the chance to put this list together.

System Metrics

You should be recording a comprehensive set of system performance metrics for any database server, but these are the ones you really want to keep an eye on:

Free disk space

This seems obvious, but you want to keep at least 10% of the disk free on the Postgres data partition, since disk usage can fluctuate during vacuuming, especially under heavy write loads. Running out of disk space will (at best) be detrimental to your availability. Alert on it.

CPU Usage

Again, pretty self-explanatory; max out the CPU and you’re going to get slowdowns. High user CPU usage can indicate badly optimised queries, but if your queries are reasonably well-optimised, being CPU-bound is a pretty good situation for Postgres to be in.

High system CPU usage can be an indication of too much query parallelism (also check out the number of context switches).

I/O usage

The percentage of CPU time spent in iowait should be your first port of call if you’re seeing Postgres slowness – it tells you how much time the machine is spending waiting for the disk.

It’s worth keeping metrics for the various individual I/O stats (iops, merged I/O transactions, queue size, service time, average wait time, etc.) which will help you drill down on any I/O problems. If you’ve got separate partitions for WAL and data, break those down.

Postgres Metrics

Total number of connections

This will tell you how close you are to hitting your max_connections limit, and show up any clients which are leaking database connections.

SELECT count(*) FROM pg_stat_activity;
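
If you want a single number to alert on, you can compare that count against the configured limit. A rough sketch (it ignores connections reserved by superuser_reserved_connections):

-- fraction of max_connections currently in use
SELECT count(*)::float / current_setting('max_connections')::int AS connection_ratio
  FROM pg_stat_activity;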

Number of connections by state

This query breaks down connections by state:

SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

The possible states of interest are:

active
Connections currently executing queries. A large number tends to indicate DB slowness.
idle
Idle connections, not in a transaction.
idle in transaction
Connections with an open transaction but not currently executing a query. Lots of these can indicate long-running transactions (the query after this list shows how long they’ve been idle).
idle in transaction (aborted)
Connections with an open transaction in which an error has occurred and the transaction hasn’t been rolled back.
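
For the idle in transaction case, something like this shows how long each of those connections has been sitting idle and what it last ran:

-- idle-in-transaction connections, longest idle first
SELECT pid, now() - state_change AS idle_for, query
  FROM pg_stat_activity
 WHERE state = 'idle in transaction'
 ORDER BY idle_for DESC;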

Connections waiting for a lock

The number of connections blocked waiting for a lock can be an indicator of a slow transaction with an exclusive lock.

SELECT count(distinct pid) FROM pg_locks WHERE granted = false;
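
To see what the blocked connections are actually running (9.4 still has the boolean waiting column in pg_stat_activity), a simple follow-up is:

-- queries currently blocked on a lock, and how long they've been waiting
SELECT pid, now() - query_start AS waiting_for, query
  FROM pg_stat_activity
 WHERE waiting;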

Maximum transaction age

Long-running transactions are bad because they prevent Postgres from vacuuming old data. This causes database bloat and, in extreme circumstances, shutdown due to transaction ID (xid) wraparound. Transactions should be kept as short as possible, ideally less than a minute.

Alert if this number gets greater than an hour or so.

SELECT max(now() - xact_start) FROM pg_stat_activity
  WHERE state IN ('idle in transaction', 'active');
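
On the wraparound point, it’s also worth tracking how close each database is to the xid limit (roughly two billion). One common check is the age of datfrozenxid:

-- transaction ID age per database; autovacuum should keep this well below ~2 billion
SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
 ORDER BY xid_age DESC;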

Checkpoint interval

Checkpointing too frequently is bad for performance. Postgres will warn you in its logs about frequent checkpoints, or you can monitor the frequency via the pg_stat_bgwriter view (more info in this article by 2ndQuadrant).
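
A minimal sketch: sample these counters periodically and graph their rate of change. A fast-growing checkpoints_req relative to checkpoints_timed means checkpoints are being forced early rather than running on schedule:

-- cumulative checkpoint counters since stats_reset
SELECT checkpoints_timed, checkpoints_req, stats_reset
  FROM pg_stat_bgwriter;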

Query execution time

Ideally, measure this at your application level rather than in Postgres. Alternatively, periodically turn on full query logging by setting log_min_duration_statement = 0 and analyse the logs, or investigate the pg_stat_statements module (tools linked below).

Beware that a global average query execution time means very little – you should be breaking execution time down by query, and distribution often matters more than the average. Measure percentiles, especially when your web page loads depend on these queries.
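
If you do go down the pg_stat_statements route, a sketch of a “top queries by total execution time” report looks like this (assuming the extension is in shared_preload_libraries and has been created with CREATE EXTENSION pg_stat_statements):

-- the ten queries consuming the most total execution time (times in ms)
SELECT query, calls, total_time, total_time / calls AS mean_time
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 10;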

Log Archiving

If you’re archiving logs, for example with WAL-E, you should at least:

  • Check that your WAL files are being archived (there should be a new one at least every archive_timeout seconds; see the pg_stat_archiver query below).
  • Check your last base backup ran and succeeded.
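
On 9.4 the first check is straightforward to script against the pg_stat_archiver view: alert if last_archived_time stops advancing or failed_count starts climbing.

-- archiver progress and failures since stats were last reset
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;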

Streaming Replication

Standby server status

Check this to tell you whether your standby is connected and replicating (run this query on the primary):

SELECT state FROM pg_stat_replication WHERE client_hostname = '<hostname>';
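
It can also be worth a check from the standby’s side that it really is in recovery and applying WAL (run these on the standby):

-- true if this server is a standby still replaying WAL
SELECT pg_is_in_recovery();

-- last WAL location received from the primary, and last location replayed
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();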

Standby server lag in bytes

This tells you how many bytes behind the primary the standby is (again, run on the primary):

SELECT pg_xlog_location_diff(sent_location, replay_location) AS byte_lag
  FROM pg_stat_replication WHERE client_hostname = '<hostname>';

There are queries which tell you the standby lag in seconds, but these tend to break on low-traffic databases when there are no writes being replicated.

More Performance Troubleshooting Tools

  • I have a collection of Postgres views which I’ve found useful when investigating performance issues.
  • pg_stat_statements is a module which collects detailed, normalised query statistics and exposes them as a database view for analysis.
  • pg-query-analyzer and pgbadger are two query log analysis tools. They’re good for doing one-off analysis, but you don’t want to be logging all queries continuously on a busy server.
  • pganalyze is a hosted performance monitoring service which makes use of pg_stat_statements. Their stats collector process is open source.
  • PgBouncer is a Postgres connection pool which can be helpful with some Postgres workloads. It provides some additional metrics which might be useful.

I'm currently looking for contract work in London or remote — if you're interested, get in touch.

To comment on this post, mention me on twitter, or drop me an email.