Measuring the Memory Overhead of a Postgres Connection

One fairly common complaint about postgres is that each connection uses too much memory. It is often made when comparing postgres' connection model, in which each connection has a dedicated process, to one where each connection is assigned a dedicated thread.

To be clear: This is a worthwhile discussion to have. And there are several important improvements we could make to reduce memory usage.

That said, I think one common cause of these concerns is that the easy ways to measure the memory usage of a postgres backend, like top and ps, are quite misleading.

It is surprisingly hard to accurately measure the increase in memory usage by an additional connection.

In this post I’m mostly going to talk about running Postgres on Linux, since that is what I have the most experience with.

My bold claim is that, when measured accurately, a connection has an overhead of less than 2MiB (see the conclusion).

A first look

Just using the common operating system tools makes the overhead look much bigger than it actually is. Especially when not utilizing huge_pages (which is not recommended), the memory usage for each process will look high.

Let’s first look at a freshly established connection, in a freshly started postgres cluster:

andres@awork3:~$ psql
postgres[2003213][1]=# SELECT pg_backend_pid();
┌────────────────┐
│ pg_backend_pid │
├────────────────┤
│        2003213 │
└────────────────┘
(1 row)

andres@awork3:~/src/postgresql$ ps -q 2003213 -eo pid,rss
    PID   RSS
2003213 16944

About 16MiB.

Massive leaks!?! Luckily not.

What’s worse, the memory usage will appear to grow over time. To show this problem, I’ll use the pg_prewarm extension to load all of the database's relations into postgres' buffer pool:
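pg_prewarm ships as a contrib extension, so it has to be installed into the database once before it can be used:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;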

postgres[2003213][1]=# SHOW shared_buffers ;
┌────────────────┐
│ shared_buffers │
├────────────────┤
│ 16GB           │
└────────────────┘
(1 row)

postgres[2003213][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
┌────────┐
│  sum   │
├────────┤
│ 383341 │
└────────┘

andres@awork3:~$ ps -q 2003213 -eo pid,rss
    PID   RSS
2003213 3169144

Now the backend's memory usage appears to be around 3GB, even though the individual connection did not actually allocate much additional memory. The apparent increase is proportional to the amount of shared buffers touched:

postgres[2003213][1]=# SELECT pg_size_pretty(SUM(pg_relation_size(oid))) FROM pg_class WHERE relfilenode <> 0;
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 2995 MB        │
└────────────────┘
(1 row)

And even worse than that: if yet another connection also uses those pages, it too will show up as having huge memory usage:

postgres[3244960][1]=# SELECT sum(abalance) FROM pgbench_accounts ;
┌─────┐
│ sum │
├─────┤
│   0 │
└─────┘
(1 row)

andres@awork3:~/src/postgresql$ ps -q 3244960 -eo pid,rss
    PID   RSS
3244960 2700372

Of course postgres does not actually use 3 + 2.7 GiB of memory in this case. Instead, what is happening is that, with huge_pages=off, ps attributes the shared memory a connection has utilized, including the buffer pool, to each such connection, obviously leading to vastly over-estimated memory usage.

Huge pages accidentally kind of save the day

Many CPU micro-architectures normally use a page size of 4KiB, but can optionally use larger page sizes as well, most commonly 2MiB.

Depending on the operating system, its configuration, and the type of application, such larger pages can be utilized transparently by the operating system, or explicitly by applications. See e.g. the Debian wiki page about huge pages for some details.
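For reference, making explicit huge pages available to postgres on Linux takes roughly the following steps (the number of pages below is just an illustration sized for 16GB of shared_buffers, not the exact setup used for these measurements):

$ grep Huge /proc/meminfo                # is huge page support available, what is the page size?
$ sudo sysctl -w vm.nr_hugepages=8500    # reserve enough 2MiB pages for 16GB of shared_buffers, plus some slack

After that, setting huge_pages = on in postgresql.conf and restarting postgres makes the server use huge pages for its shared memory.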

Repeating the previous experiments with huge_pages=on makes them look a lot better. First, looking at a “new connection”:

andres@awork3:~$ ps -q 3245907 -eo pid,rss
    PID   RSS
3245907  7612

So, a new connection now appears to use only about 7MiB. This reduction in memory usage is caused by the page table needing less memory: with 2MiB instead of 4KiB pages it only needs to contain 1/512th of the previous entries.

And more importantly, the test where a lot of the memory is accessed:

postgres[3245843][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
…
postgres[3245851][1]=# SELECT sum(abalance) FROM pgbench_accounts ;
…

andres@awork3:~$ ps -q 3245907,3245974 -eo pid,rss
    PID   RSS
3245907 12260
3245974  8936

In contrast to above, these connections now appear to only use 12MiB and 9MiB respectively, where previously they used 3GiB and 2.7GiB. Quite the apparent change ;)

This is due to the way the use of larger pages is implemented in Linux, not because we used orders of magnitude less memory: the huge pages used just aren’t shown as part of the RSS column of ps/top.

Getting less unreal

Since Linux 4.5 the /proc/$pid/status file shows the memory usage split into finer sub-categories:

VmRSS		size of memory portions. It contains the three following parts (VmRSS = RssAnon + RssFile + RssShmem)
RssAnon		size of resident anonymous memory
RssFile		size of resident file mappings
RssShmem	size of resident shmem memory (includes SysV shm, mapping of tmpfs and shared anonymous mappings)

Looking at these stats with huge_pages=off:

andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3247901/status
RssAnon:	    2476 kB
RssFile:	    5072 kB
RssShmem:	    8520 kB
HugetlbPages:	       0 kB

postgres[3247901][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;

andres@awork3:~$ ps -q 3247901 -eo pid,rss
    PID   RSS
3247901 3167164

andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3247901/status
RssAnon:	    3148 kB
RssFile:	    9212 kB
RssShmem:	 3154804 kB
HugetlbPages:	       0 kB

RssAnon is the amount of “anonymous” memory, i.e. the process' own memory allocations. RssFile is the memory for mapped files, including the postgres binary itself. And lastly, RssShmem shows the accessed non-huge_pages shared memory.

This nicely shows that most of the high memory usage ps etc. report (3154804 kB out of the 3167164 kB RSS) is due to the shared memory accesses.

And huge_pages=on:

andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3248101/status
RssAnon:	    2476 kB
RssFile:	    4664 kB
RssShmem:	       0 kB
HugetlbPages:	  778240 kB

postgres[3248101][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;

andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3248101/status
RssAnon:	    3136 kB
RssFile:	    8756 kB
RssShmem:	       0 kB
HugetlbPages:    3846144 kB

Approximating Accuracy

Just adding up the non-shmem values still over-estimates memory usage. There are two main reasons:

First, it doesn’t actually make sense to include RssFile when measuring a postgres backend’s memory usage - for postgres those mappings are overwhelmingly just the postgres binary and the shared libraries it uses (postgres does not mmap() files). As nearly all of that is shared between all processes in the system, it’s not a per-connection overhead.
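It is easy to convince oneself of that by looking at which files a backend actually has mapped; something along these lines lists them:

$ pid=3247901    # any backend pid
$ awk '$6 ~ /\// {print $6}' /proc/$pid/maps | sort | uniq -c | sort -rn

The output is typically dominated by the postgres executable and the shared libraries that the dynamic loader maps into every process.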

Secondly, even just looking at RssAnon over-estimates memory usage. The reason is that ps measures the entire memory of the process, even though the majority of a new connection’s memory is shared with the supervisor process it was forked from. This is because Linux does not copy all memory when fork()ing a new process; instead it uses copy-on-write, only copying pages when they are modified.

There is no good way to accurately measure the memory usage of an individual forked process, but since version 4.14 the Linux kernel at least provides an approximation (commit with description) in a process' /proc/$pid/smaps_rollup file. Pss shows “the process’ proportional share of this mapping”, summed across all of a process' mappings (search Linux’s documentation page for smaps_rollup and Pss; it unfortunately does not have direct links). For memory shared between processes, the memory usage is divided by the number of processes using a mapping.


postgres[2004042][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
┌────────┐
│  sum   │
├────────┤
│ 383341 │
└────────┘
(1 row)

postgres[2004042][1]=# SHOW huge_pages ;
┌────────────┐
│ huge_pages │
├────────────┤
│ off        │
└────────────┘
(1 row)

andres@awork3:~$ grep ^Pss /proc/2004042/smaps_rollup
Pss:             3113967 kB
Pss_Anon:           2153 kB
Pss_File:           3128 kB
Pss_Shmem:       3108684 kB

Pss_Anon contains the memory allocated by the process itself, Pss_File covers the shared libraries etc. mapped into the process, and Pss_Shmem (when not using huge_pages) the shared memory usage, divided across all the processes that have touched the corresponding pages. Reassuringly, the three add up to the Pss total, modulo rounding.

What makes the proportional values imperfect is that the divisor depends on the number of connections to the server: the more processes have touched the shared memory, the smaller each process' share appears. Here I use pgbench (scale 1000, -S, -M prepared, -c 1024) to start a large number of connections:
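Something along the following lines starts such a run; the thread count and duration here are my own guesses, only the parameters mentioned above are from the actual setup:

$ pgbench -i -s 1000                         # one-time initialization at scale 1000
$ pgbench -n -S -M prepared -c 1024 -j 64 -T 300

With those connections established: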

postgres[2004042][1]=# SELECT count(*) FROM pg_stat_activity ;
┌───────┐
│ count │
├───────┤
│  1030 │
└───────┘
(1 row)

postgres[2004042][1]=# SELECT pid FROM pg_stat_activity WHERE application_name = 'pgbench' ORDER BY random() LIMIT 1;
┌─────────┐
│   pid   │
├─────────┤
│ 3249913 │
└─────────┘
(1 row)

andres@awork3:~$ grep ^Pss /proc/3249913/smaps_rollup
Pss:                4055 kB
Pss_Anon:           1185 kB
Pss_File:              6 kB
Pss_Shmem:          2863 kB

And with huge_pages=on:

andres@awork3:~$ grep ^Pss /proc/2007379/smaps_rollup
Pss:                1179 kB
Pss_Anon:           1173 kB
Pss_File:              6 kB
Pss_Shmem:             0 kB

The Pss values unfortunately do not account for resources that are not visible to the application itself, e.g. the size of the page table. The page table size is, however, visible in the aforementioned /proc/$pid/status file.

To my knowledge - but I am not certain - VmPTE (the page table size) is completely private to each process, whereas most other Vm* values, including the stack VmStk, are shared in a copy-on-write manner.

Using the above, the overhead due to the page table is, with huge_pages=off:

andres@awork3:~$ grep ^VmPTE /proc/2004042/status
VmPTE:      6480 kB

and with huge_pages=on:

VmPTE:	     132 kB
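The huge_pages=off number is roughly the expected order of magnitude: on a typical x86-64 system with 4KiB pages and 8 byte page table entries, just mapping the ~3GB of touched shared buffers takes on the order of 3GB / 4KiB × 8 bytes ≈ 6MiB of page table, whereas with 2MiB pages that part shrinks by a factor of 512.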

Of course there are some additional small per-process overheads in the kernel, but they are small enough not to be worth representing in the status file.

Conclusion

Based on the measurements above - adding Pss_Anon to VmPTE - we can approximate that a connection running a very simplistic read-only OLTP workload has an overhead of about 7.6MiB with huge_pages=off and about 1.3MiB with huge_pages=on.
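That estimate can be computed directly from the two proc files; a minimal sketch (the pid is whichever backend you want to look at):

$ pid=2004042    # backend pid to estimate
$ anon=$(awk '/^Pss_Anon/ {print $2}' /proc/$pid/smaps_rollup)
$ pte=$(awk '/^VmPTE/ {print $2}' /proc/$pid/status)
$ echo "$(( anon + pte )) kB"    # approximate per-connection overhead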

Even when leaving room for some “invisible” overhead, a larger amount of data in Postgres' buffer pool, etc., I think this backs my earlier claim that the overhead of a connection is less than 2MiB.