Troubleshooting Performance Issues Due to Disk and RAM

Quite a few managers or clients expect DBAs to type away a few commands and fix a sluggish PostgreSQL instance. If only things were that simple. Unfortunately, debugging a sluggish instance that was previously running optimally requires analysis at many levels. In this blog, I will attempt to help that debugging process on Linux by identifying tools, tips, and tricks you can use when digging into a performance issue. I thought I could cover all the relevant topics in a single blog, but there is far too much to write, so expect a series instead, with this blog as the first installment.

Remember, a lot of things can go wrong, and it’s not always the database at fault. Applications, networks, or other factors may be at play too. However, our approach will help identify the issue, or at least remove PostgreSQL’s name from the charge sheet.

The Suspects

If we assume that there is an issue with the database server, we must start by looking at the possible suspects that might be causing the trouble. Is it the hardware, the operating system, or the PostgreSQL instance? The list of suspects is fairly simple:

  • Operating System/Hardware
    • I/O or Disk
    • Memory
    • CPU
    • Network
  • PostgreSQL
    • Query
    • Locks
    • Checkpoint

The Approach

When trying to debug a problem, I tend to reduce the search space first, and PostgreSQL is no different. We’ll approach this problem by first eliminating the suspects that are innocent, and then interrogating the ones that remain.

Fortunately, some very basic and common Linux commands can help us do that. You can immediately identify and rule out some culprits:

  • Access: connectivity issues can easily be identified with the “ping”, “ssh”, or “psql” commands;
  • CPU Over-Utilization: issues can be seen via the “top” command. Alternatively, the “uptime” command gives you the 1-, 5-, and 15-minute load averages.
9:53  up 1 day, 22:24, 11 users, load averages: 1.32 2.27 3.08
  • Memory: usage can be seen via the “free” command.
  • Out of Disk Space: can be checked via the “df” command.
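A quick first pass over all four checks can be wrapped into a handful of commands. The snippet below is only a sketch; the mount point and connection parameters are assumptions you will need to adjust for your environment.

#!/bin/bash
# Quick triage sketch: access, CPU load, memory, and disk space in one pass.
uptime                                    # 1-, 5- and 15-minute load averages
free -m                                   # RAM and swap usage in MB
df -h /                                   # free disk space (assumes the data directory lives on the root filesystem)
psql -U postgres -c "SELECT version();"   # can we connect to the instance at all?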

If all looks good, the problem must be elsewhere. Let’s jump to the next level now.

Checking Disk Performance

There are rarely any cases in production where the database fits entirely into memory. More often than not, lots and lots of I/O is done, both reads and writes, particularly in the case of an OLTP setup. On Linux, you can easily install additional tools to debug and analyze the disk load. The following packages may be installed via “apt” or “yum”:

  • dstat
  • iotop
  • sysstat

The periodic statistics collection that sar relies on may not be enabled by default when you install the packages; check the documentation for your platform to enable it.
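As a rough sketch, installation and enabling the collection service look like this on Debian/Ubuntu and RHEL/CentOS; exact package and service names may differ between releases.

# Debian/Ubuntu
sudo apt install dstat iotop sysstat
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat   # turn on periodic data collection
sudo systemctl restart sysstat

# RHEL/CentOS
sudo yum install dstat iotop sysstat
sudo systemctl enable --now sysstat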

iostat

Although iostat also provides details about CPU usage, it is most useful here for understanding disk usage.

iostat -dmx sda 1

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.06     0.89    1.37   18.03     0.04     0.63    70.75     0.04    1.95    0.24    2.08   0.05   0.09

The extended statistics for the disk displayed above provide some valuable insights. While r/s, w/s, rMB/s, and wMB/s tell the read/write story, the values of particular interest are “avgqu-sz”, the wait columns, and “%util”.

If the queue size (“avgqu-sz”) has grown to a significantly high value, it is a clear indication that the storage device is struggling to serve requests. Similarly, the wait columns show the average time, in milliseconds, that a request spends waiting. Another indication that the storage device is overloaded is the “%util” value, which shows the percentage of time the drive was doing at least one thing. This works well for magnetic drives; however, it does not accurately reflect saturation for storage devices that can handle many requests in parallel, such as SSDs and RAID arrays. Even so, it is still a valuable indicator of whether the storage device has reached, or is nearing, saturation.
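As a simple illustration, the one-liner below samples a single device every five seconds and flags samples where “%util” crosses 90%. It assumes “%util” is the last column of the iostat output, which holds for the sysstat version shown above but is worth double-checking on your system.

# Warn whenever sda reports more than 90% utilization; stdbuf avoids pipe buffering delays.
stdbuf -oL iostat -dmx sda 5 | awk '$1 == "sda" && $NF+0 > 90 { print "WARNING: sda %util =", $NF }'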

sar

sar, short for System Activity Report, is a very useful tool that is part of the sysstat package. It has tons of options that you can use to monitor resource load. Of particular interest, however, is the ability to analyze a log generated by the sysstat package (or to see the load for a specific time period by specifying start and end times).

With logging enabled, from a database perspective, it gives you the power to understand how the system was behaving at a particular time by evaluating the log file. Say the system was misbehaving at night, from 2 am to 2:30 am on the 3rd of February.

$ sar -f /var/log/sa/sa03 -b -s 02:00:00 -e 02:30:00 -r -S
Linux 3.10.0-1160.11.1.el7.x86_64 (dr_x) 02/03/2021 _x86_64_ (32 CPU)

02:00:02 AM       tps      rtps      wtps   bread/s   bwrtn/s
02:10:01 AM      1.30      0.11      1.19      0.55     16.93
02:20:01 AM      1.09      0.06      1.04      0.06     14.13
Average:         1.20      0.08      1.11      0.30     15.53

02:00:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
02:10:01 AM 713981872  78383572      9.89      1084  30076480  33264684      4.18  17525664  22328076        32
02:20:01 AM 713978052  78387392      9.89      1084  30077248  33260784      4.18  17529048  22328088        36
Average:    713979962  78385482      9.89      1084  30076864  33262734      4.18  17527356  22328082        34

02:00:02 AM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
02:10:01 AM   4194300         0      0.00         0      0.00
02:20:01 AM   4194300         0      0.00         0      0.00
Average:      4194300         0      0.00         0      0.00

You can use sar to see usage statistics for RAM, disk, CPU, swap, and network devices.
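For reference, these are the standard sar switches for each of those resources; each example takes three samples five seconds apart.

sar -u 5 3        # CPU utilization
sar -r 5 3        # memory utilization
sar -S 5 3        # swap utilization
sar -d -p 5 3     # per-device disk activity (-p prints device names rather than dev numbers)
sar -n DEV 5 3    # network interface statistics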

dstat

This is the best-looking tool of the lot, as you can see from its output:

$ dstat -cdngy
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1   1  99   0   0   0|  43k  640k|   0     0 |   0     0 |1830  3988 
  0   0 100   0   0   0|   0     0 |   0     0 |   0     0 | 720  2531 
  0   0 100   0   0   0|   0   188k| 354B 1152B|   0     0 | 799  2120 
  0   0 100   0   0   0|1024B   15k| 140B  338B|   0     0 | 799  2761 
  0   0 100   0   0   0|   0  8192B|  70B  362B|   0     0 | 683  1974 
  0   0 100   0   0   0|   0     0 |  70B  370B|   0     0 | 744  2069 
  0   0 100   0   0   0|   0     0 |  70B  362B|   0     0 | 747  2669

I won’t dig more deeply into this tool here, but feel free to experiment with it.
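One combination worth trying, assuming your dstat build ships the plugins, adds the processes responsible for the most I/O to the standard view:

# Standard CPU/disk/network/paging/system columns plus the top I/O and block I/O processes, every 5 seconds
dstat -cdngy --top-io --top-bio 5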

Understanding RAM

Generally, “free” is enough to tell us about memory usage. However, /proc/meminfo offers more detail, and there are some subtleties there that are good to know and can help when debugging a performance issue.

Here is a simple script that prints out useful memory information:

#!/bin/bash
grep -A3 "MemTotal" /proc/meminfo
grep "Swap" /proc/meminfo
grep -A1 "Dirty\|Active" /proc/meminfo

On this system, the script produces the following output:

MemTotal:       792365444 kB
MemFree:        711360772 kB
MemAvailable:   743213224 kB
Buffers:            1084 kB
SwapCached:            0 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Active:         18503648 kB
Inactive:       23735296 kB
Active(anon):   10553220 kB
Inactive(anon):   534376 kB
Active(file):    7950428 kB
Inactive(file): 23200920 kB
--
Dirty:                56 kB
Writeback:             0 kB

Some of the key values to note are “Dirty”, “Writeback”, “Active(file)”, “Inactive(file)”, “Active(anon)”, and “Inactive(anon)”. “Dirty” shows the amount of memory waiting to be written to disk, whereas “Writeback” shows the memory actively being written to disk. “Active(file)” and “Inactive(file)” make up the OS file cache. You want the file cache to be as large as possible, as that reduces I/O and thereby significantly improves performance.

Another key set of items to note is the “anon” memory. In a PostgreSQL context, this grows as the number of client connections increases. So the higher the “Active(anon)” and “Inactive(anon)” values, the less memory is left for the OS to cache files. Therefore, keep a close watch on these values, or alternatively, use a connection pooler.
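One way to keep that watch is a small loop that logs anonymous memory next to the current connection count. This is only a sketch; it assumes local access as the postgres user, so adjust the connection options for your setup.

#!/bin/bash
# Log anonymous memory (kB) alongside the PostgreSQL connection count every 10 seconds.
while true; do
    anon=$(grep "^Active(anon)\|^Inactive(anon)" /proc/meminfo | awk '{ sum += $2 } END { print sum }')
    conns=$(psql -U postgres -At -c "SELECT count(*) FROM pg_stat_activity;")
    echo "$(date +%T) anon_kB=$anon connections=$conns"
    sleep 10
done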

If you have enabled huge pages and configured them for PostgreSQL, you might want to add a grep statement for them as well. Do remember to disable Transparent Huge Pages, though, as they create a memory allocation overhead that slows PostgreSQL down.
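Both are easy to verify: the huge page counters live in /proc/meminfo, and the Transparent Huge Pages setting is exposed under /sys.

grep -i "^hugepage" /proc/meminfo                  # explicit huge page allocation and usage
cat /sys/kernel/mm/transparent_hugepage/enabled    # "[never]" means THP is disabled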

For more details about /proc/meminfo, refer to its documentation here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-meminfo

Conclusion

Although this blog has focused on disk and RAM issues, the tools mentioned here can also be used to evaluate CPU and network utilization. Performance degradation caused by these resources significantly impacts TPS, and the information provided should enable you to debug capacity or configuration problems with the hardware or the OS.

The next part of this series will cover specific extensions and changes to the PostgreSQL configuration that help with troubleshooting from within PostgreSQL. Eventually, we’ll go through optimization and tuning of a PostgreSQL cluster.