Saturday 24 January 2015

Read Scalability in PostgreSQL 9.5


In PostgreSQL 9.5, we will see a boost in scalability for read workloads
when the data fits in RAM.  I ran a pgbench read-only load to compare the
performance between 9.4 and HEAD (commit 62f5e447) on an IBM POWER-8
machine with 24 cores, 192 hardware threads, and 492GB of RAM, and here
is the performance data:

[Graph: pgbench read-only TPS vs. client count, PostgreSQL 9.4 vs. HEAD, at scale factors 300 and 1000]

The data is taken mainly for two kinds of workloads: when all the data fits
in shared buffers (scale_factor = 300) and when all the data can't fit in
shared buffers but can fit in RAM (scale_factor = 1000).

First, let's talk about the 300 scale factor case.  In 9.4 it peaks at 32 clients;
now it peaks at 64 clients, and we can see a performance improvement of
up to ~98%, with better numbers at all higher client counts starting from
32 clients.  The main work that led to this improvement is
commit ab5194e6 (Improve LWLock scalability).  The previous implementation
had a bottleneck around the spinlocks acquired for LWLock
acquisition and release; the 9.5 implementation changes LWLock to use
atomic operations to manipulate its state.
Thanks to Andres Freund, the author of this patch, due to whom many
PostgreSQL users will be happy (and, in my opinion, credit also goes to the
reviewers, Robert Haas and myself, who reviewed multiple versions of
this patch).
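
To make the idea concrete, here is a minimal sketch using C11 atomics.  The names, bit layout, and logic are illustrative assumptions, not the actual PostgreSQL LWLock code; it only shows how a shared acquisition can become a single compare-and-swap on a state word instead of a spinlock-protected counter update.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-word lock state: one exclusive flag bit plus a
 * shared-holder count in the remaining bits. */
#define LOCK_EXCLUSIVE  (1u << 31)

typedef struct
{
    _Atomic uint32_t state;
} lw_lock;

/* Try to acquire the lock in shared mode with a CAS loop; no spinlock. */
static bool
lw_lock_shared_try(lw_lock *lock)
{
    uint32_t old = atomic_load(&lock->state);

    while ((old & LOCK_EXCLUSIVE) == 0)
    {
        /* No exclusive holder: try to bump the shared-holder count. */
        if (atomic_compare_exchange_weak(&lock->state, &old, old + 1))
            return true;        /* acquired */
        /* CAS failed; 'old' now holds the refreshed state, so retry. */
    }
    return false;               /* exclusive holder present; caller must wait */
}

static void
lw_lock_shared_release(lw_lock *lock)
{
    atomic_fetch_sub(&lock->state, 1);  /* drop one shared holder */
}
```

The gist of the change is that an uncontended shared acquire or release becomes a single atomic operation rather than a spinlock acquire/release pair around the state update.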

Now let's discuss the 1000 scale factor case.  Here we see a good
performance improvement (~25%) even at 32 clients, rising to ~96% at
higher client counts; as in the previous case, where 9.4 was peaking at
32 clients, 9.5 now peaks at 64 clients, and performance is better at all
higher client counts.  The main work
that led to this improvement is commit 5d7962c6 (Change locking
regimen around buffer replacement) and commit 3acc10c9 (Increase
the number of buffer mapping partitions to 128).  In this case there were
mainly two bottlenecks: (a) the BufFreeList LWLock was acquired to
find a free buffer for a page (to find a free buffer, we need to run the
clock sweep), which becomes a bottleneck when many clients try to perform the
same action simultaneously; and (b) to change the association of a buffer in the
buffer mapping hash table, an LWLock is acquired on the hash partition to
which the buffer belongs, and as there were just 16
such partitions, there was huge contention when multiple clients operated
on the same partition.  To reduce bottleneck (a), 9.5 uses
a spinlock which is held just long enough to pop the freelist or advance
the clock sweep hand, and is then released.  If we need to advance the
clock sweep further, we reacquire the spinlock once per buffer.  To reduce
bottleneck (b), the number of buffer mapping partitions was increased to 128.  The crux
of this improvement is that both bottlenecks (a and b) had to be resolved
together to see a major gain in scalability.  The initial patch for
this improvement was prepared by me; Robert Haas then extracted the
important part of the patch and committed it.  Many thanks to both
Robert Haas and Andres Freund, who not only reviewed the patch but also
gave a lot of useful suggestions during this work.
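
As a rough illustration of the fix for bottleneck (a), here is a minimal, self-contained sketch; the names and the use of a pthread spinlock are assumptions for illustration, not the actual PostgreSQL buffer manager code.  The key property is that the lock protecting the sweep hand is held only long enough to advance it by one buffer, instead of for the entire victim-buffer search.

```c
#include <pthread.h>

typedef struct
{
    pthread_spinlock_t lock;    /* protects next_victim */
    int next_victim;            /* clock sweep hand */
    int nbuffers;               /* total number of shared buffers */
} strategy_state;

/* Return the next candidate buffer id.  The caller examines that buffer's
 * usage count without holding the lock and, if the buffer is not evictable,
 * calls this function again, reacquiring the spinlock once per buffer. */
static int
clock_sweep_next(strategy_state *s)
{
    int victim;

    pthread_spin_lock(&s->lock);
    victim = s->next_victim;
    if (++s->next_victim >= s->nbuffers)
        s->next_victim = 0;
    pthread_spin_unlock(&s->lock);

    return victim;
}
```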


During the work on improvements in buffer management, I noticed that
the next big bottleneck that could buy us a reasonably good improvement
in read workloads is in the dynamic hash tables used to manage shared buffers,
so improving the concurrency of those hash tables could further
improve read operations.  There has been some discussion about using a
concurrent hash table for shared buffers (a patch by Robert Haas), but
it has not yet materialized.
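
For context, the buffer mapping hash table mentioned above is protected by partitioned locks.  Here is a minimal sketch of how a lookup picks its lock based on the key's hash; the names are hypothetical and it uses a pthread rwlock per partition rather than PostgreSQL's LWLocks.  It also shows why raising the partition count from 16 to 128 helps: contention is spread across more locks.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_PARTITIONS 128              /* was 16 before commit 3acc10c9 */

typedef struct
{
    pthread_rwlock_t lock[NUM_PARTITIONS];
    /* ... per-partition hash buckets would live here ... */
} partitioned_hash;

/* Map a key's hash value to the partition lock that guards its bucket;
 * lookups take it in shared mode, inserts/deletes in exclusive mode. */
static pthread_rwlock_t *
partition_lock(partitioned_hash *h, uint32_t hashvalue)
{
    return &h->lock[hashvalue % NUM_PARTITIONS];
}
```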

17 comments:

  1. Did you also test on a widely used platform like Intel Xeon in combination with Linux? Or is the performance improvement (98%) the same on a different platform?

  2. I have not done the testing on Intel as I don't have access to a good box; however, during the development of these features they were tested by other developers in other environments, one of which, as far as I remember, was Intel, and we saw similar benefits on that platform. So a similar performance boost (as mentioned in this post) is expected on Intel.

  3. Thanks for this article,

    Is there an existing read scalability comparison between PostgreSQL 9.2 and 9.4?
    My production system is using 9.2, and I'd like to know the possible gain compared with 9.2.

    Thanks in advance,
    Thomas

    Replies
    1. The gain as compared to 9.2 should be at least as much as mentioned in this blog. As such, I don't have scalability comparison data between 9.2 and 9.4. By the way, how much gain do you expect, and isn't the gain shown in this blog sufficient for your use case?

    2. Thank you Amit for your feedback.

      I have upgraded different systems in the past years, and even if upgrading always offers opportunities to get new functionality or faster software,
      I have also encountered hindrances in upgrades, with worse behaviour on the updated systems than when the former version was running.
      That was the reason for my question: is there a performance gap on read scalability between 9.2 & 9.4?

      Thanks for your clarification

    3. I can understand the difficulties you mention around upgrading. Ideally you can try benchmarking both 9.2 and 9.4 to see the gap; however, I think you won't see as major a boost in read scalability as you will see in 9.5. Apart from this, between 9.2 and 9.4 there is certainly some work related to improvements for write workloads (at this moment I don't have numbers), including WAL reduction for the UPDATE operation, if that is of interest to you.

  4. Was this benchmark run on an AIX or Linux OS?

  5. Another possibility is to use Read-Copy-Update (RCU), which achieves near-perfect scalability for read-mostly workloads. Was RCU considered for this case?

    Replies
    1. I think RCU is used to prevent readers from blocking writers and vice versa, which is already handled in PostgreSQL (it retains the old version of a row while performing a write operation on it, and the write creates a new version). Do you have something else in mind?

    2. Yes, I was referring to writers operating on the new version while retaining the old version. The old version is retained until all pre-existing readers referring to the old version are complete. With RCU, even the atomic operation used to manipulate the LWLock state can be eliminated when acquiring the lock in shared mode.

    3. The LWLock is used to protect shared data structures like the partitioned hash table and others; it is not used in the row versioning scheme. Would you care to explain in more detail what you have in mind related to RCU?

    4. Hey sorry for the confusion.

      From what I understand, the shared data structures in PostgreSQL are protected by locks, and they need to be for concurrent access. In version 9.5, the LWLock implementation uses atomic operations instead of spinlocks. A performance improvement is expected with such changes and is evident from the experiments.

      With such a change to LWLock, the locking overhead for both readers and writers is reduced to an atomic operation. However, atomic operations are costly on large multicore systems, and hence the point I am making is: can we eliminate the locking overhead of atomic operations for readers?

      It seems possible to eliminate atomic operations for readers (though not for writers) by using the RCU synchronization mechanism. However, there are caveats associated with RCU. So my questions: Is RCU applicable to the scenario mentioned in the blog? Was RCU considered for improving read scalability?

      Related information can be found in the following link: www2.rdrop.com/~paulmck/RCU/urcutorture.2009.01.22a.pdf

    5. An interesting idea; however, the first concern is that it might not be portable (e.g., to Windows), and it seems to me that it doesn't help, or even adds overhead, for write loads; if there is any overhead for write operations, then it doesn't seem to be a good idea, as that would not be a minority scenario. As for your question regarding whether this was considered: as far as I know, it was not discussed in the community during development of this feature.

  6. Does this patch depend on the hardware transactional memory feature in Power-8 and recent Xeon CPUs?

    Replies
    1. None of the patches mentioned in this blog depend on the hardware transactional memory feature in Power-8.
