Speeding up GROUP BY in PostgreSQL

03.2019 / Category: Performance / Tags: tuning

In SQL the GROUP BY clause groups records into summary rows and turns large amounts of data into a smaller set. GROUP BY returns one records for each group. While most people know how to use GROUP BY not many actually know how to squeeze the last couple of percentage points out of the query. There is a small optimization, which can help you to speed up things by a couple of percents quite reliably. If you want to speed up GROUP BY clauses, this post is for you.

Table of Contents

Creating a test data set in PostgreSQL

To prepare ourselves for the aggregation we first have to generate some data:

test=# CREATE TABLE t_agg (x int, y int, z numeric);
CREATE TABLE
test=# INSERT INTO t_agg SELECT id % 2, id % 10000, random() 
	FROM 	generate_series(1, 10000000) AS id;
INSERT 0 10000000

1

2

3

4

5

test=# CREATE TABLE t_agg (x int, y int, z numeric);

CREATE TABLE

test=# INSERT INTO t_agg SELECT id % 2, id % 10000, random()

FROM generate_series(1, 10000000) AS id;

INSERT 0 10000000

The interesting part is that the first column only has 2 distinct values while the second column will contain 10.000 different values. That is going to be of great importance for our optimization efforts.

Let us VACUUM the table to set hint bits and to build optimizer statistics. To make those execution plans more readable I also decided to turn off parallel queries:

test=# VACUUM ANALYZE ;
VACUUM
test=# SET max_parallel_workers_per_gather TO 0;
SET

1

2

3

4

test=# VACUUM ANALYZE ;

VACUUM

test=# SET max_parallel_workers_per_gather TO 0;

SET

Running an aggregation

Now that the is in place the first tests can be started:

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 1, 2;
                                               QUERY PLAN                                                         
--------------------------------------------------------------------------------------------
 HashAggregate  (cost=238697.01..238946.71 rows=19976 width=40) 
		(actual time=3334.320..3339.929 rows=10000 loops=1)
   Group Key: x, y
   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) 
		(actual time=0.058..636.763 rows=10000000 loops=1)
 Planning Time: 0.399 ms
 Execution Time: 3340.483 ms
(5 rows)

1

2

3

4

5

6

7

8

9

10

11

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 1, 2;

QUERY PLAN

--------------------------------------------------------------------------------------------

HashAggregate (cost=238697.01..238946.71 rows=19976 width=40)

(actual time=3334.320..3339.929 rows=10000 loops=1)

Group Key: x, y

-> Seq Scan on t_agg (cost=0.00..163696.15 rows=10000115 width=19)

(actual time=0.058..636.763 rows=10000000 loops=1)

Planning Time: 0.399 ms

Execution Time: 3340.483 ms

(5 rows)

PostgreSQL will read the entire table sequentially and perform a hash aggregate. As you can see most of the time is burned by the hash aggregate (3.3 seconds minus 636 milliseconds). The resultset contains 6000 rows. However, we can do better. Keep in mind that the first column does not contain as many different values as the second column. That will have some implications as far as the hash aggregate is concerned. Let us try to play around with the GROUP BY clause

Changing aggregation order can improve performance

Let us run the same query again. But this time we won’t use “GROUP BY x, y” but instead use “GROUP BY y, x”. The result of the statement will be exactly the same as before (= 10.000 groups). However, the slightly modified query will be faster:

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2, 1;
                                                        QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------
 HashAggregate  (cost=238697.01..238946.71 rows=19976 width=40) 
		(actual time=2911.989..2917.276 rows=10000 loops=1)
   Group Key: y, x
   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) 
		(actual time=0.052..580.747 rows=10000000 loops=1)
 Planning Time: 0.144 ms
 Execution Time: 2917.706 ms
(5 rows)

1

2

3

4

5

6

7

8

9

10

11

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2, 1;

QUERY PLAN

-----------------------------------------------------------------------------------------------------

HashAggregate (cost=238697.01..238946.71 rows=19976 width=40)

(actual time=2911.989..2917.276 rows=10000 loops=1)

Group Key: y, x

-> Seq Scan on t_agg (cost=0.00..163696.15 rows=10000115 width=19)

(actual time=0.052..580.747 rows=10000000 loops=1)

Planning Time: 0.144 ms

Execution Time: 2917.706 ms

(5 rows)

Wow, the query has improved considerably. We saved around 400ms, which is a really big deal. The beauty is that we did not have to rearrange the data, change the table structure, adjust memory parameters or make any other changes to the server. All I did was to change the order in which PostgreSQL aggregated the data.

Which conclusions can developers draw from this example? If you are grouping by many different columns: Take the ones containing more distinct values first and group by the less frequent values later. It will make the hash aggregate run more efficiently in many cases. Also try to make sure that work_mem is high enough to make PostgreSQL trigger a hash aggregate in the first place. Using a hash is usually faster than letting PostgreSQL use the “group aggregate”.

It is very likely that future versions of PostgreSQL (maybe starting with PostgreSQL 12?) will already do this kind of change automatically. A patch has already been proposed by Teodor Sigaev and I am quite confident that this kind of optimization will make it into PostgreSQL 12. However, in the meantime it should be easy to make the change by hand and enjoy a nice, basically free speedup.

If you want to learn more about GROUP BY, aggregations and work_mem in general, consider checking out my blog post about this topic. On behalf of the entire team I wish everybody “happy performance tuning”. If you want to learn more about aggregation and check out Teodor Sigaev's patch, check out the PostgreSQL mailing list.

If you want to learn more about performance tuning, advanced SQL and so on, consider checking out one of our posts about window functions and analytics.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes

Article Rating

Subscribe

5 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Luca Ferrari

5 years ago

One thing I suspect you should mention is that this is assuming the cache is empty. I mean, if you execute both explains with different grouping, repeatadly, chances are times are going to become really close to each other because buffers have cached the data pages.

0

Reply

Michael Ball

5 years ago

While one is faster, the question is _why_? From reading the query plans I don't understand why there's a difference at the DB level. (I suppose OS factors could make a difference.)

0

Reply

tokers

5 years ago

Hi!

I executed the same operations on my local machine, but there the cost time is very close, about tens of milliseconds.

test=# SELECT version(); version ------------------------------------------------------------------------------------------------------------------------------ PostgreSQL 10.5 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 8.2.1 20190103 [gcc-8-branch revision 267549], 64-bit


test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY x, y;

                                                        QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=238697.01..238948.09 rows=20086 width=40) (actual time=3643.140..3649.182 rows=10000 loops=1)

   Group Key: x, y

   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.059..608.040 rows=10000000 loops=1)

 Planning time: 0.335 ms

 Execution time: 3649.622 ms

(5 rows)

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY y,x; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=238697.01..238948.09 rows=20086 width=40) (actual time=3665.056..3671.021 rows=10000 loops=1) Group Key: y, x -> Seq Scan on t_agg (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.035..608.184 rows=10000000 loops=1) Planning time: 0.084 ms Execution time: 3671.382 ms (5 rows) (1 row)

0

Reply

David Ignjić

5 years ago

Reply to tokers

I just test in 9.2 no difference:
employeeapp=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 1,2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=238693.37..238942.20 rows=19906 width=19) (actual time=14005.332..14012.209 rows=10000 loops=1)
-> Seq Scan on t_agg (cost=0.00..163694.07 rows=9999907 width=19) (actual time=0.041..1043.523 rows=10000000 loops=1)
Total runtime: 14012.770 ms
(3 rows)

employeeapp=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2,1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=238693.37..238942.20 rows=19906 width=19) (actual time=14229.643..14236.295 rows=10000 loops=1)
-> Seq Scan on t_agg (cost=0.00..163694.07 rows=9999907 width=19) (actual time=0.019..1063.704 rows=10000000 loops=1)
Total runtime: 14236.875 ms
(3 rows)

employeeapp=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2,1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=238693.37..238942.20 rows=19906 width=19) (actual time=14462.632..14471.046 rows=10000 loops=1)
-> Seq Scan on t_agg (cost=0.00..163694.07 rows=9999907 width=19) (actual time=0.026..1084.405 rows=10000000 loops=1)
Total runtime: 14471.664 ms
(3 rows)

employeeapp=# select version();
version
----------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.24 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit
(1 row)

0

Reply

Luca Ferrari

5 years ago

Reply to David Ignjić

Hi tested against 11.1 and it works as described:

testdb=> explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 1, 2; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=238697.01..238948.39 rows=20110 width=40) (actual time=4648.597..4655.752 rows=10000 loops=1) Group Key: x, y -> Seq Scan on t_agg (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.653..1115.263 rows=10000000 loops=1) Planning Time: 2.039 ms Execution Time: 4656.669 ms (5 rows)


testdb=> explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2, 1;

                                                        QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=238697.01..238948.39 rows=20110 width=40) (actual time=4373.100..4379.107 rows=10000 loops=1)

   Group Key: y, x

   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=1.312..981.379 rows=10000000 loops=1)

 Planning Time: 0.219 ms

 Execution Time: 4380.018 ms

(5 rows)

testdb=> show server_version; server_version ---------------- 11.1

0

Reply