will_paginate is one of those gems that always makes me smile. Back in the dark ages of Spring MVC & Hibernate I specifically remember being part of an estimation that decided that adding pagination to an admin page would take 2 days. Today that seems crazy of course, but I don’t think it was a bad estimate at the time since there really wasn’t any built-in support for that. Fast forward to now and as we all know pagination is one of those things that take about 2 minutes to implement. Progress!

So pagination is great and a pretty much solved problem, but it’s always irked me that it requires 2 very similar queries per request. Is there anything to be done about that? Let’s see.

Pagination at the database level should be very simple. All it requires is adding a LIMIT and OFFSET to your query. Of course if you add a LIMIT to your query, you can no longer simply count the number of rows in order to get the total number of matching results and we’ve got to know how many total matches there are so that we can calculate the total number of pages. So it seems like we’re forced to query for that separately and that’s exactly what will_paginate does. Here’s a simple example from the PatientsLikeMe patient search.

SELECT count(*) FROM users WHERE users.role_id=1 and users.verified is true 
and users.deleted is false and users.birth_date < now() - interval '20 year' 
and users.sex in ('M') AND users.last_update_time IS NOT NULL
SELECT users.id FROM users WHERE users.role_id=1 and users.verified is true 
and users.deleted is false and users.birth_date < now() - interval '20 year' 
and users.sex in ('M') AND users.last_update_time IS NOT NULL 
ORDER BY users.last_update_time desc LIMIT 15 OFFSET 0

Two of those simple queries are no big deal, but as you start to add disease specific search criteria, you'll find that the queries can get a lot more complex very quickly.

SELECT count(*) FROM users WHERE 
     users.role_id=1 
     and users.verified is true 
     and users.deleted is false 
     and users.birth_date < now() - interval '20 year' 
     and users.sex in (E'M') 
     and users.id in ( SELECT ci.user_id FROM condition_infos ci WHERE ci.condition_id = E'4' 
     and first_symptom_date between '1911-04-06' and '2004-04-06') 
     and users.id in ( SELECT ci.user_id FROM condition_infos ci WHERE ci.condition_id = E'4' 
     and first_definitive_diagnosis_date between '1992-04-06' and '2111-04-06')       
     and users.id in ( WITH latest_outcome as (select * from (select *, rank() 
     over (PARTITION BY user_id order by date desc, id desc) from pdrs_surveys) ranked where rank = 1) 
     select user_id from latest_outcome where updrs_score >= 10 ) 
     and users.condition_ids @> '{4}' 
     and users.last_update_time IS NOT NULL

Footnote: note that I get that this query is not perfect, but do keep in mind that this comes out of a dynamic query generator and is not hand-written.

Having to run that query twice starts to affect response time. But what’s an engineer to do? Is it possible to easily make Postgres return both the limited results and the overall count in one go? It sure is! Common Table Expressions and Window Functions to the rescue.

Common Table Expressions are available in Postgres 8.4 and above and are just about my favorite SQL hammer. I find they allow for much much cleaner SQL when the going gets rough, because they allow you to separate elements of a query into small compassable parts.

Window Functions are brilliant as well and they’re going to do the heavy-ish lifting in our one-hit SQL pagination query. Generally they’re used in order to calculate something over a subset of results, but the trick that makes the following code work is they’re also happy to run over the entire result set by just not specifying a PARTITION.

Combining these two tricks together, we can create the following 1-shot query to replace to two queries we started with.

WITH sql_query as (SELECT users.id FROM users 
     WHERE users.role_id=1 and users.verified is true 
     and users.deleted is false and users.birth_date < now() - interval '20 year' 
     and users.sex in (E'M') 
     AND users.last_update_time IS NOT NULL ORDER BY users.last_update_time desc ) 
select count(*) OVER () as total_entries, * from sql_query LIMIT 15 OFFSET 0

This returns all of the userids, but adds a column to each row that contains the totalentries.

So what’s the performance improvement? Well for the hairy query above, on my local machine I get:

Count query: ~157ms UserID query: ~165ms

vs

Combined Query: ~164ms

That’s free as in beer!

The last step is to integrate this with willpaginate. I’ve only taken a quick stab at that so far but it was simple to implement. All we need to do is wrap the pre-existing query within our CTE, run the query and then extract value of the totalentries column from any of the rows. A full willpaginate speedup would require detecting Postgres and implementing this for the other finders instead of just paginateby_sql.

I hope you enjoyed this episode of Mav hits another SQL query with his Common Table Expression hammer. Stay tuned for “OMG You did that INSIDE! the JOIN”.