pg_analytics: Transforming Postgres into a Fast OLAP Database
Written by Ming Ying on February 1, 2024
[Chart: Relative query time, measured by ClickBench. Lower scores are better.]
We’re excited to introduce pg_analytics, an extension that accelerates the native analytical performance of any Postgres database[1] by 94x. With pg_analytics installed, Postgres is 8x faster than Elasticsearch and nearly ties ClickHouse on analytical benchmarks[2].
Today, developers who store billions of data points in Postgres struggle with slow query times and poor data compression. Even with database tuning, complex analytical queries (e.g. counts, window functions, string aggregations) can take anywhere from minutes to hours. Many organizations turn to an external analytical data store like Elasticsearch as a result. This increases operational complexity as data becomes siloed and engineers must learn to use a new database.
By speeding up analytical queries directly inside Postgres, pg_analytics is a drop-in solution for analytics without the need to extract, transform, and load (ETL) data into another system. In this post, we share how pg_analytics was built and why now is an ideal time to build a Postgres-based analytical database.
How It Works
Regular Postgres tables, known as heap tables, organize data by row. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table.
ParadeDB introduces a new kind of table called the parquet table. parquet tables behave like regular Postgres tables but use a column-oriented layout via Apache Arrow and leverage Apache DataFusion, a query engine optimized for column-oriented data. This means that users can choose between row- and column-oriented storage at table creation time.
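As a sketch, the per-table storage choice looks like this (table and column names are hypothetical; heap is the built-in Postgres default):

```sql
-- Row-oriented heap table: the Postgres default, suited to transactional workloads
CREATE TABLE orders_oltp (id int, total numeric) USING heap;

-- Column-oriented parquet table: suited to analytical scans
CREATE TABLE orders_olap (id int, total numeric) USING parquet;
```

The USING clause is standard Postgres syntax for selecting a table access method, so existing tooling and queries work unchanged against either table.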
Arrow and DataFusion are integrated with Postgres via two features of the Postgres API: the table access method and executor hooks. The table access method registers parquet tables with the Postgres catalog and handles data manipulation language (DML) statements like inserts. Executor hooks intercept and reroute queries to DataFusion, which parses the query, constructs an optimal query plan, executes it, and returns the results to Postgres.
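To illustrate the first mechanism: custom table access methods are registered in the pg_am system catalog, so after installing the extension you would expect to see the new method listed alongside the built-in heap method (the output shown is illustrative, not captured from a live session):

```sql
-- List table access methods (amtype = 't' denotes a table access method)
SELECT amname FROM pg_am WHERE amtype = 't';
-- Expect rows such as:
--   heap
--   parquet
```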
Data is persisted to disk with Parquet, a highly compressed file format for column-oriented data. Thanks to Parquet, ParadeDB achieves 5x better compression than both regular Postgres and Elasticsearch.
[Chart: Compressed data size (GB) of a 75.56GB CSV file loaded into each database.]
The final dependency is delta-rs, a Rust-based implementation of Delta Lake. This library adds ACID transactions, updates and deletes, and file compaction to Parquet storage. It also supports querying over data lakes like S3, which opens the future possibility of connecting Postgres tables to cloud data lakes.
Why DataFusion
Building a state-of-the-art analytical database in Postgres is a task that’s expensive and difficult to get right.
One of the first Postgres-based analytical databases, Greenplum, was released in 2005. Since then, several companies like Citus and Timescale have built similar products. However, the performance gap between these databases and their non-Postgres, OLAP counterparts is wide. This is one reason that systems like Elasticsearch are popular even among companies that prefer Postgres.
Recently, embeddable query engines like DataFusion have changed the game by surpassing the query speed of many OLAP databases. DataFusion opens the door to excellent analytical performance from any database, including Postgres.
Andy Pavlo, professor of databases at Carnegie Mellon, was right. Today, we’ve reached a point where it does not make sense to build a query engine from scratch within a database. Instead, the next generation of analytical databases should integrate existing, embeddable query engines[3] like DataFusion that can continuously improve the database as the engine itself improves.
Getting Started
At the time of writing, pg_analytics is open source and in an MVP state. Almost all Postgres queries and basic operations like inserts and vacuums are supported. Our roadmap can be found in the project README.
The easiest way to try pg_analytics is by running the ParadeDB Docker image. Once connected, you can follow this toy example.
CREATE EXTENSION pg_analytics;
-- Create a parquet table
CREATE TABLE t (a int) USING parquet;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a parquet table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;
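Building on the toy example, an analytical query of the kind the extension targets might look like this (the table and data are hypothetical, used only for illustration):

```sql
-- A hypothetical parquet table holding analytical data
CREATE TABLE sales (region text, amount numeric) USING parquet;
INSERT INTO sales VALUES ('us', 100), ('us', 250), ('eu', 75);

-- Aggregates and window functions over parquet tables are rerouted to DataFusion
SELECT region,
       SUM(amount) AS total,
       RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
FROM sales
GROUP BY region;
```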
The core ParadeDB team is focused on making pg_analytics production-ready. Combined with pg_search, our Postgres extension for full-text search, pg_analytics aims to make ParadeDB the best Elasticsearch alternative.
We welcome community contributions and are active on Slack. Finally, please don’t hesitate to show your support by giving us a star!
Footnotes
1. pg_analytics is compatible with Postgres 12+ and can be installed on any self-hosted Postgres instance.
2. According to ClickBench, a benchmarking tool for analytical databases.
3. We also evaluated DuckDB, Polars, and Velox as candidates for an embedded query engine. DuckDB is a popular in-process OLAP database, Polars is a dataframe processing library built on Arrow, and Velox is a query execution library built by Meta. DataFusion was chosen for three reasons. First, it interoperates with a storage framework like Delta Lake, which provides essential properties like ACID transactions. Second, it was designed to be embedded as a query engine inside another database, unlike standalone systems like Polars and DuckDB. Finally, it’s written in Rust and comes with a query parser and optimizer, unlike Velox.