New Postgres superpowers in Hyperscale (Citus) with Citus 10
Published May 08 2021 09:44 AM
Microsoft

PostgreSQL is an excellent database for a wide range of workloads. Traditionally, the one problem with Postgres has been that it is limited to a single machine. If you are using the Azure Database for PostgreSQL managed service, that limitation no longer applies to you, because the built-in Hyperscale (Citus) option lets you transparently shard and scale out both transactional and analytical workloads. And Hyperscale (Citus) just keeps getting better and better.

 

The heart of Hyperscale (Citus) is the open source Citus extension which extends Postgres with distributed database superpowers. Every few months we release a new version of Citus. I’m excited to tell you that the latest release, Citus 10, is now available in preview on Hyperscale (Citus) and comes with spectacular new capabilities:

 

  • Columnar storage for Postgres: Compress your PostgreSQL and Hyperscale (Citus) tables to reduce storage cost and speed up your analytical queries!
  • Sharding on a single Citus node (Basic Tier): With Basic Tier, you can shard Postgres on a single node, so your application is “scale-out ready”. Also handy for trying out Hyperscale (Citus) at a much lower price point, starting at $0.27 USD/hour.[1]
  • Joins and foreign keys between local PostgreSQL tables and Citus tables: Mix and match PostgreSQL and Hyperscale (Citus) tables with foreign keys and joins.
  • Function to change the way your tables are distributed: Redistribute your tables in a single step using new alter table functions.
  • Much more: Better naming, improved SQL & DDL support, simplified operations.

 

These new Citus 10 capabilities change what Hyperscale (Citus) can do for you in some fundamental (and useful) ways.

 

With Citus 10, Hyperscale (Citus) is no longer just about sharding Postgres: you can use the new Citus columnar storage feature to compress large data sets. And Citus is no longer just about multi-node clusters: with Basic Tier in Hyperscale (Citus), you can now shard on a single node to be “scale-out-ready”. Finally, Hyperscale (Citus) is no longer just about transforming Postgres into a distributed database: you can now mix regular (local) Postgres tables and distributed tables in the same Postgres database.

 

In short, Hyperscale (Citus) in Azure Database for PostgreSQL now empowers you to run Postgres at any scale.

 

Let’s dive in!

 

One of our favourite pieces of Postgres memorabilia is the PostgreSQL 9.2 race car poster with the signatures of all the committers from the PGCon auction in 2013. Since Citus 9.2, our open source team has been creating a new racecar image for each new Citus open source release. With Citus 10 giving you columnar, single node (Basic tier), & so much more, the Postgres elephant can now go to any scale!

 

Columnar storage for PostgreSQL with Hyperscale (Citus)

 

The data sizes of some new Hyperscale (Citus) customers are truly gigantic, which meant we needed a way to lower storage cost and get more out of the hardware. That is why we implemented columnar storage for Citus. Citus Columnar can give you compression ratios of 3x-10x or more, and even greater I/O reductions. The new Citus columnar feature is available in:

 

  • Citus 10 open source: you can download the latest Citus packages here
  • Hyperscale (Citus) in Azure Database for PostgreSQL: at the time of writing, the Citus 10 features are in preview in Hyperscale (Citus). So if you want to try out the new Citus columnar feature, you’ll want to turn the preview features on in the portal when provisioning a new Hyperscale (Citus) server group. Of course, depending on when you read this blog post, these Citus 10 features might already be GA in Hyperscale (Citus).

 

The best part: you can use columnar in Hyperscale (Citus) with or without the Citus scale-out features! More details about columnar table storage can be found in our Hyperscale (Citus) docs.

 

Our Citus engineering team has a long history with columnar storage in PostgreSQL, as we originally developed the cstore_fdw extension which offered columnar storage via the foreign data wrapper (fdw) API. PostgreSQL 12 introduced “table access methods”, which allow extensions to define custom storage formats in a much more native way.

 

Citus makes columnar storage available in PostgreSQL via the table access method APIs, which means that you can now create Citus columnar tables by simply adding USING columnar when creating a table:

 

CREATE TABLE order_history (…) USING columnar;

 

If you provision a row-based (“heap”) table that you’d like to later convert to columnar, you can do that too, using the alter_table_set_access_method function:

 

-- compress a table using columnar storage
SELECT alter_table_set_access_method('orders_2019', 'columnar');

 

When you use Citus columnar storage, you will typically see a 60-90% reduction in data size. In addition, Citus columnar will only read the columns used in the SQL query. This can give dramatic speed ups for I/O bound queries, and a big reduction in storage cost.

 

Compared to cstore_fdw, Citus columnar has a better compression ratio thanks to zstd compression. Citus columnar also supports rollback, streaming replication, archival, and pg_upgrade.
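Columnar settings can also be tuned per table. A minimal sketch, assuming an existing columnar table named order_history (the function and parameter are as documented for Citus 10, but option names may vary between Citus versions):

```sql
-- switch the compression codec for one columnar table (sketch;
-- 'zstd' is the default in Citus 10, shown here for illustration)
SELECT alter_columnar_table_set('order_history', compression => 'zstd');
```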

 

There are still a few limitations with Citus columnar to be aware of: update and delete are not yet supported, and it is best to avoid single-row inserts, since compression only works well in batches. We plan to address these limitations in future Citus releases, but you can also avoid them using partitioning.

 

[UPDATE] As of Citus 10.2, btree and hash indexes are now supported on columnar tables.

 

If you partition time series tables by time, you can use row-based storage for recent partitions to enable single-row inserts, updates/deletes/upserts, and indexes, while using columnar storage to archive data that is no longer changing. To make this easy, we also added a function to compress all your old partitions in one go:

 

-- compress all partitions older than 7 days
CALL alter_old_partitions_set_access_method('order_history', now() - interval '7 days', 'columnar');

 

This procedure commits after every partition to release locks as quickly as possible. You can use pg_cron to run this new alter function as a nightly compression job.
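A nightly pg_cron job for this could look like the following sketch (the job name and schedule string are illustrative; cron.schedule is pg_cron's scheduling function):

```sql
-- schedule nightly compression at 03:00 via pg_cron (illustrative)
SELECT cron.schedule(
  'compress-old-partitions',   -- hypothetical job name
  '0 3 * * *',                 -- every day at 03:00
  $$CALL alter_old_partitions_set_access_method('order_history', now() - interval '7 days', 'columnar')$$
);
```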

 

To learn more, check out Jeff Davis’ blog post: Citus 10 brings columnar compression to Postgres. Jeff also created a video demo; if you’re a more visual person, this Citus columnar demo might be a good way to get acquainted.

 

Starting with Basic Tier in Hyperscale (Citus)—to be “scale-out ready”

 

We often think of Hyperscale (Citus) as “worry-free Postgres”, because Citus takes away the one concern you may have when choosing Postgres as your database: reaching the limits of a single node. However, when you migrate a complex application from Postgres to Hyperscale (Citus), you may need to make some changes to your application to handle restrictions around unique constraints, foreign key constraints, and joins, since not every PostgreSQL feature has an efficient distributed implementation.

 

In Azure, the easiest way to scale your application on Postgres without ever facing the cost of migration (and be truly worry-free) is to use Hyperscale (Citus) from day one, when you first build your application. Applications built on Citus are always 100% compatible with regular PostgreSQL, so there is no risk of lock-in. The only downside of starting on Hyperscale (Citus) so far was the cost and complexity of running a distributed database cluster, but this changes in Citus 10. With Citus 10 and the new Basic tier in Hyperscale (Citus), you can now shard your Postgres tables on a single Citus node to make your database “scale-out-ready”.

 

To get started with Hyperscale (Citus) on a single node, see this post about sharding Postgres with Basic tier. Be sure to enable preview features in the Azure portal when provisioning Azure Database for PostgreSQL—and then select the new “Basic Tier” feature that’s available in preview on Hyperscale (Citus). As of today, you can provision Basic tier for $0.27 USD/hour in US East 1. This means you can try out Hyperscale (Citus) at a much lower price point: after ~8 hours of kicking the tires, you’ll only pay $2-3 USD.

 

Diagram 1: When provisioning the Hyperscale (Citus) deployment option in the Azure portal for Azure Database for PostgreSQL, you’ll now have two choices: Basic Tier and Standard Tier.

 

Once connected, you can create your first distributed table by running the following commands:

 

CREATE TABLE data (key text primary key, value jsonb not null);
SELECT create_distributed_table('data', 'key');

 

The create_distributed_table function will divide the table across 32 (hidden) shards that can be moved to new nodes when a single node is no longer sufficient.
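If you want a different shard count, the citus.shard_count setting can be adjusted before distributing a table. A sketch (64 is an arbitrary value, and the table name is hypothetical):

```sql
-- choose a different shard count before distributing (illustrative)
SET citus.shard_count = 64;
SELECT create_distributed_table('events', 'device_id');
```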

 

You may experience some overhead from distributed query planning, but you will also see benefits from multi-shard queries being parallelized across cores. You can also create distributed columnar tables to combine the I/O and storage reductions of columnar with parallelism.

 

The biggest advantage of distributing Postgres tables with Basic tier in Hyperscale (Citus) is that your database will be ready to be scaled out using the Citus shard rebalancer.
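Once you have added worker nodes, invoking the rebalancer moves shards onto them. A sketch (the function is part of the Citus rebalancer; exact behavior and options depend on your Citus version):

```sql
-- after adding a worker node, spread existing shards across the cluster
SELECT rebalance_table_shards();
```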

 

Joins & foreign keys between PostgreSQL and Citus tables

 

With the new Basic Tier feature in Hyperscale (Citus) and the shard rebalancer, you can be ready to scale out by distributing your tables. However, distributing tables does involve certain trade-offs, such as extra network round trips when querying shards on worker nodes, and a few unsupported SQL features.

 

If you have a very large Postgres table and a data-intensive workload (e.g. the frequently-queried part of the table exceeds memory), then the performance gains from distributing the table over multiple nodes with Hyperscale (Citus) will vastly outweigh any downsides. However, if most of your other Postgres tables are small, then you might end up having to make additional changes without much additional benefit.

 

A simple solution would be to not distribute the smaller tables at all. In most Hyperscale (Citus) deployments, your application connects to a single coordinator node (which is usually sufficient), and the coordinator is a fully functional PostgreSQL node. That means you could organize your database as follows:

 

  • convert large tables into Citus distributed tables,
  • convert smaller tables that frequently JOIN with distributed tables into reference tables,
  • convert smaller tables that have foreign keys from distributed tables into reference tables,
  • keep all other tables as regular PostgreSQL tables local to the coordinator.
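Using the example schema from Diagram 2, that organization might look like this sketch (table and column names are illustrative):

```sql
-- large fact table: distribute it across shards
SELECT create_distributed_table('clicks', 'ad_id');

-- small table frequently joined with / referenced by clicks: replicate it
SELECT create_reference_table('ads');

-- campaigns, publishers, advertisers stay as regular local Postgres tables
```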

 

Diagram 2: Example of a data model where the really large table (clicks) is distributed. Because the Clicks table has a foreign key to Ads, we turn Ads into a reference table. Ads also has foreign keys to other tables, but we can keep those other tables (Campaigns, Publishers, Advertisers) as local tables on the coordinator.

 

That way, you can scale out CPU, memory, and I/O where you need it, and minimize application changes and other trade-offs where you don’t. To make this model work seamlessly, Citus 10 adds support for two important features:

 

  • foreign keys between local tables and reference tables
  • direct joins between local tables and distributed tables
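A sketch of what these two features enable, reusing the hypothetical Diagram 2 schema (ads is a reference table; campaigns and clicks are a local and a distributed table, respectively):

```sql
-- foreign key from a reference table to a local table
ALTER TABLE ads
  ADD FOREIGN KEY (campaign_id)
  REFERENCES campaigns (id);   -- campaigns is local to the coordinator

-- direct join between a local table and a distributed table
SELECT c.name, count(*)
FROM campaigns c
JOIN ads a ON a.campaign_id = c.id
JOIN clicks k ON k.ad_id = a.id   -- clicks is distributed
GROUP BY c.name;
```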

 

With these new Citus 10 features in Hyperscale (Citus), you can mix and match PostgreSQL tables and Citus tables to get the best of both worlds without having to separate them in your data model.

 

Alter all the things!

 

When you distribute a Postgres table with Hyperscale (Citus), choosing your distribution column is an important step, since the distribution column (sometimes called the sharding key) determines which constraints you can create, how (fast) you can join tables, and more.

 

Citus 10 adds the alter_distributed_table function so you can change the distribution column, shard count, and co-location of a distributed table. This blog post walks through how, when, and why to use alter_distributed_table with Hyperscale (Citus).

 

-- change the distribution column to customer_id
SELECT alter_distributed_table('orders',
                               distribution_column := 'customer_id'); 

-- change the shard count to 120
SELECT alter_distributed_table('orders',
                               shard_count := 120);

-- co-locate with another table
SELECT alter_distributed_table('orders',
                               distribution_column := 'product_id', 
                               colocate_with := 'products');

 

Internally, alter_distributed_table reshuffles the data between the worker nodes, which means it is fast and works well on very large tables. We expect this to make it much easier to experiment with distributing your tables without having to reload your data.

 

You can also use the alter_distributed_table function in production (it’s fully transactional!), but you do need to (1) make sure that you have enough disk space to store the table several times, and (2) make sure that your application can tolerate blocking all writes to the table for a while.

 

Many other features in Citus 10—now available in preview in Hyperscale (Citus)

 

And there’s more!

  • DDL support
    More DDL commands work seamlessly on distributed Citus tables, including CREATE STATISTICS, ALTER TABLE .. SET LOGGED, and ALTER SCHEMA .. RENAME.

  • SQL support
    Correlated subqueries can now be used in the SELECT part of the query, as long as the distributed tables are joined by their distribution column.

  • New views to see the state of your cluster: citus_tables and citus_shards
    The citus_tables view shows Citus tables along with their distribution column, total size, and access method. The citus_shards view shows the names, locations, and sizes of individual shards.
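For example, you could inspect the cluster like this (column names as documented for these views; the exact output shape may differ by Citus version):

```sql
-- overview of all Citus tables and how they are distributed
SELECT table_name, citus_table_type, distribution_column, table_size
FROM citus_tables;

-- where each individual shard lives and how big it is
SELECT shard_name, nodename, nodeport, shard_size
FROM citus_shards;
```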

 

Two easy ways to start playing with Citus 10

 

If you are as excited as we are and want to play with these new Citus 10 features, doing so is now easier than ever.

 

  1. The Basic Tier in Hyperscale (Citus) makes it very cheap to get started with a managed Citus node in our Azure Database for PostgreSQL managed service. (There’s a Basic tier Quickstart in our Azure docs, too.) 

  2. And you can also run Citus open source on your laptop as a single Docker container! Not only is the single docker run command an easy way to try out Citus—it gives you functional parity between your local dev machine and using Citus in the cloud.

 

# run PostgreSQL with Citus on port 5500
docker run -d --name citus -p 5500:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

# connect
psql -U postgres -d postgres -h localhost -p 5500

 

You can also check out our lovely new Getting started with Citus page for more resources on how to get started—my teammates have curated some good learning tools there, whether your preferred learning mode is reading, watching, or doing.

 

More deep-dive blog posts about new Citus 10 capabilities

 

And since the Citus 10 open source release rolled out, we’ve also published a bunch of deep-dive blog posts (plus a demo!) about the spectacular new capabilities in Citus 10:

 

 

Finally, a big thank you to all of you who use Hyperscale (Citus) to scale out Postgres and who have taken the time to give feedback and be part of our journey. If you’ve filed issues on GitHub, submitted PRs, talked to our @citusdata or @AzureDBPostgres team on Twitter, signed up for our monthly Citus technical newsletter, or joined our Citus Public community Q&A... well, thank you. And please, keep the feedback coming. You can always reach our product team via the Ask Azure DB for PostgreSQL email address too.

We can’t wait to see what you do with the new Citus 10 features in Hyperscale (Citus)!


Footnotes

  1. As of the time of publication, in the East US region on Azure, the cost of a Hyperscale (Citus) Basic tier with 2 vCores, 8 GiB total memory, and 128 GiB of storage on the coordinator node is $0.27/hour, or ~$200/month.
Last update: Sep 29 2021