This is a guest post co-authored by Fabrízio Mello and Álvaro Hernández, from OnGres.

Envoy 1.15 was released on July 6th, and it included an interesting mention in the release notes: a new plugin, for Postgres! This post will introduce why this plugin was developed, what functionality it currently implements, and what the roadmap looks like for future releases. For the TL;DR: head to the plugin documentation to see what it can do as of today and how to use it.

The main goal of the plugin is to be able to decode and interpret Postgres’ wire protocol format and extract valuable information from the Postgres traffic. This, in turn, may allow us to extract metrics, offload work from the Postgres server, or route Postgres traffic to different nodes. And all without requiring any configuration, extension or work on the Postgres server!

Motivation and initial design of the plugin

All the initial motivation and initial design considerations were explained in detail in issue #9107. Quoting from that issue:

The main goal of this project is to add network observability to PostgreSQL. By decoding the PostgreSQL wire protocol within the Envoy filter, it would be possible to extract relevant information about the activity and the performance of a Postgres server, like the number of queries per unit of time, types of queries and the performance of those queries.

A seasoned Postgres DBA may argue that this information can already be obtained from Postgres itself: multiple monitoring tools and queries against Postgres’ meta-information expose it. While certainly true, these approaches come with several drawbacks.

The main advantage of adding Postgres observability to the Envoy proxy is that it is totally transparent to Postgres: no installation, configuration, or restart is required on Postgres servers, and it adds no load to them. Additionally, Envoy supports counters and histograms, so metrics can be captured continuously.

The plugin features in Envoy 1.15

This first release of the plugin adds observability to Postgres. In essence, it decodes the Postgres traffic that flows through it and publishes metrics to Prometheus (or any other metrics sink you have configured for Envoy): counters for sessions, frontend/backend messages, statements broken down by type (SELECT, INSERT, UPDATE, DELETE), transactions (commits and rollbacks), notices, and errors.

This is detailed in the Postgres Plugin documentation.

Current limitations

The current implementation cannot decode SSL-encrypted traffic, so metrics are only exposed for non-SSL connections.

How to use it

The Postgres proxy filter should be chained with the TCP proxy as shown in the configuration example below:

...
filter_chains:
- filters:
  - name: envoy.filters.network.postgres_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.postgres_proxy.v3alpha.PostgresProxy
      stat_prefix: postgres
  - name: envoy.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: tcp
      cluster: postgres_cluster

To avoid YAML indentation issues, you can download the configuration example file.

Pay particular attention to the static_resources section: that is where the upstream connection is configured, with the clusters section pointing Envoy to the Postgres server, and where the listeners section defines the filter chain combining the Postgres Proxy and TCP Proxy filters.

In the example, Envoy listens on the local address 127.0.0.1, port 54322, while Postgres is assumed to be already running on port 5432.
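For reference, a complete minimal configuration along those lines could look like the sketch below. It is not the downloadable example file: the listener name, connect_timeout and admin access log path are illustrative choices, while the ports (54322 for the listener, 5432 for Postgres, 8000 for the admin/stats endpoint) and the postgres_cluster name match the rest of this post.

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8000
static_resources:
  listeners:
  - name: postgres_listener
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 54322
    filter_chains:
    - filters:
      - name: envoy.filters.network.postgres_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.postgres_proxy.v3alpha.PostgresProxy
          stat_prefix: postgres
      - name: envoy.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: tcp
          cluster: postgres_cluster
  clusters:
  - name: postgres_cluster
    connect_timeout: 1s
    type: STATIC
    load_assignment:
      cluster_name: postgres_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 5432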

Following these simple steps we’ll be able to get some Postgres metrics from Envoy:

  1. Running Envoy as a frontend proxy for Postgres:
$ getenvoy run standard:1.15.0 -- -c postgres.yaml
  2. Getting Postgres filter statistics using curl:

➜ curl -s http://localhost:8000/stats | grep egress_postgres

postgres.egress_postgres.errors: 0
postgres.egress_postgres.errors_error: 0
postgres.egress_postgres.errors_fatal: 0
postgres.egress_postgres.errors_panic: 0
postgres.egress_postgres.errors_unknown: 0
postgres.egress_postgres.messages: 0
postgres.egress_postgres.messages_backend: 0
postgres.egress_postgres.messages_frontend: 0
postgres.egress_postgres.messages_unknown: 0
postgres.egress_postgres.notices: 0
postgres.egress_postgres.notices_debug: 0
postgres.egress_postgres.notices_info: 0
postgres.egress_postgres.notices_log: 0
postgres.egress_postgres.notices_notice: 0
postgres.egress_postgres.notices_unknown: 0
postgres.egress_postgres.notices_warning: 0
postgres.egress_postgres.sessions: 0
postgres.egress_postgres.sessions_encrypted: 0
postgres.egress_postgres.sessions_unencrypted: 0
postgres.egress_postgres.statements: 0
postgres.egress_postgres.statements_delete: 0
postgres.egress_postgres.statements_insert: 0
postgres.egress_postgres.statements_other: 0
postgres.egress_postgres.statements_select: 0
postgres.egress_postgres.statements_update: 0
postgres.egress_postgres.transactions: 0
postgres.egress_postgres.transactions_commit: 0
postgres.egress_postgres.transactions_rollback: 0
  3. Generating some workload on Postgres using the pgbench tool:
➜ pgbench -i -s10 -d "postgresql://postgres@localhost:54322/postgres?sslmode=disable"
dropping old tables...
creating tables...
generating data...
100000 of 1000000 tuples (10%) done (elapsed 0.06 s, remaining 0.58 s)
200000 of 1000000 tuples (20%) done (elapsed 0.13 s, remaining 0.51 s)
300000 of 1000000 tuples (30%) done (elapsed 0.19 s, remaining 0.45 s)
400000 of 1000000 tuples (40%) done (elapsed 0.26 s, remaining 0.39 s)
500000 of 1000000 tuples (50%) done (elapsed 0.32 s, remaining 0.32 s)
600000 of 1000000 tuples (60%) done (elapsed 0.39 s, remaining 0.26 s)
700000 of 1000000 tuples (70%) done (elapsed 0.45 s, remaining 0.19 s)
800000 of 1000000 tuples (80%) done (elapsed 0.51 s, remaining 0.13 s)
900000 of 1000000 tuples (90%) done (elapsed 0.58 s, remaining 0.06 s)
1000000 of 1000000 tuples (100%) done (elapsed 0.65 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done.

➜ pgbench -c50 -n -T3600 "postgresql://postgres@localhost:54322/postgres?sslmode=disable"

WARNING: SSL must be disabled on the connection in order to decode the raw Postgres protocol, because the current version of the filter cannot decode encrypted traffic.

  4. Checking the Postgres statistics again:
➜ curl -s http://localhost:8000/stats | grep egress_postgres | grep -v ': 0'
postgres.egress_postgres.messages: 58735920
postgres.egress_postgres.messages_backend: 40859932
postgres.egress_postgres.messages_frontend: 17875990
postgres.egress_postgres.sessions: 51
postgres.egress_postgres.sessions_unencrypted: 51
postgres.egress_postgres.statements: 17875889
postgres.egress_postgres.statements_insert: 2553678
postgres.egress_postgres.statements_other: 5107389
postgres.egress_postgres.statements_select: 2553720
postgres.egress_postgres.statements_update: 7661101
postgres.egress_postgres.transactions: 17875888
postgres.egress_postgres.transactions_commit: 17875888

During the workload simulation with pgbench, the Prometheus scraper was collecting metrics from Envoy via the endpoint http://localhost:8000/stats/prometheus. A simple scrape configuration was used on the Prometheus side:

scrape_configs:
- job_name: 'envoy'
  scrape_interval: 5s
  metrics_path: /stats/prometheus
  static_configs:
  - targets: ['localhost:8000']

Below are some example graphs built with simple PromQL queries (sample queries are sketched after the graphs):

1. Transactions per second (TPS)

Graph example using PromQL queries showing transactions per second

2. Reads and Writes (statements) per second

Graph example using PromQL queries showing reads and writes (statements) per second

3. Statements per second

Graph example using PromQL queries showing statements per second

4. Sessions per second

Graph example using PromQL queries showing sessions per second
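As a rough sketch, graphs like these can be built from rate() queries over the exported counters. The metric names below are an assumption based on Envoy’s Prometheus naming (the stat postgres.egress_postgres.transactions becoming envoy_postgres_egress_postgres_transactions); check your /stats/prometheus output for the exact names your setup exposes.

# Transactions per second (metric names assumed; verify against /stats/prometheus)
rate(envoy_postgres_egress_postgres_transactions[1m])

# Writes (INSERT + UPDATE + DELETE) per second
rate(envoy_postgres_egress_postgres_statements_insert[1m])
  + rate(envoy_postgres_egress_postgres_statements_update[1m])
  + rate(envoy_postgres_egress_postgres_statements_delete[1m])

# Reads (SELECT) per second
rate(envoy_postgres_egress_postgres_statements_select[1m])

# Statements per second
rate(envoy_postgres_egress_postgres_statements[1m])

# Sessions per second
rate(envoy_postgres_egress_postgres_sessions[1m])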

What you can do with this plugin today

Add amazing metrics to your Grafana dashboards, as demonstrated above! The information already exported is quite useful for capacity planning and performance troubleshooting, but also for detecting query errors and invalid patterns.

We’re aware of several teams already working on integrating this extension to improve monitoring and add Envoy proxying in front of their Postgres servers. It has also been deeply integrated into StackGres, an open-source Postgres-on-Kubernetes platform that leverages Envoy to proxy all traffic to Postgres and export the performance metrics to Prometheus.

We’d love to hear what cool uses you are considering for this plugin, so please spread the word!

The road ahead

But this is just the beginning. Many more features beyond observability are planned for this plugin, following a staged design in which individual features will be released in successive versions. Issue #9107 already offers a glimpse of some of them (like aggregated query histograms or tracking individual query performance), and others are already under development.

Stay tuned!

Acknowledgments

First and foremost, we’d like to thank Matt Klein, not only for creating Envoy itself, but also for believing in this plugin idea and connecting the team that made this first release happen.

We also thank Dhi Aurrahman for the initial guidance in creating the first POC of this filter, and Christoph Pakulski for all the hard work polishing the code and making it ready to merge into Envoy.

Finally, we’d like to thank our company, OnGres, for sponsoring our work on this project and for contributing it as open source to Envoy.