Categories
Uncategorized

Data consistency issues in logical replication

In contrast to physical (streaming) replication, write access on subscriber node is not disabled. This means that DML and DDL statements like INSERT, UPDATE, DELETE or ALTER can be executed on subscriber. As the replication is unidirectional, however, these changes won’t be replicated back to publisher. This behavior can lead to problems – the replication can stop with all its consequences or the data can become out of sync. In this post, I’d like to share some tips how to deal with such problems.

Scenario 1: data INSERTed on subscriber side

Data can be inserted into a replication target table without any problems, provided that a row with identical REPLICA IDENTITY wont’ be subsequently inserted on publisher node. For example, consider the following table:

CREATE TABLE mydata(id integer primary key, value integer, description varchar);

its REPLICA IDENTITY will be using the “id” primary key, which is the default.

You can safely execute the following INSERT on subscriber:

INSERT INTO mydata VALUES (100, 1024, 'test');

…but if you execute an INSERT with identical primary key value on the publisher,

INSERT INTO mydata VALUES (100, 65536, 'test on publisher');

you will get an error on subscriber node:

ERROR: duplicate key value violates unique constraint "mydata_pkey"
DETAIL: Key (id) = (100) already exists.

The replication will stop, and WAL segments will start to pile on publisher. Luckily, the log contains all the required information: you have to DELETE the problematic row on the subscriber:

DELETE FROM mydata WHERE id=100;

and the replication will restart automatically.

Scenario 2: data UPDATEd on subscriber side

As long as required privileges are granted, UPDATE statements can be executed on subscriber without any error or warning. The replication will not be halted. The publisher doesn’t know about the change, and as the result, the values in the same row can differ between publisher and subscriber. But if the same row gets updated on publisher, the whole row will be overwritten on subscriber, and the change will be gone – even if the changes were made to different columns.

Scenario 3: data DELETEd on subscriber side

This kind of inconsistency also won’t cause replication to stop, and if your log_min_messages configuration parameter is set to lower value than “debug1”, chances are that you won’t notice it. If you do log debug messages and update/delete a row on publisher which doesn’t exist on subscriber, you should expect following log entries on the subscriber node:

DEBUG: logical replication did not find row for update in replication target relation "mydata"
DEBUG: logical replication did not find row for delete in replication target relation "mydata"

Sometimes keeping smaller dataset on subscriber is desired and such messages can be safely ignored. But what if it’s an inconsistency and has to be resolved? Unfortunately, native logical replication – unlike pglogical – doesn’t provide tools to resynchronize the table. But there is a workaround…

How to resynchronize a table in PostgreSQL logical replication?

Using officially documented tools only, you can either:

  1. Lock the table for writes on publisher, dump the data and copy this dump to subscriber, truncate table on subscriber, restore a data dump on subscriber and remove the lock on publisher.
  2. Exclude the table from current publication, truncate table on subscriber, and create a new publication-subscription pair.

But if you don’t mind editing the system catalogs by hand, there is another possibility…

WARNING: black magic ahead. Proceed at your own risk. I’m not resposible for any data loss or corruption resulting from following this advice. Always have a backup in case things go wrong. You have been warned.

The state of logically replicated tables is kept in a pg_subscription_rel table. The “srsubstate” column indicates if the table is being initialized, data is being copied, synchronized or if the replication is ready. When you change its value to ‘i’, meaning “initial”, you can force the initial sync to be started. So, in order to perform a resync a table called “mydata”, execute following statements on subscriber:

TRUNCATE TABLE mydata;
UPDATE pg_subscription_rel SET srsubstate = 'i' WHERE srrelid = (SELECT relid FROM pg_stat_user_tables WHERE schemaname='public' AND relname='mydata');

(if the table is in another schema, replace “public” with relevant schema name)

After a while, check the pg_subscription_rel table – the state should automatically change to ‘d’, next to ‘s’ and finally ‘r’, meaning that the resync is complete. If the state hangs at ‘d’, check the logs and pg_stat_replication at publisher node – if the replication fails due to duplicate primary key value, purge the data on subscriber, check if the row count is actually zero, and repeat the pg_subscription_rel update.