Good database maintenance includes not only the standard adding, updating, and deleting of records but periodic edits to the table schema as well. Operations such as adding, editing, and removing table columns are part of today’s life-cycle reality, as new functionality is constantly being added.

In quieter, less demanding times, one could get away with schema updates with minimal impact by performing the operation during periods of low load. However, as database systems have become more critical to the business bottom line, maintenance windows have become, by necessity, much smaller and more time-sensitive.

So the question is asked: How can one update a multi-terabyte table with near-zero downtime?

Looking deep into our Postgres bag of tricks, we reach not for the most recent and advanced features but instead leverage a very old capability that has been part of Postgres since it was first released as an open source project: its object-relational implementation of inheritance.

Use Case

Before going into the details of the solution, let’s define just what we are trying to solve.

Strictly speaking, there are two use cases that come to mind when using inheritance as the prime ETL data migration mechanism:

  • Removing table OIDs, such as when moving to Postgres version 12 or greater.
  • Performing DML/DDL operations which include:
    • updating the data
    • adding or removing table columns

Life starts getting complicated when dealing with issues such as updating data types. But we’ll talk about this at the end of the blog.

About Inheritance

Unlike object-oriented programming languages, where the child inherits attributes from the parent, in Postgres it’s the parent that, in a sense, inherits from the child (if only this were true in real life ;-)). While a child table inherits its column definitions from the parent, the data flows the other way: rows inserted into a child have the potential of becoming visible in the parent relation.

Consider the following snippet of code: Two parents and three children are created and populated with records:
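The original snippet isn’t reproduced here, but a minimal sketch along these lines, using the hypothetical names parent_1, parent_2, and child_1 through child_3, behaves the same way:

    -- two parents sharing the column c2, which is merged in any child
    -- inheriting from both
    CREATE TABLE parent_1 (c1 int, c2 text);
    CREATE TABLE parent_2 (c2 text, c3 date);

    -- three children; child_2 and child_3 declare extra columns of their own
    CREATE TABLE child_1 ()       INHERITS (parent_1);
    CREATE TABLE child_2 (c4 int) INHERITS (parent_1, parent_2);
    CREATE TABLE child_3 (c5 int) INHERITS (parent_2);

    INSERT INTO child_1 (c1, c2)         VALUES (1, 'one');
    INSERT INTO child_2 (c1, c2, c3, c4) VALUES (2, 'two', CURRENT_DATE, 4);
    INSERT INTO child_3 (c2, c3, c5)     VALUES ('three', CURRENT_DATE, 5);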

Columns declared in the parents must therefore exist in the child, i.e., they are merged. But notice that columns unique to a child are not propagated to the parents:
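A quick check against the sketch above confirms both behaviors:

    -- c2 was declared in both parents but appears only once in child_2
    \d child_2

    -- a column unique to child_2 is invisible through the parent
    SELECT c4 FROM parent_1;
    -- ERROR:  column "c4" does not exist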


And even though records populated in the child can be seen by the parent, the reverse is NOT true: records populated into the parent are not seen by the child:
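Continuing with the same hypothetical tables:

    -- rows inserted into the children are visible through the parent
    SELECT * FROM parent_1;        -- returns the rows from child_1 and child_2

    -- but a row inserted directly into the parent stays there
    INSERT INTO parent_1 (c1, c2) VALUES (0, 'zero');
    SELECT * FROM child_1;         -- still returns only (1, 'one')
    SELECT * FROM ONLY parent_1;   -- returns just (0, 'zero')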

Developing The ETL/Migration Model

Performing data migration under production conditions should take into consideration these four (4) distinct query operations:

  • SELECT from both the target and source tables at the same time.
  • UPDATE and/or DELETE records from both the target and source tables.
  • INSERT new records into the target table.
  • Moving data from the source to the target tables.

For the sake of discussion we’ll demonstrate using one source table and one target table, each inheriting from a single parent:
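Something along these lines, with the hypothetical names t, t_source, and t_target:

    -- the parent keeps the name the application already queries
    CREATE TABLE t (c1 int, c2 text);

    -- the source holds the existing data; the target starts out empty
    CREATE TABLE t_source () INHERITS (t);
    CREATE TABLE t_target () INHERITS (t);

    -- seed the source with some records
    INSERT INTO t_source
        SELECT id, 'row ' || id FROM generate_series(1, 10000) AS id;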

Querying Both Target And Source Tables

Inheritance makes the SELECT query a straightforward operation. This query checks all the tables for the queried record(s):
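For example, using the tables above:

    -- one query transparently covers the parent and both children
    SELECT * FROM t WHERE c1 = 42;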

UPDATE and/or DELETE Records

Similarly to SELECT queries, one doesn’t have to worry about editing the existing application’s DML operation(s) when performing UPDATE and DELETE. Again, notice how both source and target tables are queried as well as the parent:
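Again, hypothetical but representative:

    -- both statements reach rows wherever they currently live
    UPDATE t SET c2 = 'edited' WHERE c1 = 42;
    DELETE FROM t WHERE c1 = 13;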

INSERT New Records Into The Target Table

The thing to keep in mind about INSERT is that without a redirect mechanism all records are inserted into the parent.

Since everybody already knows about triggers, I thought it would be fun to use a REWRITE RULE instead:
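A minimal rule, assuming the tables sketched above, might look like this:

    -- redirect any INSERT aimed at the parent into the target child
    CREATE RULE t_redirect AS
        ON INSERT TO t
        DO INSTEAD
            INSERT INTO t_target VALUES (NEW.*);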

And here’s our validation; notice how the INSERT is redirected from parent to target:
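Using the rule sketched above:

    INSERT INTO t (c1, c2) VALUES (99999, 'redirected');

    SELECT * FROM ONLY t;     -- empty: nothing lands in the parent itself
    SELECT * FROM t_target;   -- (99999, 'redirected')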

Moving Records Between Source And Target Tables

It’s time to introduce the last mechanism needed to perform the actual data migration. Essentially, the data is moved in batches; if you can instead afford the downtime of moving your records in one very large transaction, this dog and pony show is a bit redundant.

In the real world, we need to anticipate multiple processes attempting EXCLUSIVE LOCKS simultaneously. In the case that one or more records are locked by another operation, the following example demonstrates how one can simply skip over them:
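A sketch of the batch move, again with the hypothetical tables above; FOR UPDATE SKIP LOCKED is what lets concurrently locked rows be passed over:

    -- claim up to 1,000 unlocked rows, delete them from the source,
    -- and insert them into the target in a single transaction
    WITH batch AS (
        SELECT c1
        FROM t_source
        LIMIT 1000
        FOR UPDATE SKIP LOCKED
    ), moved AS (
        DELETE FROM t_source s
        USING batch
        WHERE s.c1 = batch.c1
        RETURNING s.*
    )
    INSERT INTO t_target
    SELECT * FROM moved;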

Putting It All Together

It’s time to demonstrate a Proof Of Concept using pgbench.

SETUP

Initialize database db02:
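Assuming a scratch instance; the scale factor here is arbitrary:

    createdb db02
    pgbench -i -s 10 db02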

Create the parent and child tables, as sketched after the note below.

NOTE: In order to demonstrate data migration from a deprecated table, table pgbench_accounts is altered by adding OIDs.
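The original DDL isn’t reproduced here; one plausible arrangement, on a pre-Postgres-12 server (where WITH OIDS is still legal), is to demote pgbench_accounts to a child and put an empty parent in its place:

    -- the deprecated source table gets OIDs
    ALTER TABLE pgbench_accounts SET WITH OIDS;

    -- the old table steps aside and becomes the source child
    ALTER TABLE pgbench_accounts RENAME TO pgbench_accounts_source;
    CREATE TABLE pgbench_accounts (LIKE pgbench_accounts_source);
    ALTER TABLE pgbench_accounts_source INHERIT pgbench_accounts;

    -- the OID-free target child
    CREATE TABLE pgbench_accounts_target () INHERITS (pgbench_accounts);

    -- redirect new INSERTs away from the parent
    CREATE RULE pgbench_accounts_redirect AS
        ON INSERT TO pgbench_accounts
        DO INSTEAD
            INSERT INTO pgbench_accounts_target VALUES (NEW.*);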

TEST

This query is at the heart of the solution. Any exclusively locked record that it tries to move is automatically skipped, and a new attempt can be made the next time the script is invoked.
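Adapted to the POC tables, and assuming the names from the setup sketch:

    -- move up to 1,000 unlocked rows from source to target
    WITH batch AS (
        SELECT aid
        FROM pgbench_accounts_source
        LIMIT 1000
        FOR UPDATE SKIP LOCKED
    ), moved AS (
        DELETE FROM pgbench_accounts_source s
        USING batch
        WHERE s.aid = batch.aid
        RETURNING s.aid, s.bid, s.abalance, s.filler
    )
    INSERT INTO pgbench_accounts_target
    SELECT * FROM moved;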

VALIDATE
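A quick sanity check, again with the sketched names: the parent should stay empty, the source should shrink, and the target should grow.

    SELECT (SELECT count(*) FROM ONLY pgbench_accounts)        AS parent,
           (SELECT count(*) FROM ONLY pgbench_accounts_source) AS source,
           (SELECT count(*) FROM ONLY pgbench_accounts_target) AS target;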

MIGRATION SCRIPT

The query has been incorporated into a script that moves 1,000 records every 5 seconds.
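The script itself isn’t reproduced here; a minimal psql equivalent appends \watch to the batch-move query, which re-executes it on a fixed interval until cancelled:

    -- migrate.sql: the batch-move query from the TEST section, followed by:
    \watch 5

    -- invoked as:  psql db02 -f migrate.sql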

BENCHMARKING

All the while the aforementioned script is active, pgbench runs the benchmark.
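Something like the following, where the client count and duration are arbitrary:

    pgbench -c 4 -j 2 -T 600 db02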

CAVEAT

This is a simple and powerful method, but there is a limitation: columns common to the source and target tables must be of the same datatype.

For example, if column c1 in table source is of datatype int and you want to migrate the data into table target, where the same column c1 is of datatype bigint, then this method won’t work. An alternate solution, however, could take advantage of Updatable Views combined with the appropriate triggers and rewrite rules.
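A very rough sketch of that alternative, with hypothetical names and an INSTEAD OF trigger standing in for whatever DML handling a real migration would need:

    CREATE TABLE source (c1 int,    c2 text);
    CREATE TABLE target (c1 bigint, c2 text);

    -- the view presents one relation, casting the narrower column on the fly
    CREATE VIEW v AS
        SELECT c1::bigint AS c1, c2 FROM source
        UNION ALL
        SELECT c1, c2 FROM target;

    -- a UNION ALL view is not auto-updatable, so new rows are steered
    -- into the target table by an INSTEAD OF trigger
    CREATE FUNCTION v_insert() RETURNS trigger AS $$
    BEGIN
        INSERT INTO target VALUES (NEW.c1, NEW.c2);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER v_insert INSTEAD OF INSERT ON v
        FOR EACH ROW EXECUTE FUNCTION v_insert();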
