Use PostgreSQL® DOMAIN rules to validate columns of data

Learn how to validate your data as it goes into your databases to improve data quality

When dealing with data inputs, there's always one or more fields that need strict validation: a T-shirt size can accept only a few precise values, a shoe size should be within a range of numbers, a phone number should follow a precise pattern. It's good to perform these checks at the application level, for example as a part of a form input, but for data consistency it's best to validate that data when adding it to the database as well.

PostgreSQL® allows these checks to be performed in several ways, like using table constraints. In this article, we'll talk about the DOMAIN feature, which defines column validation centrally for use across multiple database tables. We'll showcase the power and the limits of the DOMAIN approach, using some fairly basic examples of verifying Italian addresses in a table.

Centralized vs per-table approach

As mentioned before, the DOMAIN option allows defining the constraints of a column type once and reusing it across several tables. The benefit of a unique definition are evident: there's one truth and one point of contact. Needing to define the same constraints across several tables could result in human error or misalignment during constraint evolution.

On the other side, having a unique definition can be tricky in cases when different tables need to evolve such a definition in different ways or across different times. In such cases, having a unique point of contact can become dangerous. Check the examples below regarding altering and dropping DOMAIN definitions for more info.

Pick a database

For our test case, we can use any kind of PostgreSQL, it being on-premises, Docker, or cloud, since the DOMAIN function is native to PostgreSQL and has been available for a long time. For the purpose of the blog we'll use Aiven for PostgreSQL®. Sign up for an Aiven account if you don't have one already and install the Aiven CLI.

You'll need to authenticate to the CLI before you can create your first service and once authenticated, you can create a service:

avn service create demo-pg-italian-addresses \ --service-type pg \ --plan hobbyist \ --cloud google-europe-west3

The above call creates an Aiven for PostgreSQL service (--service-type pg) named demo-pg-italian-addresses, using the smallest hobbyist plan on the google-europe-west3 region.
We can wait for the service to be up and running with:

avn service wait demo-pg-italian-addresses

And connect to it (this runs psql for you, so that needs to be installed) with:

avn service cli demo-pg-italian-addresses

Define the allowed Italian street values

For the purposes of this demonstration, let's say Italian streets begin with three indicators: Via, Viale, and Piazza. In reality there are more ways to indicate streets in Italy, but let's keep it simple for now.

Let's use the DOMAIN feature to create a domain for these three values, ITALIAN_ADDRESS.

CREATE DOMAIN ITALIAN_ADDRESS TEXT NOT NULL CONSTRAINT ITALIAN_ADDRESS_CHECK CHECK(UPPER(VALUE) ~ '^(VIA|VIALE|PIAZZA) [A-Z \-]+ [0-9]+[A-Z]*$');

If we check the details of the above command we can see:

  • The name is ITALIAN_ADDRESS
  • The data type is TEXT
  • There is a NOT NULL constraint forcing the column to never be empty
  • There is a CONSTRAINT named ITALIAN_ADDRESS_CHECK. This performs a check, based on a regular expression, ensuring the address:
    • Starts with one of VIA, VIALE or PIAZZA identifying the type of street/square
    • Is followed by a space and a string of A-Z, space or - characters identifying the street/square name
    • Is followed by a space one or more numbers (+) identifying the building number
    • Is followed by zero or more letters (*) identifying the block within the building

Use the domain within a table definition

Now that we've defined the domain, we can use it when creating a table. Let's create the table ADDRESSES to host our customer information, such as the customer's name and their address:

CREATE TABLE ADDRESSES( ID SERIAL, NAME TEXT, ADDRESS ITALIAN_ADDRESS );

In the above SQL, the ADDRESS column refers to the ITALIAN_ADDRESS domain. Now we can insert rows, checking that the domain rules are obeyed:

INSERT INTO ADDRESSES(NAME, ADDRESS) VALUES ('UGA', 'PIAZZA DEL CORSO 8A');

The above insert is successful, since it follows the domain definition. But what happens if we try something different, like an address starting with STRADA?

INSERT INTO ADDRESSES(NAME, ADDRESS) VALUES ('UGO', 'STRADA NASCOSTA 5');

We get an error, since the domain check is failing:

ERROR: value for domain italian_address violates check constraint "italian_address_check"

Our domain is working as expected and we can now use it in other table definitions as well. For example if we have a SHOPS table that needs to contain a validated address, we can define it as SHOPS ITALIAN_ADDRESS. This is the beauty of the DOMAIN feature: one central definition, multiple usages.

Set a default value

The domain definition prohibits null values in the column, but we might want to allow an insert that skips the column definition by setting a domain default value:

ALTER DOMAIN ITALIAN_ADDRESS SET DEFAULT 'UNKNOWN ADDRESS';

Now, if we try to add a new customer without any associated address, we might expect a successful insert:

INSERT INTO ADDRESSES(NAME) VALUES ('LUIGI');

Instead we get an error:

ERROR: value for domain italian_address violates check constraint "italian_address_check"

The error is due to the fact that our default value UNKNOWN ADDRESS is not compliant with the constraint defined in the domain (it doesn't start with VIA, VIALE or PIAZZA). If we switch the default to a compatible value like VIA UNKNOWN ADDRESS 1:

ALTER DOMAIN ITALIAN_ADDRESS SET DEFAULT 'VIA UNKNOWN ADDRESS 1';

We can successfully push the customer information without the address now:

INSERT INTO ADDRESSES(NAME) VALUES ('LUIGI');

Checking the data in the ADDRESSES table with:

SELECT * FROM ADDRESSES;

Will show the row for LUIGI with the default address set.

id | name | address ----+-------+----------------------- 1 | UGA | PIAZZA DEL CORSO 8A 4 | LUIGI | VIA UNKNOWN ADDRESS 1 (2 rows)

Note that there is a hierarchy for default values: Data type default -> Domain default -> Table column default.

The data type default value can be overridden by the domain default and the domain default can be overridden by the table column default.

Pay attention when altering domains

Changing domains already in use can be tricky. For example, let's say that tomorrow italy decides that all PIAZZA addresses should be renamed to PIZZA, so we need to update our domain to reflect this. First we remove the old constraint:

ALTER DOMAIN ITALIAN_ADDRESS DROP CONSTRAINT ITALIAN_ADDRESS_CHECK;

Then add the new check, changing PIAZZA to PIZZA:

ALTER DOMAIN ITALIAN_ADDRESS ADD CONSTRAINT ITALIAN_ADDRESS_NEW_CHECK CHECK(UPPER(VALUE) ~ '^(VIA|VIALE|PIZZA) [A-Z \-]+ [0-9]+[A-Z]*$');

But the command above generates an error:

ERROR: column "address" of table "addresses" contains values that violate the new constraint

Why can't we enable the new constraint? Let's have a look at the data that doesn't satisfy the new constraint, with:

SELECT * from ADDRESSES where NOT ADDRESS ~ '^(VIA|VIALE|PIZZA) [A-Z \-]+ [0-9]+[A-Z]*$';

The WHERE clause in that query checks the address against the same regular expression as defined in the domain. The output is:

id | name | address ----+------+--------------------- 1 | UGA | PIAZZA DEL CORSO 8A (1 row)

We can see that the PIAZZA DEL CORSO 8A row doesn't satisfy our new constraint (it should start with PIZZA). Let's define the constraint as NOT VALID for now:

ALTER DOMAIN ITALIAN_ADDRESS ADD CONSTRAINT ITALIAN_ADDRESS_NEW_CHECK CHECK(UPPER(VALUE) ~ '^(VIA|VIALE|PIZZA) [A-Z \-]+ [0-9]+[A-Z]*$') NOT VALID;

The NOT VALID clause, allows us to add the constraint but not validate the rows in the table. If we check the validity:

ALTER DOMAIN ITALIAN_ADDRESS VALIDATE CONSTRAINT ITALIAN_ADDRESS_NEW_CHECK;

We still get the same error since we didn't update the data:

ERROR: column "address" of table "addresses" contains values that violate the new constraint

We need to update the data in the table:

UPDATE ADDRESSES SET ADDRESS=REPLACE(ADDRESS, 'PIAZZA', 'PIZZA') WHERE ADDRESS LIKE 'PIAZZA%';

Then check the data with SELECT * FROM ADDRESSES;

id | name | address ----+-------+----------------------- 1 | UGA | PIZZA DEL CORSO 8A 4 | LUIGI | VIA UNKNOWN ADDRESS 1 (2 rows)

The UGA row should now be in line with the new constraint, so we can validate it:

ALTER DOMAIN ITALIAN_ADDRESS VALIDATE CONSTRAINT ITALIAN_ADDRESS_NEW_CHECK;

Success! Now the check is enabled, therefore trying to insert a PIAZZA address is prohibited.

INSERT INTO ADDRESSES (NAME, ADDRESS) VALUES ('CARLA', 'PIAZZA MAGGIORE 33');

The above SQL throws the error:

ERROR: value for domain italian_street violates check constraint "italian_street_new_check"

Besides this example, another catch with the ALTER DOMAIN command is that any expression or function we add as part of the domain creation should be immutable. If we have a user-defined function as part of the CREATE DOMAIN expression, and we're able to modify the expression/function after domain creation, we'll get an error when reloading the data in that database.

Pay attention when dropping

What happens when dropping a domain used in one or more tables?

DROP DOMAIN ITALIAN_ADDRESS;

We get the following error, signaling that there are dependencies

ERROR: cannot drop type italian_address because other objects depend on it DETAIL: column address of table addresses depends on type italian_address HINT: Use DROP ... CASCADE to drop the dependent objects too.

If we really want to drop it we can use the CASCADE option:

DROP DOMAIN ITALIAN_ADDRESS CASCADE;

Check the table:

SELECT * FROM ADDRESSES;

No ADDRESS column anymore!

id | name ----+------- 1 | UGA 4 | LUIGI (2 rows)

It's difficult to drop an in-use domain without affecting the rest of the database tables. The beauty of centrally managed column definition is playing against us in this scenario. Since the definition is unique, any table using it will have columns disappear if the related definition gets deleted.

Conclusion

PostgreSQL domains are a powerful feature, allowing a central definition of column validation rules that can be then applied across several tables. The evolution of the definition can be tricky, but for standard, non-evolving columns it can represent a good way to centralize the definition and checks.

Check out some additional resources on the topic: