Adventures in collating text in Linux, Docker and Postgres

7 minute read

It’s interesting how bugs can lead to a deeper understanding of how computer systems work. It’s also amazing how nuanced some technical topics can be. For instance, one would think that sorting text should have the same behaviour in any English-like locale, right? Wrong. Join me on a dive down a rabbit hole that I had at work a while ago.

Linux, Docker and Postgres collation

All good problems start with a failing test

A weird problem turned up in one of our projects: the test suite was passing on my local dev box and on our Jenkins instance, but not within a Docker container running on our GitLab-CI system1.

The failing test checked that a list was in alphabetical order. In particular, we expected the list to be in this order:2

'Fast Ice', 'FYI (First Year Ice)', 'MYI (Multi Year Ice)'

However debugging the output from the Docker container on GitLab-CI gave this ordering:

'FYI (First Year Ice)', 'Fast Ice', 'MYI (Multi Year Ice)'

After first having thought “Err, what??”, somewhere at the back of my brain I had another thought: “Hrm, I’ve seen something like this before”. I find this happens quite often now. Perhaps this is one of the few advantages to getting older: there’s much more experience to draw from when debugging.

Many ideas to try to get the sorting sorted

So, I was fairly sure that the problem looked like a collation issue (i.e. the sorting order of text and something that is often related to the locale).

The first thing I checked was the LC_ALL setting, which can be used to override all locale settings. This wasn’t the problem, because it wasn’t set anywhere. An interesting part of working out if LC_ALL was a problem was finding out that one shouldn’t use LC_ALL.

How about the LANG setting (i.e. the default locale)? On my dev box I set it to en_NZ.UTF-8 (because I can); on our production systems we usually use en_US.UTF-8 because, well, that’s the most common English variant on computer systems these days. Checking the output of the locale command didn’t show up anything that should be causing problems.

Setting LANG=C in the environment also didn’t change anything, which I thought odd, because the C (POSIX) sorting order is slightly different from the en_*.UTF-8 sorting orders. Hrm.

Ok, what’s the next thing to try? Maybe the problem was UTF8? I mean, it’s really unlikely, but still possible, right? Also no change; using a non-UTF-8 locale didn’t fix the failing test. Oh well.

Maybe the environment was being reset in each command in the GitLab-CI config? I.e. what about setting LC_COLLATE or LANG just for the test command itself? Wrong again.

Next idea: was it a discrepancy between the dependencies installed in the container and those used on my local system (e.g. a version was different somewhere and hence an upstream bug was causing the problem)? Nope, that wasn’t it either.

Was there something in the Docker container runtime environment that was a bit dodgy? No, everything looked good.

What about the Python version? Did the version being used in GitLab-CI not match what was being used in the other environments? Unfortunately, everything matched.

Maybe the Debian release wasn’t right? For instance, I use Buster on my local development system, maybe I was using Stretch on the GitLab-CI system (you never know)? Hrm, another dead end.

Perhaps the Postgres version was too new? Could it be the case that some behaviour had changed in the collation? I mean, this was also really unlikely, but stuff like this happens and it’s worthwhile trying to remove all potential sources of error. After all:

When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth.

Sherlock Holmes, The Adventure of the Blanched Soldier

Unfortunately, this idea was also a dead end.

And then a penny dropped. What if the issue isn’t at the application or shell level? What if setting environment variables for the locale wasn’t a sufficiently fundamental change (somehow not happening “low enough down” in the system)? What if Postgres is using a different collation to that specified in the shell (e.g. via LANG)?

I tried installing Postgres in the Docker container with the LANG variable prefixed. I.e.

$ LANG=en_US.UTF-8 apt install -y postgresql

I will have gotten this idea from somewhere, however I can’t find the link anymore. This didn’t help things either, so the problem must be “further down” somehow.

Maybe Postgres really is where the problem is hiding?

Another penny that had been circling my head for about an hour or so dropped. Does Postgres define its own collation?

The answer: sort of. It seems to use the value from the operating system at the time of installation. In other words, setting the LANG variable at installation time (as tried in the previous section) was on the right track, but not quite there yet.

A bit more digging around online (mostly StackOverflow) produced the “handy command of the day”:

$ sudo -u postgres psql -c 'select datname, datcollate from pg_database;'

This shows the collation locale settings for each database in the Postgres installation. For example:

       datname        | datcollate
----------------------+-------------
 postgres             | en_NZ.UTF-8
 template1            | en_NZ.UTF-8
 template0            | en_NZ.UTF-8

On my local box this was en_NZ.UTF-8 (I’m a stubborn Kiwi living in Germany, which makes life interesting at times: I find all kinds of corner cases with locales), but in the Docker container it was C, which as far as I can tell is also the “POSIX” collation.

This is what my brain had been trying to tell me (from roughly 2 hours ago): one system is using C (a.k.a. POSIX) collation (which lists capital letters first3) and the other system was using English collation (lower case letters first). In other words: the Postgres installation on the GitLab-CI system was using the POSIX collation convention because as far as it was concerned, the LANG environment was C, however on all other systems I’d been testing on (my local development box and our Jenkins server) the LANG environment was some variation on en_*.UTF-8. Ok, I think I’ve started to understand the problem. We’re getting somewhere!

(Insert here a long diversion with me reading the specs for the LC_* variables, how collation works in glibc and in unicode. Yes, I even read the CLDR DTDs for unicode. Hey, I’m a geek!)

Aside: note that it’s possible to set the datcollate table entries directly in Postgres and hence change the collation settings dynamically. Fortunately doing that didn’t solve the problem for me, as that would have taken me down a very evil path.

Building a solution to the problem

Now, how to set that up properly and repeatably within a GitLab-CI Docker container?

After much wailing and gnashing of teeth, the solution I came up with was:

  • install the locales package (apt-get install -y locales)
  • set an English locale to use (echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen); en_US is more standard than en_NZ; I don’t want to confuse my colleagues too much
  • generate the locale just defined with the locale-gen command
  • lastly, update the locale at the system level (update-locale LANG=en_US.UTF-8) before installing Postgres via apt-get install -y postgresql4

In code (as part of the GitLab-CI configuration), this ended up looking like:

default:
    image: python:3.7-buster

stages:
    - test

tests:
    stage: test
    script:
        - apt-get update && apt-get -y install virtualenv libpython3-dev gettext libjpeg-dev libgraphviz-dev locales
        - echo 'en_US.UTF-8 UTF-8' >> /etc/locale.gen
        - locale-gen
        - update-locale LANG=en_US.UTF-8
        - apt-get -y install sudo postgresql
        ... more setup ...
        - make test

An interesting issue popped up with this solution because one should be able to set the locale at the system level via the localectl command, however this command isn’t available in Python Docker images (such as python:3.7-buster) because systemd isn’t used. This is why I had to use update-locale above.

The Debian and Arch Linux docs were very helpful in working out these details; they form an excellent resource when trying to solve such low-level problems.

Now that the value in the datcollate column is set in Postgres how we want (namely to en_US.UTF-8) we find that the tests now pass as expected, because we now have the sorting order that we expected to have in the beginning. Cool!

The interesting question now is: what was I trying to do before I started all of this???

I aim to please!

I hope that this post was helpful or useful! If you have any comments, questions or feedback, feel free to ping me on Mastodon or drop me a line via email.

  1. Why have Jenkins and GitLab? Well, I was in the process of migrating our CI infrastructure from Jenkins to GitLab-CI at the time, hence I was using both systems. 

  2. For those who might be wondering: fast ice isn’t ice that’s going really fast; it’s actually ice that’s stuck fast, i.e. not moving. 

  3. Some sources (e.g. the Postgres collation docs) refer to this ordering as the “Traditional C” or “POSIX” collation sequence. The Open Group and IEEE specifications imply the ordering where capitals precede lower case letters. This is also known as the ASCIIbetical order because it follows the “alphabetical” ordering used in the ASCII character set. 

  4. Note that even though the desired locale is defined, this doesn’t mean that setting it globally in the environment or locally for the test command (or even installation command) has the desired effect: datcollate stays set to “C”. Therefore it’s necessary to set the locale at the system level via update-locale before the postgresql package is installed. 

Support

If you liked this post and want to see more like this, please buy me a coffee!

buy me a coffee logo