When tsvector was working as expected on most hosts, but not one

One of the primary uses for FreshPorts is searching various characteristics of the ports. One highly used search feature is pkg-messages (background at pkg-message). A problem which has been vexing me for about 2 months was the lack of results on prod, but results which worked on dev. The issue was raised on GitHub. Initially, it was DDL related (adding a column to the ports table fixed that on test and stage). However, the problem persisted on prod.

Here is an example, which appears later when we get deeper into it:

SELECT DISTINCT id as port_id
  FROM ports
 WHERE pkgmessage_textsearchable  @@ websearch_to_tsquery($1) OR
       pkgmessage_textsearchable2 @@ websearch_to_tsquery($1)

On prod, nothing. On all other systems, we get a few hundred rows. Why?

The key difference between all these databases: the working databases were self-hosted (PostgreSQL 12.13 on FreeBSD 13.1); the problem database was hosted on RDS in AWS (PostgreSQL 12.11 on something…).

It was because of this difference that I feared RDS didn’t support what I was doing. I was wrong.

The problem was configuration. The problem wasn’t that it didn’t work on prod, but why did it work everywhere else?

I’ll give some background as the fields involved in the search, the queries, and how we fixed it. I learned something new yesterday and that’s what is prompting this blog post, so I don’t forget it.

In short, the column was defined with ‘english‘ as the regconfig, but the search query was using the database default value, which was ‘english‘ on some hosts and ‘simple‘ on the problem host.

If you want to skip to why, read So why did it work and not work?

Background

All three databases should contain the same data. That theory was tested by doing queries via ilike which would find the information in the raw data, when the full text search did not.

I first started using this full text search approach back in March 2021 (it was a similar approach to the problem issue, but not the precise one in question). That search was “which port installs this file” – which works without having the port installed. Perhaps your application is calling for that file and you don’t know what installs it. FreshPorts can find it for you. Details in Replacing a column search with a full text search solution.

After reviewing that post, it seems all of my tsvector solutions may require adjustment.

The table

There are two columns on the ports table, one which splits up path names, and now which does not. This table listing is greatly redacted to keep the information to a minimum.

freshports.devgit=# \d ports
                                                                                    Table "public.ports"
           Column           |           Type           | Collation | Nullable |                                                   Default                                                   
 id                         | integer                  |           | not null | nextval('ports_id_seq'::regclass)
 element_id                 | integer                  |           | not null | 
...
 pkgmessage                 | text                     |           |          | 
...
 pkgmessage_textsearchable  | tsvector                 |           |          | generated always as (to_tsvector('english'::regconfig, pkgmessage)) stored
 pkgmessage_textsearchable2 | tsvector                 |           |          | generated always as (to_tsvector('english'::regconfig, translate(pkgmessage, '/'::text, ' '::text))) stored
Indexes:
...
    "ports_pkgmessage_textsearchable2_idx" gin (pkgmessage_textsearchable2)
    "ports_pkgmessage_textsearchable_idx" gin (pkgmessage_textsearchable)
...

The original data is stored in pkgmessage and searches are preformed over the pkgmessage_textsearchable andpkgmessage_textsearchable2 columns.

The query

The main part of the query looks like this:

SELECT DISTINCT id as port_id
  FROM ports
 WHERE pkgmessage_textsearchable  @@ websearch_to_tsquery($1) OR
       pkgmessage_textsearchable2 @@ websearch_to_tsquery($1)

On the development, test, and staging hosts, we get results like this:

freshports.devgit=# SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
  100421 | /ports/branches/2022Q1/dns/dnsmasq
  100428 | /ports/branches/2022Q1/emulators/xsystem35
   14686 | /ports/head/sysutils/lmon
... etc

I am searching only over one column here, but that doesn’t affect the problem in question.

Running the same query on production yields no results:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('example');
 port_id | element_pathname 
---------+------------------
(0 rows)


freshports.org=> 

Why? Let’s check the raw data.

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage ilike '%example%';
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
... etc

Yes, there are items which match. Granted, that word might not stand on its own. So let’s check with leading and trailing spaces:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage ilike '% example %';
 port_id |                element_pathname                
---------+------------------------------------------------
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   71388 | /ports/branches/2022Q2/math/saga
   51088 | /ports/branches/2018Q4/emulators/qemu
   64192 | /ports/branches/main/editors/lazarus-qt5
   22159 | /ports/head/devel/boost_build
... etc

So, where to next?

Social media

I posted on social media and I received a few ideas via private messages.

My first idea was reindex table concurrently ports – that didn’t help.

Next, we checked Collate and Ctype settings. Both matched:

prod, has problem


                                     List of databases
      Name      |  Owner   | Encoding  |   Collate   |    Ctype    |   Access privileges   
----------------+----------+-----------+-------------+-------------+-----------------------
 freshports.org | postgres | SQL_ASCII | C           | C           | 


dev, no issues:


[pg02 dan ~] % psql -l
                                        List of databases
            Name             |    Owner     | Encoding  | Collate | Ctype |   Access privileges   
-----------------------------+--------------+-----------+---------+-------+-----------------------
 freshports.devgit           | postgres     | SQL_ASCII | C       | C     | 

All the same. Nothing there.

tsvector values

We started looking at tsvector next. Here are the values from both columns.

From Controlling Text Search:

to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration.

The following is the results of invoking to_tsvector on the pkg-message file for security/pond as found above in one of the result sets.

It shows us that ‘example’ is the 27th word in the document. However, notice that it is ‘exampl`, without the trailing ‘e’.

freshports.org=> select pkgmessage_textsearchable from ports where id = 34126;
                                pkgmessage_textsearchable                                                                                                              
----------------------------------------------------------------------------------------------------------
 '/tmp':32 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmpfs':9,18,24,30,31 
 'use':2 'yes':20
(1 row)


freshports.org=> select pkgmessage_textsearchable2 from ports where id = 34126;
                              pkgmessage_textsearchable2                                                                                                             
---------------------------------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14 
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmp':32 
 'tmpfs':9,18,24,30,31 'use':2 'yes':20
(1 row)


freshports.org=> 
[sourcecode]

In both columns (the collection of lexemes), we have the same 'exampl'.

Let's look at a database which runs this query with expected results (which have been adjusted to readable without scrolling).


freshports.devgit=# select pkgmessage_textsearchable2 from ports where id = 34126;
                           pkgmessage_textsearchable2                                                                                                             
---------------------------------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmp':32 
 'tmpfs':9,18,24,30,31 'use':2 'yes':20
(1 row)


freshports.devgit=# select pkgmessage_textsearchable from ports where id = 34126;
                             pkgmessage_textsearchable                                                                                                              
----------------------------------------------------------------------------------------------------------
 '/tmp':32 '1':10 '2':21 '3':33 '5':17 'client':6,36 'directori':25 'exampl':27 'follow':13,38 'instruct':40 'line':14
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'requir':8 'run':34 'tmpfs':9,18,24,30,31 'use':2 
 'yes':20
(1 row)


freshports.devgit=# 

So look at that. I see exampl (without the trailing e) – now that’s interesting.

Let’s try searching for the shortened word.

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('exampl');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
...

Hah! It’s finding that. Good. Progress. And not finished yet. Why does it work on one host, but not the other, given we have the same lexemes.

What’s the default_text_search_config setting on each host?

The configuration parameter default_text_search_config specifies the name of the default configuration, which is the one used by text search functions if an explicit configuration parameter is omitted. It can be set in postgresql.conf, or set for an individual session using the SET command. (taken directly from that documentation URL).

Also from the docs: “A text search configuration specifies all options necessary to transform a document into a tsvector: the parser to use to break text into tokens, and the dictionaries to use to transform each token into a lexeme.”

What is that setting on my hosts? On the problem host:

freshports.org=> show  default_text_search_config ;
 default_text_search_config 
----------------------------
 pg_catalog.simple

On the working hosts:

freshports.devgit=# show  default_text_search_config;
 default_text_search_config 
----------------------------
 pg_catalog.english
(1 row)

freshports.devgit=# 

There is a difference, and this turns out to be the cause of the problem.

However, as pointed out by Tom Lane’s reply to my post on the PostgreSQL General mailing list, that setting is not going to play well with my column definitions, both of with use this:

...to_tsvector('english'::regconfig...

Because ‘because “english” and “simple” will stem words differently.’ – OK, let’s look.

In short, the

Example tests

Here are the tests Tom provided:

regression=# select websearch_to_tsquery('english', 'example');
 websearch_to_tsquery 
----------------------
 'exampl'
(1 row)


regression=# select websearch_to_tsquery('simple', 'example');
 websearch_to_tsquery 
----------------------
 'example'
(1 row)

From that, it is clear that I should be using simple, not english.

Where are my settings set?

Settings can come from a configuration item or from the default value for the system.

The problem host:

freshports.org=> select setting, source from pg_settings where name = 'default_text_search_config';
      setting      | source  
-------------------+---------
 pg_catalog.simple | default
(1 row)

The working hosts:

freshports.devgit=# select setting, source from pg_settings where name = 'default_text_search_config';
      setting       |       source       
--------------------+--------------------
 pg_catalog.english | configuration file
(1 row)

Proof of concept

Let’s create a new column and index, in prod, and see how that goes.

Here we go, on the problem database, create a new field, based on simple, not english.

ALTER TABLE public.ports
    ADD COLUMN pkgmessage_textsearchable3 tsvector generated 
     always as (to_tsvector('simple'::regconfig, translate(pkgmessage, '/'::text, ' '::text))) stored;

Index it:

CREATE INDEX ports_pkgmessage_textsearchable3_idx
    ON public.ports USING gin
    (pkgmessage_textsearchable3)
    TABLESPACE pg_default;
CREATE INDEX

Try the problem query:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable3  @@ websearch_to_tsquery('example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
   51532 | /ports/branches/2019Q1/graphics/drm-legacy-kmod
... etc

Success. Thank you Mr Lane.

Looking at the values

Let’s have a close look at the tsvector values (I have wrapped the output for readability):

freshports.org=> select pkgmessage_textsearchable3 from ports where id = 34126;
                                                                                                                                                            pkgmessage_textsearchable3                                                                                                                                                            
-------------------------------------------------------------------------------
 '1':10 '2':21 '3':33 '5':17 'a':5,23 'and':37 'as':4 'client':6,36 'directory':25 
 'example':27 'follow':38 'following':13 'for':26 'in':15 'instructions':40 'line':14 
 'load':19 'loader.conf':16 'mount':22,28 'place':11 'pond':3,35,41 'require':8 'run':34 
 't':29 'the':12,39 'tmp':32 'tmpfs':9,18,24,30,31 'to':1 'use':2 'yes':20 'you':7
(1 row)

So why did it work and not work?

On both hosts, the column was being indexed using ‘english’, which converted ‘example’ to the value ‘exampl’.

In my query, my search parameter was also being converted, using the configuration defined on that host.

In production, the search criteria would be converted like this:

freshports.org=> select websearch_to_tsquery('example');
 websearch_to_tsquery 
----------------------
 'example'
(1 row)

That is, no change, and no match to the tsvector lexemes.

On the working hosts, this happened:

freshports.devgit=# select websearch_to_tsquery('example');
 websearch_to_tsquery 
----------------------
 'exampl'
(1 row)

So, the same converted value, which matches the tsvector lexemes which means the search succeeds.

In short, the column was defined as ‘english’ and we were searching using ‘simple’.

Fixing the original query

I could fix the original query by supplying regconfig when invoking to_tsvector. Here is the original problem query, slightly altered:

freshports.org=> SELECT id as port_id, element_pathname(element_id)
  FROM ports
 WHERE pkgmessage_textsearchable2  @@ websearch_to_tsquery('english', 'example');
 port_id |                           element_pathname                           
---------+----------------------------------------------------------------------
   34126 | /ports/head/security/pond
   74559 | /ports/branches/2015Q3/emulators/linux_base-c6
   60310 | /ports/branches/2020Q4/www/gitlab-ce
   38345 | /ports/head/www/gitlab
   46842 | /ports/branches/2018Q1/mail/postfix-sasl
   56061 | /ports/head/games/homura
   51532 | /ports/branches/2019Q1/graphics/drm-legacy-kmod
...etc

The original query used websearch_to_tsquery(‘example’);. That is why it didn’t work on prod. Because the default regconfig (‘simple’) didn’t match the column (‘english'(.

This query worked on the other hosts, because the default regconfig (‘english’) matched the column (‘english’);

Why store the tsvector values?

I was asked: why do you have tsvector columns in the table? As opposed to just having an index over a tsvector of the actual text column.

I think it was a progression thing. I started following a tutorial or something.

Instead of having an index over the column pkgmessage_textsearchable, as I have now, I could create an index such as to_tsvector(‘english’::regconfig, pkgmessage).

Perhaps that is for future work.

Edit 2022-12-19 – I tried a proof-of-concept. It failed. Initially. Then I added a new index.

See https://explain.depesz.com/s/eb6t – I’ll add more details here as I complete this task later.

edit 2022-12-23 – the index conversion was documented in Moving from storing the tsvector values to a tsvector index.

What’s next

Next, I have to decide if I want to modify my queries to explicitly specify regconfig or rely upon the configuration setting. What are the pros and cons of each? Please let me know your thoughts via the comments.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top