
"Big data" features coming in PostgreSQL 9.5


August 5, 2015

This article was contributed by Josh Berkus

PostgreSQL 9.5 Alpha 2 is due to be released on August 6. Not only does the new version support UPSERT, more JSON functionality, and other new features we looked at back in July, it also has some major enhancements for "big data" workloads. Among these are faster sorts, TABLESAMPLE, GROUPING SETS and CUBE, BRIN indexes, and Foreign Data Wrapper improvements. Taken together, these features strengthen arguments for using PostgreSQL for data warehouses, and enable users to continue using it with bigger databases.

This release is still called an "alpha" because the PostgreSQL developers don't feel that it's entirely stable yet, and want to reserve the right to remove features that can't be fixed before the beta release. Among others, this includes the TABLESAMPLE feature described below; as of this writing, there are still problems with that patch and it could be dropped from 9.5.

PostgreSQL as a big data database

Some readers may not think of PostgreSQL for big data, but the database system has a long track record of use in data warehouses. For a decade before Hadoop launched, PostgreSQL was the only pure open-source option for large data volumes and complex analytics. Today, it is still heavily used in mid-sized data warehouses and "data marts", meaning databases in the one to ten terabyte range. Big data and analytics users are a major subset of the PostgreSQL community, and thus its contributor base, resulting in the addition of new features for large databases with each release.

More prominently, the European Union has chosen PostgreSQL as a key part of a publicly funded analytics technology project in the EU's Seventh Framework Programme, or "FP7". This project aims to enhance the technology base of the EU. Funding from FP7, channeled through the consulting firm 2nd Quadrant, paid for two of the features covered below, as well as performance testing and review work on others.

In addition to core features, the PostgreSQL project has been a source of code for many big data startups over the years. These include Netezza, Greenplum, ParAccel, Truviso, Aster Data Systems, and CitusDB. Both Yahoo! Everest and Amazon Redshift are PostgreSQL forks as well. In July, startup PipelineDB released an open-source streaming database that was forked from PostgreSQL 9.4. Streaming databases are used to process huge amounts of incoming data. These startups are generally attracted to PostgreSQL because of its sophisticated query planner and executor, liberal license, and well-documented code.

As hardware gets bigger and faster, the definition of "mid-sized data warehouse" keeps getting bigger and user expectations continue to grow. The PostgreSQL project continuously adds new features and better performance for big data use cases in order to keep up. Version 9.5 includes a lot of "goodies" for users with big databases.

Faster sorts

One thing relational databases do a lot of is sorting data. The bigger the database, the more the performance of those sorts becomes the dominating factor in overall response time. Sort time is critical not just for query results, but also for building indexes, grouping data, and certain kinds of joins. To help with this, PostgreSQL 9.5 comes with an across-the-board speedup in sorting text and long numeric values, yielding 3X to 12X speed improvements for many users.

Peter Geoghegan of Heroku wrote a patch in January that made text sorts faster by using a technique called "abbreviated keys". Previously, PostgreSQL sorted all text purely by using glibc's strcoll() function, which allows the database to sort in whatever locale the user has chosen for their data. The problem with this is that strcoll() is expensive, taking hundreds of times more CPU time than comparing integers.

Converting the text values to binary keys using strxfrm() and sorting those is much faster, except that the binary keys can be very large. So Geoghegan hit on the technique of taking only the first eight bytes of each key produced by strxfrm(), comparing those, and then breaking any ties with a full strcoll() comparison. This speeds up sorts of all kinds, both in-memory quicksorts and on-disk tapesorts. One user tested 9.5 with their data warehouse; the new version reduced the time to index eighteen million text values from almost eleven minutes to less than one minute.
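As a rough illustration (with a hypothetical table and column), nothing changes at the SQL level; index builds and large ORDER BY operations on text columns simply pick up the abbreviated-key optimization in 9.5:

    -- index builds on text columns sort with abbreviated keys in 9.5
    CREATE INDEX idx_customers_last_name ON customers (last_name);

    -- large in-memory and on-disk sorts of text benefit the same way
    SELECT last_name FROM customers ORDER BY last_name;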

After Geoghegan's patch was committed, contributor Andrew Gierth wrote a patch that uses a similar technique to speed up sorts on PostgreSQL's NUMERIC data type. In Postgres, NUMERIC is an arbitrary-precision number value, supporting over 100,000 digits. This means that it can't be sorted as an integer or float, so Gierth's improvement was welcome.

However, there still may be some issues with the sort improvements. Some platforms, especially older versions of Solaris, have buggy versions of the strxfrm() function that prevent use of this optimization. The developers are discussing whether these platforms are old enough, or obscure enough, to disregard so that most users can get enhanced performance, or whether platform-specific code might be required.

TABLESAMPLE

One feature funded by FP7 was written by Petr Jelinek; it adds the SQL-standard TABLESAMPLE query. This clause returns a pseudo-random sample of the rows in the result set. When used with large tables, TABLESAMPLE lets users get a quick "glimpse" of the data so that they can do further analysis.

For example, imagine that I have a table of 10 million user profiles in JSON. I want to take a look at a handful of rows from this table so that I can find out what kinds of keys the JSON has, but I don't want just the rows at the beginning of the table. In 9.5, I can do this:

    SELECT * FROM user_profiles TABLESAMPLE SYSTEM ( 0.001 );

That query will return around 0.001% of the rows in the table, or about 100 rows. In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages and then returns all rows on those pages; it has the advantage of running in roughly constant time regardless of the size of the table. PostgreSQL 9.5 will also ship with the BERNOULLI sampling method, which is more rigorously random, but takes longer the larger the table is.
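For comparison, here is what sampling with BERNOULLI looks like, along with the standard REPEATABLE clause for getting the same sample back on every run; this is a sketch of the syntax as proposed, and could change if the patch is reworked:

    -- BERNOULLI examines every row individually, so it scans the whole table
    SELECT * FROM user_profiles TABLESAMPLE BERNOULLI ( 0.001 );

    -- REPEATABLE seeds the sampler so repeated queries return the same rows
    SELECT * FROM user_profiles TABLESAMPLE SYSTEM ( 0.001 ) REPEATABLE ( 42 );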

Currently, however, the TABLESAMPLE patch is under some intense review and rewriting. Tom Lane criticized the quality of the patch, and it remains to be seen if it will stay in 9.5.

CUBE, ROLLUP, and GROUPING SETS

In the SQL99 update of the SQL standard, the ANSI committee added a collection of grouping and reporting clauses to SQL to support the technology known as OLAP for "OnLine Analytical Processing". OLAP technologies, still used today, are ways to summarize and explore large data sets by grouping them in a multi-dimensional space. The three features added to SQL, ROLLUP, CUBE, and GROUPING SETS, provide different ways of summarizing data sets on several axes at once.

ROLLUP is the easiest to understand, since it simply provides subtotals and master totals for the whole set provided. For example, this query would provide counts for each city, then subtotals for each country, and finally a grand total of subscribers.

    SELECT country, city, count(*) 
    FROM lwn_subscribers
    GROUP BY ROLLUP ( country, city );

     country  |     city      | count 
    ----------+---------------+-------
    Australia | Brisbane      |   561
    Australia | Melbourne     |   302
    Australia | Perth         |   219
    Australia |               |  1082
    USA       | New York      |   591
    USA       | Portland      |   617
    USA       | San Francisco |   610
    USA       |               |  1818
              |               |  2900

CUBE is less immediately understandable, because it's mainly intended to feed multi-dimensional data into an external OLAP tool like Mondrian. It expresses all possible totals for all combinations of specified columns, otherwise known as a "combinatorial explosion". For example, this query would provide totals for each country and each subscription level, and each combination of country and level.

    SELECT country, level, count(*) 
    FROM lwn_subscribers
    GROUP BY CUBE ( country, level );

     country  |    level     | count 
    ----------+--------------+-------
    Australia | leader       |   140
    Australia | professional |   490
    Australia | starving     |   394
    Australia | supporter    |    58
    Australia |              |  1082
    USA       | leader       |   301
    USA       | professional |   765
    USA       | starving     |   602
    USA       | supporter    |   150
    USA       |              |  1818
              |              |  2900
              | leader       |   441
              | professional |  1255
              | starving     |   996
              | supporter    |   208

Finally, GROUPING SETS is sort of the "parent class" of both ROLLUP and CUBE. It allows you to manually select the subtotal and total groups and combinations to display, in case you need more specific output than ROLLUP or CUBE provide.

For example, if I wanted totals for each combination of city and level, then for each level alone, and then a grand total, I would write:

    SELECT city, level, count(*)
    FROM lwn_subscribers
    GROUP BY GROUPING SETS ((city, level),(level),());

         city      |    level     | count
    ---------------+--------------+-------
     Brisbane      | leader       |    94
     Melbourne     | leader       |    44
     New York      | leader       |   105
     Perth         | leader       |     2
     Portland      | leader       |    94
     San Francisco | leader       |   102
                   | leader       |   441
     Brisbane      | professional |   236
     Melbourne     | professional |   121
     New York      | professional |   250
     ...

The benefit of these query constructs for users is the ability to get totals, subtotals, and combinations in one query and one scan over the table, where in the past this would have required multiple queries against the database. The other advantage for PostgreSQL is that it can now be supported by sophisticated "business intelligence" tools that previously only worked with proprietary SQL databases.
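One wrinkle worth noting: the subtotal rows use NULL as a placeholder in the rolled-up columns, which is ambiguous if those columns can themselves contain NULLs. The SQL-standard GROUPING() function, also part of the 9.5 grouping-sets work, tells the two apart; a small sketch against the fictitious subscriber table used above:

    -- GROUPING(col) is 1 when the NULL in col marks a subtotal row,
    -- and 0 when it is an ordinary value from the data
    SELECT country, city, count(*),
           GROUPING(country) AS country_is_total,
           GROUPING(city)    AS city_is_total
    FROM lwn_subscribers
    GROUP BY ROLLUP ( country, city );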

This feature was developed by Gierth, and has caused some discussion because of issues with the PostgreSQL review process that almost caused it to not get committed. It's now in and ready for users to try out.

Note: the above subscription data is fictitious and not related to LWN's actual subscription data.

BRIN indexes

Chilean committer Álvaro Herrera developed another feature funded by FP7. He wrote it to solve one of the major problems with really big tables: indexing them. PostgreSQL supports large tables easily, especially append-only tables, where the user may not even notice that the table has grown to several terabytes. However, standard B+ tree indexes grow increasingly inefficient at large numbers of leaf nodes, primarily because they no longer fit in memory.

Herrera figured out that "close" counts in indexes as much as it does with horseshoes and grenades. He devised a new index type that would index values down to an arbitrarily large "block" of data rather than to an individual data page. Instead of having a leaf node pointing to each individual row, the new index has a range of values to be found in each block. When the table is queried, the index is searched for ranges that correspond to the supplied filter conditions, and then the individual blocks are sequentially scanned. This approach is similar to how data is organized in column-store tables.

This new index type is called "BRIN" for "Block-Range INdex". These indexes are much smaller than conventional B+ trees — as little as 1% of the size of the standard indexes. This makes them much faster, since they usually fit in memory, and sometimes even in the CPU cache.

There are some drawbacks to BRIN indexes, though. They can't be used to enforce uniqueness or support keys. BRIN works much better on values that are incrementally increasing with each row inserted, like timestamps and incrementing integers. Most of all, BRIN indexes handle updates and deletes on the underlying table inefficiently, so they're best used with append-only tables. These restrictions would seem to make BRIN indexes a niche tool, except that large, continuously increasing append-only tables are a common use case. For example, any table used to hold a website log or an audit history can benefit.
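Creating a BRIN index uses the ordinary CREATE INDEX syntax with the new access method; the table below is a made-up example of the append-only, ever-increasing case where BRIN shines, and pages_per_range is the storage parameter controlling how many pages each summarized block covers (128 is the default):

    -- hypothetical append-only log table
    CREATE TABLE access_log (
        logged_at   timestamptz NOT NULL,
        client_addr inet,
        request     text
    );

    -- BRIN index on the ever-increasing timestamp column
    CREATE INDEX access_log_logged_at_brin
        ON access_log USING brin (logged_at)
        WITH (pages_per_range = 128);

    -- range queries over recent data can use the BRIN index
    SELECT count(*) FROM access_log
    WHERE logged_at >= now() - interval '1 day';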

Foreign Data Wrappers

In the last couple of years, PostgreSQL has gained adoption as part of hybrid solutions involving several databases because of its Foreign Data Wrapper (FDW) feature. These wrappers allow users to link to external data in other databases, whether PostgreSQL or other platforms, and query them as if they were local tables. This has been called "data federation". For version 9.5, several developers have been hard at work on enhancements to the FDW capability.

One such enhancement is primarily administrative: IMPORT FOREIGN SCHEMA. In prior versions, if you wanted to link every table in a target remote database, you had to write individual CREATE FOREIGN TABLE statements for each one, which was tedious to do and annoying to maintain. Ronan Dunklau and Michael Paquier fixed this by implementing a command that creates all tables in the remote database as foreign tables so that you can link "the whole database" in one command.
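A sketch of what that looks like with postgres_fdw, assuming the remote server still needs to be defined first (the server, host, schema, and credential names here are placeholders):

    -- one-time setup: the extension, a remote server, and a user mapping
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER warehouse FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'warehouse.example.com', dbname 'sales');
    CREATE USER MAPPING FOR CURRENT_USER SERVER warehouse
        OPTIONS (user 'report', password 'secret');
    CREATE SCHEMA remote_sales;

    -- new in 9.5: link every table in the remote "public" schema at once
    IMPORT FOREIGN SCHEMA public
        FROM SERVER warehouse
        INTO remote_sales;

IMPORT FOREIGN SCHEMA also accepts LIMIT TO and EXCEPT lists of table names for partial imports.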

The other major FDW improvement, JOIN pushdown, was introduced by KaiGai Kohei of NEC. Together with the new "custom scan" API, this allows FDWs to perform parts of queries, like index scans and joins between two tables, on the remote database before returning the result. Pushdown makes data federation much more scalable because it no longer requires sending entire tables across the wire for many common operations.

To take advantage of either of these features, though, the FDW driver for the external data source needs to support them. The postgres_fdw driver, which supports PostgreSQL-to-PostgreSQL connections, will support both, but many of the FDW drivers won't get them added for a while, if at all. Regardless, the improvements in FDWs make it more possible for PostgreSQL to be the interface for large data stores even where it is not the backend for them.

Other features

Several other features in PostgreSQL 9.5 will be of interest to users with large databases. PostgreSQL now makes more effective use of larger allocations of memory in "shared_buffers", by shrinking the array of page references kept by each process. GiST indexes, used for geographic searches, now support "index-only scans" for faster queries over large append-only tables. The vacuumdb utility now has a parallel, multi-process mode for doing maintenance on multiple large tables at once. Finally, UNLOGGED tables can be changed into regular tables, allowing users to build a data set and make it durable as the last step. UNLOGGED tables are non-durable tables that are not written to the transaction log, and are often used in data loading. This last feature was a Google Summer of Code project.
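As a brief sketch of the load-then-harden pattern that last feature enables (table and file names are hypothetical):

    -- load into a non-durable table, skipping transaction-log overhead
    CREATE UNLOGGED TABLE staging_events (
        event_time timestamptz,
        payload    jsonb
    );
    COPY staging_events FROM '/tmp/events.csv' (FORMAT csv);

    -- new in 9.5: make the table durable once the bulk load is done
    ALTER TABLE staging_events SET LOGGED;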

Outside of the core project, two tools have been updated that increase the size of databases PostgreSQL can handle. CitusData released version 1.3 of its cstore_fdw project, which adds a column-store option for tables. Column stores both compress data and make large aggregation queries, like counts and totals, much faster. The company also released version 1.2 of pg_shard, a tool that "shards" a table by distributing it across a set of database nodes using a hash key. This allows table scans to be run on multiple machines in parallel.
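For the curious, using cstore_fdw looks roughly like the following; this is a sketch based on the project's documented usage pattern, and the compression option is from memory, so check the cstore_fdw README for the current syntax:

    -- cstore_fdw exposes column-store tables through the FDW interface
    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE page_views (
        viewed_at timestamptz,
        url       text,
        views     bigint
    ) SERVER cstore_server
      OPTIONS (compression 'pglz');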

Native parallel query, the ability to use multiple cores on the same machine to execute a large query, has been an active project for two years. Unfortunately, the technical challenges proved more difficult than the team, led by EnterpriseDB programmer Robert Haas, had expected, and no parallel query features have yet been committed. Parallel table scan is under active development for 9.6.

Overall, PostgreSQL continues to hold its own as the "good enough" big data database for many users, by both improving performance and adding new analytical features. We can expect to see larger PostgreSQL databases under 9.5 than we have before. The project also shows some possibility of expanding its ability to handle even bigger databases through federation and sharding, and via numerous forks. We'll have to wait to find out what 9.6 offers in this area.

PostgreSQL 9.5 Alpha 2 will be available on August 6th from PostgreSQL.org.



"Big data" features coming in PostgreSQL 9.5

Posted Aug 5, 2015 20:13 UTC (Wed) by xtifr (guest, #143) [Link]

BRIN indexes are niche! It's just a fairly common and very important niche. :)

Faster indexing and, to a lesser extent, BRIN both sound like features that will seriously benefit most users, not just Big Data.

PostgreSQL used to have the reputation of being the slow-but-reliable FOSS dbms. The reliable part may have taken a minor ding with the recent multixact problems, but I think they can get over that. I think the slow part is already much less true than it used to be, but I still approve of any and all speedups they can add. If they became the faster-and-more-reliable FOSS dbms, I wouldn't mind one bit!

(I really hope the planned Parallel Table Scan thing works out well, and provides a decent performance boost as well.)

"Big data" features coming in PostgreSQL 9.5

Posted Aug 5, 2015 20:56 UTC (Wed) by kleptog (subscriber, #1183) [Link]

> BRIN indexes are niche! It's just a fairly common and very important niche. :)

Yes! One thing I saw in the initial discussions, but am not seeing mentioned anywhere, is that the theory behind BRIN indexes should be able to be applied to support Bloom filters. And they are seriously useful, especially for high-cardinality columns, which sort really badly. Btrees don't do nicely on those kinds of columns. I'm hoping that they haven't been forgotten.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 6, 2015 22:52 UTC (Thu) by xtifr (guest, #143) [Link]

Er, I meant to say, "faster sorting" sounds like a feature for everyone, not, "faster indexing". In case it wasn't obvious. Had indexes on the brain.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 5, 2015 20:19 UTC (Wed) by post-factum (subscriber, #53836) [Link]

Does LWN use PostgreSQL, or were those purely fictitious examples?

"Big data" features coming in PostgreSQL 9.5

Posted Aug 5, 2015 20:28 UTC (Wed) by corbet (editor, #1) [Link]

Yes and yes.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 6, 2015 5:51 UTC (Thu) by JdGordy (subscriber, #70103) [Link]

Scrolled past the article as I'm not really interested in postgres... saw the tables and that Melbourne was behind Brisbane (which is pretty unlikely given the relative size of the cities) and wanted to make sure my location was correct in case I was registered as **shudder** Sydney or something... and then...

"Note: the above subscription data is fictitious and not related to LWN's actual subscription data." :)

Is Josh an aussie expat?

"Big data" features coming in PostgreSQL 9.5

Posted Aug 6, 2015 18:35 UTC (Thu) by jberkus (guest, #55561) [Link]

I'm not.

However, LWN has a lot of Australian readers, and while I was writing this article, a bunch of my friends were at PyCon.AU and DjangoCon.AU. Hence the use of Australia as the 2nd set of cities.

FWIW, the data was generated using skewed random functions, so it's inherently meaningless.

abbreviated keys for BYTEA?

Posted Aug 6, 2015 7:56 UTC (Thu) by zack (subscriber, #7062) [Link]

From the text (and the patch diff, but I'm not fluent in postgres internals) it is not clear to me whether Gierth's improvement means that BYTEA (byte array) values will also benefit from abbreviated keys.
Does any fellow LWN reader know?

abbreviated keys for BYTEA?

Posted Aug 6, 2015 9:53 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

Andrew's patch is specific to numeric.

I guess you could devise something for bytea as well, but it'd have to look a bit different. Actually it'd be much closer to the abbreviated key logic for text than to numeric. Just without having to care about locales. With numeric you have to care about NaN and such.

Do you regularly sort/index large amount of bytea values?

abbreviated keys for BYTEA?

Posted Aug 6, 2015 13:19 UTC (Thu) by zack (subscriber, #7062) [Link]

> Andrew's patch is specific to numeric.

Thanks for your answer!

> Do you regularly sort/index large amount of bytea values?

I'm storing lots of checksums (of various kinds: sha1, sha256 for now), in the order of a few billion entries.

I haven't yet firmly chosen the postgres datatype to do that.

On the one hand, I'm inclined to implement a custom data type right away. It is my understanding that while doing that one can plug into the new sort support facilities that give the benefits of abbreviated keys (right? :-)).

On the other hand, if an appropriate built-in data type (such as BYTEA or its variants) already has support for abbreviated keys, that would be a good incentive to start with it, and migrate to a custom data type only later.

abbreviated keys for BYTEA?

Posted Aug 6, 2015 19:23 UTC (Thu) by jberkus (guest, #55561) [Link]

I guess I don't understand why you'd care that much about abbreviated key sorting on checksums? It's not like you'd care about sorting them. I guess just because of index build time?

If so, well, "patches welcome". Probably what you'd want to do with BYTEA is compare the first 8 bytes, sort, then compare the full values to break ties.

On the other hand, you could just store the checksums as NUMERICs, if you are fine with converting to hex and back.

abbreviated keys for BYTEA?

Posted Aug 7, 2015 18:58 UTC (Fri) by zack (subscriber, #7062) [Link]

> I guess I don't understand why you'd care that much about abbreviated key sorting on checksums? It's not like you'd care about sorting them. I guess just because of index build time?

Index (and specifically b-tree) build time and maintenance are my main concerns, yes.

But I was also under the impression that abbreviated keys are relevant also for (b-tree) index lookups and uniqueness constraint verifications, due to the comparisons needed to get down to the actual indexed values, no matter how shallow the index is. Maybe that's a wrong impression of mine?

abbreviated keys for BYTEA?

Posted Aug 7, 2015 21:17 UTC (Fri) by jberkus (guest, #55561) [Link]

Yes, they are related. Although if you're just trying to get an exact match on a checksum, you might consider using GIN indexes instead. For a simple scalar, GIN indexes in Postgres behave a lot like hash indexes.

Anyway, I encourage you to bring up the idea of BYTEA on a PostgreSQL mailing list. Nobody's opposed to extending abbreviated keys further, we just ran out of time for 9.5.

Tablesample with proportion really constant time?

Posted Aug 6, 2015 9:34 UTC (Thu) by epa (subscriber, #39769) [Link]

Which will return around 0.001%, or 100, of the rows in the table. In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages, and then returns all rows on those pages, and has the advantage in running in constant time regardless of the size of the table.
Surely not! 0.001% of a million-row table must take longer to fetch than 0.001% of a ten-row table. It would run in constant time if you specified a fixed number of rows, but not for a proportion of the total size.

Tablesample with proportion really constant time?

Posted Aug 6, 2015 11:50 UTC (Thu) by dskoll (subscriber, #1630) [Link]

I think it means that if you pick M rows out of N with N much bigger than M, the time is dependent on M only and not on N. At least, that's how I read it.

It's not constant wrt the percentage of rows, of course, because in that case M depends on N.

Tablesample with proportion really constant time?

Posted Aug 6, 2015 18:47 UTC (Thu) by jberkus (guest, #55561) [Link]

That's right.

And it's not precisely constant time; it will take longer to pull 100 rows out of a billion row table than a million row table. However, the increase in time will be incremental (and small) instead of a multiple of the original request, since we're just looking up data by pageID.

For example, I tested SYSTEM returning 100 rows from a 100,000-row table vs. a million-row table. Regardless of which table I used, the difference in request time was below significance thresholds. However, with BERNOULLI, the 100,000-row table took around 5ms, whereas the million-row table took around 14ms.

Tablesample with proportion really constant time?

Posted Aug 7, 2015 9:32 UTC (Fri) by epa (subscriber, #39769) [Link]

Thanks. I just meant that if you want constant time, you need to specify something other than the '0.001' proportion in the reviewer's example. It would have to be tablesample (1000 rows) or something similar, I'm not sure of the exact syntax.

Tablesample with proportion really constant time?

Posted Aug 7, 2015 15:29 UTC (Fri) by smurf (subscriber, #17840) [Link]

AFAIU the random sampling itself (i.e. deciding which entries to take) would take constant time. Retrieving the actual data, obviously not.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 7, 2015 17:12 UTC (Fri) by nix (subscriber, #2304) [Link]

Hm. The glibc folks, last I heard, were mostly under the impression that strxfrm() was pretty much useless (as was I, to be honest). Now that a use has been found, someone should drop a note to the libc-alpha list to inform them that it is not as useless as all that!

(e.g. Paul Eggert discussing a failed attempt to use it in GNU sort in <https://sourceware.org/ml/libc-alpha/2014-11/msg00673.html>.)

I wonder why it was too slow for GNU sort while simultaneously being fast enough for PostgreSQL? It's not like it's sped up noticeably in the intervening years.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 7, 2015 21:20 UTC (Fri) by jberkus (guest, #55561) [Link]

I think the difference is the abbreviated keys approach.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 8, 2015 6:01 UTC (Sat) by kleptog (subscriber, #1183) [Link]

I think it also has to do with the fact that in PostgreSQL the abbreviated keys approach allows the removal of a lot of indirect function calls from tight loops. In specialised code that knows beforehand that it's sorting text data replacing strcoll() with strxfrm() might not be that much of a win. But in the general PostgreSQL sorting code there is always an extra layer of indirection (for example TEXT values internally are not null-terminated, so you have to compensate for that).

"Big data" features coming in PostgreSQL 9.5

Posted Aug 8, 2015 7:24 UTC (Sat) by petergeoghegan (guest, #84275) [Link]

strxfrm() blobs are usually a little over 3 times as large as the original strings with glibc. As I go into in my blog post about abbreviated keys, this is because there are multiple "levels" (at least 3) in the blob, demarcated by sentinel bytes, whose order relative to each other relates to how heavily some aspect of a code point should be weighed. For example, with Latin scripts, primary alphabetical ordering is represented at the primary level (case and punctuation are represented at subsequent levels). You get a far higher concentration of entropy in the first 8 bytes than (say) the last 8 bytes. Sometimes that concentration is much higher than you'd expect, since (for example) whitespace and diacritics are also not represented at the primary weight level. Also, by just comparing the first 8 bytes, a significant amount of pointer chasing is avoided (other indirection is avoided too).

The idea that strxfrm() is not generally useful seems dubious, even leaving aside a technique like abbreviated keys. It's in the C standard. And certainly, ICU offers something that's similar but very much more advanced. The glibc docs strongly suggest using strxfrm() where the space overhead is acceptable during a sort of a non-trivial number of strings.

"Big data" features coming in PostgreSQL 9.5

Posted Aug 9, 2015 12:20 UTC (Sun) by nix (subscriber, #2304) [Link]

I think the abbreviation is helpful but perhaps not the primary reason why strxfrm() works here but not for sort(1). The biggest problem with strxfrm() is that the blobs are so much larger than the input that the dcache hit simply ruins builds and comparisons using it: you blow the dcache building the blobs and then blow it again using them. Hence the cutover point from cache-dominated to memory-dominated comes much earlier when you're using strxfrm() blobs than when you're using straight strcoll(), which is a killer for things like sort(1) which are massively affected by cache, and repeatedly so, and tries to avoid hitting the disk.

Index building... is of a different order. The memory hierarchy is less important, because everything is hitting the disk anyway, and because you abbreviate the blobs their horrendous size is reduced to something well below one cacheline for every blob.

(This has been a genuine guess: no numbers or experiments were conducted and no strxfrm() blobs were harmed in the making of this post. But I have tried to use strxfrm() before, myself, long ago, and come to similar conclusions back then, though it was the Solaris strxfrm() bugs that really killed that idea.)

"Big data" features coming in PostgreSQL 9.5

Posted Aug 16, 2015 0:24 UTC (Sun) by ron.dunn (guest, #104059) [Link]

Ajilius is a data warehouse automation product that builds star schemas for PostgreSQL and EnterpriseDB. We've seen great performance gains, processing large dimensions, from the indexing and sorting enhancements. The OLAP functions are nice, but possibly too late as their value has been subsumed by the in-memory engines of BI products like Qlik and PowerBI.

