
Postgres 10, Performance and You

Postgres 10 is out this year, with a whole host of features you won't want to miss. Join the Heroku data team as we take a deep dive into parallel queries, native JSON indexes, and other performance-packed features in PostgreSQL. Whether you've got no idea what Postgres version you're using or you had a bowl of devops for dinner, you won't want to miss this talk.

Gabe Enslein

April 18, 2018

Transcript

  1. Why PostgreSQL is great
     • Native JSON support
     • Native hash support
     • Strict data integrity commitment and enforcement
     • Continued excellence in subqueries
     • Numerous extensions for data flexibility
     • Standards built in natively
  2. How PostgreSQL 10 got even better
     • Native table partitioning
     • Hash index resiliency
     • Native full-text search for JSON and JSONB
     • Parallel query capabilities for indexes, outer joins
     • Query analyzer accuracy with parallel changes
  3. Terms/phrases you'll hear in this talk
     • Bloat
     • Vacuum
     • Sequential scan
     • Replication
     • Pruning
     • Statistics
     If you want to learn more, see me after.
  4. Who is this Gabriel character?
     • Joined Heroku Data late 2016
     • Sinatra and RoR API services
     • Focused on Postgres High Availability and data growth
     • PostgreSQL, MySQL, SQL Server, Cassandra, Couchbase… (please, no more)
  5. Super Secret Project: social interest platform
     Each person is a Customer:
     • Customers will have profiles
     • Each customer can post from areas of interest
       ◦ Favorite quotes
       ◦ TV shows
       ◦ Movies
       ◦ Foods
       ◦ Etc.
     • Imagine millions of users and hundreds of thousands of posts per day
  6. Great idea to spread uplifting thoughts on the Internet!
     Sharing Inspirational Quotes

     class CreateQuotes < ActiveRecord::Migration[5.1]
       def change
         create_table :quotes do |t|
           t.integer :character_id
           t.text :quote
           t.timestamps
         end
       end
     end
  7. Uh-oh, we have 750 million quotes and counting
     • Each row is about 0.5 KB
     • Our DB has 16 GB of RAM
     • Most quotes are never updated after their first week posted
     • We usually only want quotes posted or updated in the last week
  8. Uh-oh, we have 750 million quotes and counting
     Quotes table size: 375 GB. That is way more memory than we have, even for 1/20 of the table.
     Here is what Postgres will do:
     • It pulls each record from disk and attempts to put it into memory.
     • Postgres does this because memory is fast, and if you needed a record once, you might need it again soon.
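     A minimal sketch of the access pattern driving all of this, using the quotes table from earlier (the exact time window is assumed):

     -- The hot query: only last week's quotes, out of 750M rows. Unaided,
     -- Postgres walks the heap and churns the 16 GB buffer cache with rows
     -- it will never need again.
     SELECT *
     FROM quotes
     WHERE updated_at >= now() - interval '7 days';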
  9. But the dataset is too big for memory; what happens now?
     What happens if you're putting things into memory that you don't use?
     • Postgres will use temp disk space to manage large queries
     • It caches each whole record, then creates pointers to disk for data overflow
     • Postgres can stop accepting connections if we run too many of these queries at the same time
  10. But the dataset is too big for memory; what happens now?
      Who likes fast queries? Who likes to keep the database up? Who likes having database connections?
  11. Great idea to spread uplifting thoughts on the Internet!
      Sharing Inspirational Quotes: Native Partitioning!

      class CreateQuotes < ActiveRecord::Migration[5.1]
        def change
          create_table :quotes do |t|
            t.integer :character_id
            t.text :quote
            t.timestamps
          end
          execute "ALTER TABLE quotes PARTITION BY RANGE (updated_at)"
        end
      end
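      One caveat: Postgres 10 has no ALTER TABLE ... PARTITION BY. The partition key must be declared when the table is created, and a Postgres 10 partitioned parent cannot carry a primary key. A minimal sketch of the equivalent raw DDL (partition names and ranges are illustrative):

      -- The partition key is declared at CREATE TABLE time.
      CREATE TABLE quotes (
          id           bigserial,
          character_id integer,
          quote        text,
          created_at   timestamptz NOT NULL,
          updated_at   timestamptz NOT NULL
      ) PARTITION BY RANGE (updated_at);

      -- One partition per week; rows outside every partition's range are rejected.
      CREATE TABLE quotes_2018_w16 PARTITION OF quotes
          FOR VALUES FROM ('2018-04-16') TO ('2018-04-23');

      -- Retiring old data becomes a DROP instead of a bulk DELETE.
      DROP TABLE quotes_2017_w16;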
  12. Native Table Partitioning
      Why use native partitioning?
      • Entirely avoid the VACUUM overhead caused by a bulk DELETE
      • When the most heavily accessed rows of a table sit in one partition or a small number of partitions, lookups are faster
      • Seldom-used data can be migrated to cheaper, slower storage media
  13. Native Table Partitioning
      Why use native partitioning?
      • Query performance can improve dramatically: a sequential scan of one partition replaces index and random-access reads
      • Bulk loads and deletes can be accomplished by adding or removing partitions
      • Native partitioning is now integrated into Postgres internal operations
      • pg_partman is still supported, but it's not built into 10's new optimizations
  14. Native Table Partitioning: noteworthy items
      • Partitioning isn't natively supported in ORMs; some gems have attempted to integrate partitioning into ActiveRecord
      • Partitions can be created from subsections of the original table (month, year, quote text, etc.)
      • Subpartitions can be made on the tables from the original partition as well
  15. Competing with delivery businesses
      We will compete with home delivery companies, direct to Customers from the platform.
      Customer shipping addresses:

      class CreateAddresses < ActiveRecord::Migration[5.1]
        def change
          create_table :addresses do |t|
            t.integer :customer_id
            t.text :street_address
            t.text :city
            t.text :state
            t.text :zip_code
            t.text :country_code
          end
        end
      end
  16. Competing with delivery businesses
      Customer shipping addresses. But there are problems!
      • Our delivery drivers have a hard time finding addresses
      • Addresses aren't uniform across North America
      • Customer satisfaction is very important; special instructions should be considered
  17. Competing with delivery businesses
      Customer shipping addresses: how can we store all of this seemingly unstructured data?
      Introducing hstore:
      • hstore is not new
      • It's a key/value store in Postgres
      Let's implement the feature in hstore to track extra delivery info.
  18. Competing with delivery businesses
      Customer shipping addresses:

      class CreateAddresses < ActiveRecord::Migration[5.1]
        def change
          enable_extension "hstore"  # the hstore extension must exist first
          create_table :addresses do |t|
            t.integer :customer_id
            t.text :street_address
            t.text :city
            t.text :state
            t.text :zip_code
            t.text :country_code
            # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
            t.hstore :usage
          end
        end
      end
  19. Competing with delivery businesses
      Customer shipping addresses: wait, why is the address lookup so slow?
      The addresses table has 250 million records:
      • Customers have multiple addresses
      • People in apartments move frequently but don't remove old ones
      Our drivers don't want to wait on slow updates out on the road. Luckily we can add an index and really improve lookups!
  20. Competing with delivery businesses
      Customer shipping addresses:

      class CreateAddresses < ActiveRecord::Migration[5.1]
        def change
          enable_extension "hstore"
          create_table :addresses do |t|
            t.integer :customer_id
            t.text :street_address
            t.text :city
            t.text :state
            t.text :zip_code
            t.text :country_code
            # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
            t.hstore :usage
          end
          execute "CREATE INDEX addresses_usage ON addresses USING HASH (usage)"
        end
      end
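      A hash index only accelerates whole-value equality lookups, so it pays off for queries shaped like this minimal sketch (the hstore values are illustrative):

      -- Hash indexes serve the = operator only; this compares the whole
      -- hstore value against the stored one.
      SELECT street_address, city
      FROM addresses
      WHERE usage = 'primary=>true, type=>home'::hstore;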
  21. But wait... it broke. What happened?
      Hash Index resiliency, Postgres 9.x. Customer shipping addresses:

      ERROR: could not read block 32570 of relation base/16390/2663:
      read only 0 of 8192 bytes
  22. Hash Index resiliency, Postgres 9.x
      Why the resiliency issues with hash indexes?
      • Not treated as first-class citizens
      • Not replicated
      • Hard to maintain
      • Misleading failure modes (XX001, XX002)
      • Performance was generally poor under heavy load
  23. Hash Index resiliency, Postgres 9.x
      Customer shipping addresses:
      • If customers add/update addresses frequently, the hash index becomes stale
      • Replicas/followers won't have the index at all
      • The index has to be rebuilt if the database crashes (OS goes down, power outage, etc.)
  24. Hash Index resiliency, Postgres 9.x
      Customer shipping addresses:
      • Recreating large hash indexes can be time-consuming
      • Be careful not to rely on hash indexes; severely outdated indexes can imitate corruption (XX001, XX002)
      What's the solution?!
  25. Hash Index resiliency, Postgres 10
      Crash-safe, replicable hash indexes:
      • Now first-class citizens
      • Crash safe
      • Replicated
      • Reduced locking during bucket splits
      • Faster lookups
      • More even index growth
      • Single-page pruning
  26. Hash Index resiliency, Postgres 10
      Customer shipping addresses:

      class CreateAddresses < ActiveRecord::Migration[5.1]
        def change
          enable_extension "hstore"
          create_table :addresses do |t|
            t.integer :customer_id
            t.text :street_address
            t.text :city
            t.text :state
            t.text :zip_code
            t.text :country_code
            # Unstructured data, differing between locales (US vs. EU vs. CA, etc.)
            t.hstore :usage
          end
          execute "CREATE INDEX addresses_usage ON addresses USING HASH (usage)"
        end
      end
  27. Hash Index resiliency, Postgres 10
      Customer shipping addresses:
      • The usage index will persist after a crash or failure
      • Faster lookups on the hstore index values
      • Reduced locking alleviates concurrency issues: multiple edits to a single address in quick succession are more likely to succeed
      • Replicas/followers will have the indexes
  28. Curated Content: Characters of Interest
      Interest Outlet: our favorite characters
      • Hundreds of thousands of TV shows and movies
      • Each show or movie can have many characters (some have hundreds)
      • Varying amounts of information may be acquired over time about each character or show
  29. Curated Content: Characters of Interest
      Interest Outlet: our favorite characters. Specific information for each character of interest:
      • Genres
      • Catchphrases
      • Famous accessories (items, weapons, clothes)
      We need an unstructured way of keeping that data colocated.
  30. Curated Content: Characters of Interest
      Storing JSON in Postgres

      class CreateCharacters < ActiveRecord::Migration[5.1]
        def change
          create_table :characters do |t|
            t.integer :media_id
            t.text :name
            t.jsonb :metadata
          end
        end
      end
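      A hypothetical metadata payload for one character (the names and values are invented for illustration); because the column is jsonb, sparse or divergent attributes across characters are fine:

      -- Any shape of document fits in the jsonb column.
      INSERT INTO characters (media_id, name, metadata) VALUES
        (42, 'Thor',
         '{"genres": ["action", "fantasy"],
           "catchphrases": ["Another!"],
           "accessories": ["hammer", "cape"]}');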
  31. Curated Content: Characters of Interest
      Storing JSON in Postgres:
      • Millions of characters; JSON blobs are much bigger than single-line quotes
      • The dynamic, unstructured nature means dynamic sizes for each record
      • How can we keep unstructured data without having to parse it all?
  32. Curated Content: Characters of Interest
      Storing JSON in Postgres: multiple tables? Many relationships? Partitions? Is there something better?
  33. Full-text JSON searching, Postgres 10
      Desire for searching JSON:
      • Store whole request bodies for inspection
      • Native indexing in Postgres
      • Allows for more NoSQL-style operations
        ◦ Dynamic schema flexibility
      • Development agility in initial stages
  34. Full-text JSON searching, Postgres 10
      Curated Content: Characters of Interest. Most importantly: unstructured data flexibility!

      class CreateCharacters < ActiveRecord::Migration[5.1]
        def change
          create_table :characters do |t|
            t.integer :media_id
            t.text :name
            t.jsonb :metadata
          end
          execute <<-SQL
            CREATE INDEX characters_metadata
            ON characters USING gin (to_tsvector('english', metadata));
          SQL
        end
      end
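      Querying that index is what Postgres 10 makes possible: to_tsvector over a whole jsonb document is new in 10, and it walks only the JSON string values. A minimal sketch of a lookup it serves (the search term is illustrative):

      -- The GIN index above matches this expression, so the full-text
      -- search does not have to parse every document.
      SELECT name
      FROM characters
      WHERE to_tsvector('english', metadata) @@ to_tsquery('english', 'hammer');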
  35. Full-text JSON searching, Postgres 10
      Curated Content: Characters of Interest. Benefits of JSON full-text searching:
      • All data stays colocated
      • Indexes can help filter data based on existing or missing JSON keys
      • More perspectives on character behavior from the existing data: genres, strengths, weaknesses, skills (hammers, anyone?)
  36. Revisiting Quotes: another feature request
      I want Throwback Quotes!
      • People want to be re-inspired: "On this day…"
      • Replay a stream of older quote posts
      • The stream should include all posters I follow or like, not just my posts
      • Even getting last month's posts seems slow, though; there are so many
  37. Revisiting Quotes: I want Throwback Quotes
      Postgres 10 introduces Parallel Query support for more query types:
      • Bitmap Heap Scan
      • Gather Merge
      • Merge Joins
  38. Postgres 9.6 Parallel Query: Parallel Sequential Scans
      Why sequential scan?
      - The main table of a query has to be scanned
      - The scan has to be divisible into parts for the workers
      - Postgres 9.6 only supported this directly on tables
      - This produced sequential scans, which can be very slow
  39. Postgres 10 Parallel Query optimization: Parallel Bitmap Heap Scans
      Improvements made:
      - One process now scans the index of the main table
      - It builds the list of heap pages to check in memory
      - The scan of those heap pages can then be split across parallel workers
      - Loads less data, requires less caching
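      An illustrative plan shape (not captured output; it assumes an index on quotes.character_id) showing where the parallel bitmap heap scan lands:

      -- Sketch only: one process builds the bitmap from the index, then
      -- workers divide the matching heap pages.
      EXPLAIN SELECT count(*) FROM quotes WHERE character_id < 1000;
      --  Finalize Aggregate
      --    ->  Gather
      --          Workers Planned: 2
      --          ->  Partial Aggregate
      --                ->  Parallel Bitmap Heap Scan on quotes
      --                      Recheck Cond: (character_id < 1000)
      --                      ->  Bitmap Index Scan on quotes_character_id_idx
      --                            Index Cond: (character_id < 1000)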
  40. Postgres 9.6 Parallel Query: Gather, worker result collection
      What is the Gather mechanism?
      - Postgres 9.6 introduced it as a way to collect results from workers
      - Results were collected in arbitrary order from all workers
  41. Postgres 9.6 Parallel Query: Gather, worker result collection
      What is the Gather mechanism?
      - If workers are producing sorted output, this adds overhead
      - Sort the parallel chunks, merge them in random order, then sort again: seems good, right?
      - What about really large result sets? (Remember, we have lots of quotes)
  42. Postgres 9.6 Parallel Query: Gather, worker result collection
      What are my primes?
      - 17 is the first node visited
      - Order is based on visit point and total items: (17, 19, 11, 2, 23, 31)
      - Another pass is required to sort after gathering matching results
      - Workers return as fast as possible, with no location context
  43. Postgres 10 Parallel Query: Gather Merge, optimized worker result collection
      What is Gather Merge?
      - Speeds up the gathering of parallel worker results
      - Sort order is taken into account during parallelization
      - Removes much of the overhead of sorting after the fact
      - Especially beneficial when the result set is large
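      An illustrative plan shape (not captured output) for the throwback-style query: each worker sorts its share, and Gather Merge interleaves the pre-sorted streams instead of re-sorting everything at the end.

      -- Sketch only; the table and time window come from the earlier slides.
      EXPLAIN SELECT * FROM quotes
      WHERE updated_at >= now() - interval '1 month'
      ORDER BY updated_at DESC;
      --  Gather Merge
      --    Workers Planned: 2
      --    ->  Sort
      --          Sort Key: updated_at DESC
      --          ->  Parallel Seq Scan on quotes
      --                Filter: (updated_at >= (now() - '1 mon'::interval))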
  44. Postgres 10 Parallel Query: Gather Merge, optimized worker result collection
      Revisit: what are my primes?
      - 17 is the first node visited
      - Direction and location are considered when workers collect results
      - B-tree construction naturally considers leaf vs. root node locations (insertion balancing)
  45. Postgres 10 Parallel Query: Gather Merge, optimized worker result collection
      Revisit: what are my primes?
      - Tree collapsing happens in parallel now
      - Order will be more intuitive and require less re-sorting
  46. Postgres 9.6 Parallel Query: Merge Join, all followed posters' quotes
      What is Quotes JOIN Customers actually doing?
      - Postgres 9.6 only allowed parallelization of nested loops and hash joins
      - Other parallel portions could not perform the merge joins further up
      - Indexes and ordering all added overhead instead of speeding up requests
  47. Postgres 10 Parallel Query: Parallel Merge Join, optimized ordering
      What's the difference with the parallel merge join?
      - Merge joins can now be performed in all parallel portions of the plan
      - The outer sides of JOINs are parallelized more effectively
      - Combined with parallel index scans, higher efficiency
      - The algorithm yields natural ordering by default without specifying ORDER BY
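      An illustrative plan shape (the tables, columns, and indexes here are assumed, not from the talk) showing a merge join sitting inside the parallel part of a Postgres 10 plan:

      -- Sketch only: the follows table models "posters I follow".
      EXPLAIN SELECT q.*
      FROM quotes q
      JOIN follows f ON f.followed_id = q.character_id
      WHERE f.follower_id = 42;
      --  Gather
      --    Workers Planned: 2
      --    ->  Merge Join
      --          Merge Cond: (q.character_id = f.followed_id)
      --          ->  Parallel Index Scan using quotes_character_id_idx on quotes q
      --          ->  Sort
      --                Sort Key: f.followed_id
      --                ->  Index Scan using follows_follower_id_idx on follows f
      --                      Index Cond: (follower_id = 42)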
  48. Postgres 10 Query Analyzer
      Engineering Support: filing and tracking issues.
      Postgres 9.6 Query Analyzer vs. Postgres 10, against an issues table of 1.8M records (495 MB).
  49. Postgres 9.6 Query Analyzer accuracy

      EXPLAIN ANALYZE SELECT "issues".* FROM "issues"
      WHERE "issues"."repo_id" = 2151 AND "issues"."state" = 'open'
      ORDER BY created_at DESC LIMIT 20 OFFSET 0;

      QUERY PLAN
      -------------------------------------------------------------------------------------------------------------
      Limit  (cost=4267.01..4267.02 rows=20 width=1233) (actual time=935.987..935.992 rows=20 loops=1)
        ->  Sort  (cost=4267.01..4268.21 rows=2384 width=1233) (actual time=935.985..935.988 rows=20 loops=1)
              Sort Key: created_at DESC
              Sort Method: top-N heapsort  Memory: 31kB
              ->  Index Scan using index_issues_on_repo_id_and_id on issues  (cost=0.08..4254.33 rows=2384 width=1233) (actual time=2.877..934.092 rows=2159 loops=1)
                    Index Cond: (repo_id = 2151)
      Planning time: 7.061 ms
      Execution time: 936.042 ms
  50. Postgres 10 Query Analyzer accuracy

      EXPLAIN ANALYZE SELECT "issues".* FROM "issues"
      WHERE "issues"."repo_id" = 2151 AND "issues"."state" = 'open'
      ORDER BY created_at DESC LIMIT 20 OFFSET 0;

      QUERY PLAN
      -------------------------------------------------------------------------------------------------------------
      Limit  (cost=3985.46..3985.47 rows=20 width=1232) (actual time=6.768..6.771 rows=20 loops=1)
        ->  Sort  (cost=3985.46..3986.58 rows=2222 width=1232) (actual time=6.767..6.768 rows=20 loops=1)
              Sort Key: created_at DESC
              Sort Method: top-N heapsort  Memory: 39kB
              ->  Index Scan using index_issues_on_repo_id_and_id on issues  (cost=0.08..3973.64 rows=2222 width=1232) (actual time=0.062..5.952 rows=1980 loops=1)
                    Index Cond: (repo_id = 2151)
      Planning time: 0.745 ms
      Execution time: 6.815 ms
  51. Postgres 9.6: Planning time: 7.061 ms, Execution time: 936.042 ms
      vs.
      Postgres 10: Planning time: 0.745 ms, Execution time: 6.815 ms
  52. Gotchas with PostgreSQL 10
      • Rails 4 doesn't support PostgreSQL 10, only up to PostgreSQL 9.6
      • Partitioning is not native to any Ruby ORM
      • Postgres-specific index types (JSON, hash) are not native to any Ruby ORM
      • Analytics for queries and indexes have to be done in raw SQL
  53. Heroku Postgres improvements released
      • Added support for numerous native extensions
      • Managed Postgres Credentials
      • Released new plans with dedicated hardware
      • Improved Postgres CLI analytics, built on the improved Postgres query analytics
      • Heroku Postgres Connection Pooling
  54. Heroku Postgres extensions enabled
      Made a list, vetted it twice! Top 5 favorites added:
      • lo
      • sslinfo
      • tcn
      • timetravel
      • address_standardizer
      For the complete list, check out the Heroku Postgres Extensions article.
  55. Heroku Postgres Managed Credentials
      What are managed credentials?
      • Postgres permissions are difficult to reason about
      • Creates a more simplified view of Postgres roles
      • Still allows for customizable permissions on roles and tables
  56. Heroku Postgres Managed Credentials
      What are managed credentials? Footguns considered, guard rails added:
      • Special characters (no one likes Bobby Tables)
      • Name lengths
      • Best practices published
  57. Heroku Postgres Managed Credentials
      Why use managed credentials?
      • Separation of concerns
      • Principle of least privilege
      • Better control of external access (e.g. you don't want third-party analytics reading PII data)
      For more info, check out the Heroku Postgres Credentials Dev Center article.
  58. Heroku PGX Plans Released
      Benefits of using PGX:
      • Optimized for heavier I/O work
      • Better database and disk caching
      • More stable performance
      • Increased accuracy performing query analytics
      • Enhanced CPU capability for optimizing Parallel Queries
  59. Heroku PGX Plans Released: big takeaways
      • Dedicated provisioned disk I/O
      • CPU cores doubled at tier 4 and up
      • RAM sizes doubled at tier 4 and up
      • Increased disk sizes at every tier
      • Prices stayed the same or went down!
      Read more about the Heroku Postgres PGX plans on our blog post.
  60. Heroku Postgres CLI analytics
      Database tuning and monitoring are still vital:
      - https://devcenter.heroku.com/articles/heroku-postgres-database-tuning
      Extended CLI analytics, improved thanks to Postgres 10 internals changes:
      - https://github.com/heroku/heroku-pg-extras

      heroku pg:diagnose  # and more!
  61. Heroku Connection Pooling: PgBouncer Beta
      What's PgBouncer?
      • Manages connections for your Postgres database
        ◦ Reduces the need to open new connections
      • Lives on the database server
      • Uses transactional connection pooling
  62. Heroku Connection Pooling: PgBouncer Beta
      Why use connection pooling?
      • Better visibility into connection statistics
      • Exposes guard rails to protect your database
      • Excellent for high-throughput asynchronous workloads
  63. Heroku Connection Pooling: PgBouncer Beta
      - On-server connection pooling is only available on Postgres 9.6 and 10
      - Read more about it: Postgres connection pooling
      - Still in public beta, more features to come!
      - Default direct connections to DBs are still available after enabling
      - Still offering the client-side buildpack: Client-side PgBouncer article