Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

Craig Kerstiens

September 18, 2013
Tweet

More Decks by Craig Kerstiens

Other Decks in Technology

Transcript

  1. Truviso • Extended Postgres to do streaming • Various markets

    • Ad space • Wanted unique impressions • Sort of wanted unique impressions
  2. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  3. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  4. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging • Implemented by Aggregate Knowledge
  5. Use cases • Semi distinct count • Think pg_stat_statements •

    Ad networks • Web traffic • With rollups/groupings
  6. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  7. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  8. Inserting data UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_integer(12345))

      WHERE  id  =  1; UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_text('hello  world'))   WHERE  id  =  1;
  9. Real world CREATE  TABLE  daily_uniques  (        date

                           date  UNIQUE,        users                      hll );
  10. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  11. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  12. Good practices • It uses update • Do as a

    batch in most cases • Tweak the config
  13. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage
  14. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register
  15. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse
  16. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse • spareson - on/off for sparse