
ISO Better Scaling, Instacart Drops Postgres for Amazon DynamoDB

Instacart recently switched to Amazon Web Services' DynamoDB from Postgres, and the engineering team is now sharing the schema design that reduced the number of billable writes-per-transaction by over half. 
Nov 29th, 2022 7:56am by

Grocery delivery startup Instacart recently switched from the open source Postgres database system to Amazon Web Services' DynamoDB, and the engineering team is now sharing the schema design that reduced the number of billable writes-per-transaction by over half.

Instacart is in the process of migrating its primary data store over to Amazon’s DynamoDB. The current system, Postgres running on Amazon EC2, was reaching its limits. In a blog post, Instacart software engineer Manas Paldhe explains that the decision to migrate came down to something seemingly innocuous yet important to the people who rely on it: push message notifications.

Instacart engineers recently decided that Postgres on Amazon EC2 instances, in its current form, could no longer keep up with the push message notifications. Rather than refactor the system they had, company engineers decided to move forward with DynamoDB. They had to completely rework the schema design and data modeling to keep usage costs acceptable for their volume of writes.

Patient Zero — Push Message Notifications…

Push message notifications are the primary method of communication between Instacart and its users…

   

Postgres stored the state machine around messages sent to a user. Message volume scales linearly with the number of people placing orders and having them shopped and delivered. Instacart also had some new features slated for release that would add about 8x more notifications on top of this daily baseline. A single Postgres cluster wouldn’t cut it any longer.

The daytime and evening hours are as busy as the nighttime and very early morning hours are quiet. Postgres didn’t scale with demand, and a database that did would be of great benefit to Instacart. After some testing, DynamoDB won out.

How DynamoDB Does Pricing

Instacart’s main concern about DynamoDB was cost, not latency or scaling requirements. For an apples-to-apples comparison, the team compared different DynamoDB options against a sharded Postgres cluster.

“DynamoDB cost is based on the amount of data stored in the DynamoDB table, and the number of reads and writes to it,” Paldhe explained. The first part is “simple to estimate as long as you know how much data you have.”

Reads and writes are trickier. There are two pricing options for read and write usage: “pay per request” and “provisioned capacity.” Instacart uses the “provisioned capacity” option because, the company says, the “pay-per-request” option gets “prohibitively expensive fast” if there are a lot of requests. With provisioned-capacity mode, Instacart autoscales the capacity up and down to “maintain a comfortable headroom.”

Writes are approximately 10x more expensive than reads, so significant data modeling was needed to bring the cost of moving the Postgres tables to DynamoDB down to an acceptable level.

The New Data Model for DynamoDB

The first step was to estimate what the current Postgres table would cost on DynamoDB, so the team could see where to start optimizing.

The first area targeted for improvement was the writes, obvs.

Optimizing for Writes

Here’s the problem: Stored messages stay in the database for seven days based on historical user preferences. This means the table is write-heavy, with most of the load coming from inserts and updates.

AWS charges for reads in Read Capacity Units (RCUs). “One-half of a RCU is consumed for an eventually-consistent row read that is smaller than 4KB in size,” Paldhe said.

Write Capacity Units (WCUs) work on a similar design: a write consumes one WCU for each 1KB of item size, rounded up. Each message is over 1KB, so every write to it consumes at least 2 WCUs, and a message's lifespan includes three updates.
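The capacity-unit rules described above can be sketched in a few lines. This is a minimal illustration rather than Instacart's code: the function names are hypothetical, and it assumes the standard (non-transactional) rates of one RCU per 4KB for a strongly consistent read, half that for an eventually consistent read, and one WCU per 1KB written, with sizes rounded up.

```python
import math


def read_capacity_units(item_bytes: int, strongly_consistent: bool = False) -> float:
    """RCUs for one read: 1 per 4 KB (rounded up), halved if eventually consistent."""
    units = max(math.ceil(item_bytes / 4096), 1)
    return float(units) if strongly_consistent else units / 2


def write_capacity_units(item_bytes: int) -> int:
    """WCUs for one standard write: 1 per 1 KB of item size, rounded up."""
    return max(math.ceil(item_bytes / 1024), 1)


# A 3,000-byte row read with eventual consistency costs half an RCU,
# while writing a 1,500-byte message costs 2 WCUs.
print(read_capacity_units(3000))
print(write_capacity_units(1500))
```

The rounding is the reason item size matters so much here: a message just over 1KB costs twice as much to write as one just under it.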

Writes and updates addressed each message by its primary key, a UUID, but reads went through a Postgres index on the recipient of the notification. To match Postgres’s functionality, DynamoDB required a global secondary index (GSI) on the recipient. Add another WCU to the count.

In total, each record would require 9 WCUs, using the Postgres schema.
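The article does not itemize the 9 WCUs. One breakdown consistent with the figures quoted so far is sketched below, assuming the initial insert and each of the three updates rewrite the full message at 2 WCUs and the GSI adds 1 WCU once. The helper name `lifecycle_wcus` is hypothetical.

```python
def lifecycle_wcus(create_wcus: int, gsi_wcus: int, updates: int, wcus_per_update: int) -> int:
    """Total WCUs consumed by one record from insert through its last update."""
    return create_wcus + gsi_wcus + updates * wcus_per_update


# Assumed breakdown for the original schema:
# 2 WCUs to insert the >1KB message, 1 WCU for the GSI,
# and three updates that each rewrite the >1KB item (2 WCUs apiece).
old_schema_wcus = lifecycle_wcus(create_wcus=2, gsi_wcus=1, updates=3, wcus_per_update=2)
print(old_schema_wcus)  # 9
```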

The Postgres Data Table for Illustration

This table was inefficient for all the reasons listed above, and because each update changed only a single field on the object yet was charged as a write of the whole item. In the name of saving some WCUs, the original table design was changed to follow the single-table design pattern.

The table below shows the new schema.

In the new schema, each message is split into two items that share the same partition key (UUID) but have different sort keys. The first stores the JSON object as an attribute, while the second holds all the timestamps and metadata.

The expected capacity requirement was reduced by 2 WCUs per record lifecycle. The revised equation is as follows:

3 WCU to create a record (1 WCU for the metadata item + 2 WCU for the item containing the > 1KB JSON object)
+ 1 WCU for GSI
+ 3 WCU for three updates (metadata item only)
= 7 WCU per record lifecycle.

Reducing WCUs Based on Reads

This work shaved two WCUs from each record's lifecycle, but could the team cut further? What about data that was always written but rarely read, such as the index? In this round of optimization, the read path was the important factor, since each record was now split into multiple rows.

Could Instacart reduce data size and eliminate the GSI by changing the way the data was read?

Here’s how they optimized for size: Message metadata was stored as a large JSON object, and 75% of the rows were over 1KB. The largest part of each message by far was the JSON field. To reduce its size, Instacart used gzip, which achieves a high compression ratio on JSON. The messages now compress to under 1KB more than 99% of the time.
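The effect of compression on a JSON payload can be seen with Python's standard `gzip` module. This is a rough sketch with a made-up message, not Instacart's data; the field names are invented for illustration, and real payloads will compress differently.

```python
import gzip
import json

# Hypothetical notification payload; the field names are illustrative only.
message = {
    "title": "Your order is on its way",
    "body": "Your shopper has finished shopping and is heading to you.",
    "items": [
        {"item_id": i, "name": f"item-{i}", "status": "found"} for i in range(40)
    ],
}

raw = json.dumps(message).encode("utf-8")
compressed = gzip.compress(raw)

# Round-trip to confirm nothing is lost.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
print("fits in one 1KB write unit:", len(compressed) < 1024)
```

JSON tends to compress well because keys and structural characters repeat across items, so a payload over the 1KB boundary can often drop below it.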

Next, the team set its sights on making the primary key useful enough to eliminate the need for the GSI. The GSI existed only because the Postgres design read through an index on the recipient; an index wasn’t strictly needed. The GSI was replaced with a partition key formed by concatenating the userType and userID.

The second half of the new key is the Range Key. It was added because a single user receives many messages, so the partition key alone is not unique, but the combination of the partition key and the Range Key is. The timestamp was included because messages are sorted and fetched by time. The random ID distinguishes messages that a recipient receives at the same time.

Now the system can query the partition and filter on the range key for bulk read operations rather than building and using a GSI. This required refactoring application logic because of the large number of changes but, to paraphrase Paldhe, “it was worth it.”
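The key design described above can be sketched with an in-memory stand-in for the table. Everything here is an assumption made for illustration: the `_` separator, the zero-padded millisecond timestamp, the function names and the dictionary "table" are hypothetical, not Instacart's actual format or the DynamoDB API.

```python
import uuid


def partition_key(user_type: str, user_id: int) -> str:
    # Hypothetical format: the "_" separator is an assumption.
    return f"{user_type}_{user_id}"


def range_key(sent_at_ms: int, random_id: str = "") -> str:
    # Zero-padding keeps string order identical to chronological order;
    # the random suffix separates messages sent in the same millisecond.
    random_id = random_id or uuid.uuid4().hex
    return f"{sent_at_ms:013d}_{random_id}"


def messages_between(table: dict, pk: str, start_ms: int, end_ms: int) -> list:
    """Return one partition's messages with start_ms <= timestamp <= end_ms, oldest first."""
    low = f"{start_ms:013d}"
    high = f"{end_ms + 1:013d}"
    rows = table.get(pk, {})
    return [rows[rk] for rk in sorted(rows) if low <= rk < high]


# In-memory stand-in for a table: partition key -> {range key: message}.
pk = partition_key("customer", 42)
table = {
    pk: {
        range_key(1_700_000_000_000, "a1"): "order placed",
        range_key(1_700_000_060_000, "b2"): "shopper started",
        range_key(1_700_000_060_000, "c3"): "item replaced",
        range_key(1_700_000_120_000, "d4"): "order delivered",
    }
}

recent = messages_between(table, pk, 1_700_000_060_000, 1_700_000_120_000)
print(recent)
```

Because all of a recipient's messages share one partition and sort by time, a range condition on the sort key does the job that the recipient index did before, with no second index to write.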

The new equation is as follows:

1 WCU * (1 Insert)
+ 1 WCU * (3 Updates)
= 4 WCUs per record lifecycle!
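As a quick check on the arithmetic, the three totals quoted in the article can be compared directly. The labels and the helper name `reduction` below are descriptive choices for this sketch, not terms from Instacart's post.

```python
def reduction(baseline: int, wcus: int) -> float:
    """Fraction of write capacity saved relative to the baseline design."""
    return (baseline - wcus) / baseline


# WCUs per record lifecycle, as quoted in the article.
designs = {
    "original Postgres-style schema": 9,
    "split metadata and payload items": 7,
    "compressed payload, no GSI": 4,
}

baseline = designs["original Postgres-style schema"]
for name, wcus in designs.items():
    print(f"{name}: {wcus} WCUs ({reduction(baseline, wcus):.0%} saved)")
```

Going from 9 to 4 WCUs saves five-ninths of the write capacity, about 56%, which matches the "over half" reduction cited at the top of the article.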

Rollout

The schema was optimized and ready to roll out. The team working on this project didn’t want integration difficulty to become a blocker, so the rollout was nice and slow. To keep the rollout friendly, and to help developers transition smoothly from Postgres to DynamoDB, the engineers chose to thinly wrap an open source library (Dynamoid) that exposed an interface similar to the ActiveRecord one to which they were already accustomed.

As a huge ancillary benefit, Dynamoid allowed engineers working on the DynamoDB project to include the tools that helped propel the project forward (field compression, time-sorted identifiers and compound partition keys) in a simple API.

The first part of the rollout included dual writing, with reading hidden behind a feature flag. Since only seven days of retention were needed, the read switch began ramping up after a week. The rollout was so smooth that just a week after launch the Postgres codepath was eliminated, and the database was downsized.
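The dual-write pattern with a flag-gated read path might look roughly like the sketch below. It is not Instacart's implementation (their wrapper was built on Dynamoid, a Ruby library); the class, its attributes and the percentage-based ramp are all hypothetical, and plain dictionaries stand in for the two databases.

```python
import zlib


class DualWriteStore:
    """Writes go to both stores; a rollout percentage decides where reads go."""

    def __init__(self, legacy_store: dict, new_store: dict, read_rollout_percent: int = 0):
        self.legacy_store = legacy_store
        self.new_store = new_store
        self.read_rollout_percent = read_rollout_percent

    def write(self, key: str, message: str) -> None:
        # Dual write: both stores stay in sync while the flag is ramping.
        self.legacy_store[key] = message
        self.new_store[key] = message

    def reads_from_new(self, key: str) -> bool:
        # Stable bucket per key, so a given key keeps reading from the same store.
        bucket = zlib.crc32(key.encode("utf-8")) % 100
        return bucket < self.read_rollout_percent

    def read(self, key: str):
        store = self.new_store if self.reads_from_new(key) else self.legacy_store
        return store.get(key)


postgres, dynamo = {}, {}  # in-memory stand-ins for the two databases
store = DualWriteStore(postgres, dynamo)
store.write("msg-1", "order delivered")

store.read_rollout_percent = 100  # flag fully ramped: reads now come from the new store
print(store.read("msg-1"))
```

Waiting out the seven-day retention window before ramping reads means the new store already holds every message a reader could ask for.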

Conclusion

In the past six months, Instacart’s use of DynamoDB grew from one to more than 20 tables supporting five to 10 different features across different internal organizations. The engineering team learned more about large, sensitive rollouts. They also learned quite a bit about DynamoDB, and its many supporting tools.

The goal of this undertaking was to be a “trailblazer project,” Paldhe explained, and in that regard, it looks to have been a tremendous success.

TNS owner Insight Partners is an investor in: Unit.