How Notion Scaled PostgreSQL with Database Sharding
Notion tackled severe PostgreSQL performance limits by sharding its Block table and related tables across 480 logical shards on 32 physical databases, using workspace IDs as shard keys, a dual‑write migration, and rigorous validation to achieve near‑zero downtime and faster response times.
Sharding Challenge
Notion, a popular productivity platform, relied on a single monolithic PostgreSQL instance for five years, supporting massive growth. By mid‑2020 the database faced increasing CPU spikes, standby requests, and a failing VACUUM process, prompting the need for a sustainable scaling solution.
Sharding Strategy and Design Decisions
The team chose horizontal sharding and identified the Block table as the primary candidate because Notion’s data model is built around blocks that can contain other blocks, forming a hierarchy.
To avoid costly cross‑shard queries, they decided to shard not only the Block table but also all tables linked to it, keeping related data together.
They selected the workspace ID as the shard key, ensuring that all blocks belonging to the same workspace reside on the same shard, which minimizes cross‑shard access.
Notion distributed 480 logical shards across 32 physical databases, assigning 15 logical shards per physical node. This number was chosen because it divides evenly by many values, allowing smooth scaling from 32 to 40 or 48 hosts.
Implementation and Migration Process
The migration began with a dual‑write phase: new writes were recorded in an audit log and applied to both the old and new databases, avoiding direct writes to two databases that could cause inconsistencies.
A custom script copied all existing data to the new sharded setup on a 96‑CPU machine, taking about three days.
Data correctness was verified through two methods:
Running a validation script that compared random UUID records between the old and new databases.
Performing “dark reads” where requests fetched data from both databases, compared results, and logged any differences without exposing the new data to users.
The final cut‑over required only a few minutes of downtime, appearing to users as a brief unavailability while the backend switched to the new sharded architecture.
Lessons Learned
Key takeaways include starting migration early before the monolith becomes a bottleneck, aiming for zero‑downtime cut‑overs, and choosing an appropriate shard key—workspace ID proved effective for co‑locating related data and reducing cross‑shard queries.
Overall, the sharding effort dramatically improved Notion’s performance, handling higher loads with smoother operation and faster response times.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
