How Discord Scaled to Billions of Messages: From MongoDB to Cassandra to ScyllaDB

Discord’s rapid growth forced two large database migrations: from MongoDB to Cassandra in 2017, then to ScyllaDB in 2023. This article covers the motivations, requirements, data-model design, performance challenges, migration tooling, and the resulting operational improvements.

dbaplus Community

Dec 26, 2023

Background and Motivation

Until early 2017 Discord stored all chat messages in a single MongoDB cluster. Rapid user growth pushed message volume from 40 million per day in July 2016 to over 120 million per day by January 2017. The data and its indexes no longer fit in RAM, and latency became unpredictable.

First Migration: MongoDB → Cassandra (2017)

Discord needed a scalable, fault-tolerant, low-maintenance store for billions of messages. After analyzing its read/write patterns (highly random reads, a roughly 50/50 read/write ratio, and voice-chat-heavy servers that send almost no messages), the team defined a set of requirements:

Linear scalability: add nodes without re-sharding.

Automatic failover: minimal night-time intervention.

Low maintenance: once configured, the system should run unattended.

Proven feasibility: avoid bleeding-edge tech.

Predictability: 95 % of API responses under 80 ms.

No blob storage: avoid storing large serialized blobs.

Open source: avoid vendor lock-in.

Cassandra satisfied all criteria: it offers linear scalability by adding nodes, tolerates node loss, and is open‑source. Major users such as Netflix and Apple have deployed thousands of Cassandra nodes.

Data Modeling for Cassandra

Discord modeled messages as a key‑key‑value (KKV) store. The partition key is channel_id (identifying a Discord channel) combined with a time‑bucket, and the clustering key is message_id (a Snowflake‑generated, time‑sortable ID). This yields the following minimal schema:

CREATE TABLE messages (
  channel_id bigint,
  bucket int,
  message_id bigint,
  author_id bigint,
  content text,
  PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Using buckets of roughly ten days keeps each partition below 100 MB, which avoids the GC pressure that large partitions put on Cassandra during compaction and cluster expansion.
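As a rough illustration of how a bucket can be derived from a message ID, here is a minimal Python sketch. It assumes Discord's publicly documented Snowflake layout (a millisecond timestamp, counted from the first second of 2015, stored above the low 22 bits) and the ten-day bucket width mentioned above. The function names are chosen for this example and are not taken from Discord's code base.

```python
import time

# Assumptions for illustration: Snowflake timestamps are milliseconds since
# 2015-01-01T00:00:00Z and occupy the bits above the lowest 22.
DISCORD_EPOCH_MS = 1420070400000
BUCKET_SIZE_MS = 1000 * 60 * 60 * 24 * 10  # ten days in milliseconds


def make_bucket(snowflake=None):
    """Return the bucket number for a message ID, or for 'now' if None."""
    if snowflake is None:
        timestamp_ms = int(time.time() * 1000) - DISCORD_EPOCH_MS
    else:
        timestamp_ms = snowflake >> 22  # drop the non-timestamp low bits
    return timestamp_ms // BUCKET_SIZE_MS


def make_buckets(start_id, end_id=None):
    """Return every bucket between two message IDs, oldest first."""
    return range(make_bucket(start_id), make_bucket(end_id) + 1)
```

Because the bucket is a pure function of the message ID, a reader that knows a message ID can compute the full partition key (channel_id, bucket) without any extra lookup.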

Handling Tombstones and Write Patterns

Cassandra treats a write of null as a delete and records a tombstone for it. Because Discord’s message schema has many nullable columns that are usually unset, each insert wrote a tombstone for every unset column, inflating write load for no benefit. The team solved this by writing only non-null columns.
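The idea of writing only non-null columns can be sketched as a small statement builder. This is an illustrative example, not Discord's code: `build_insert` and the `edited_timestamp` column are hypothetical, and the `?` bind markers follow the style of CQL prepared statements.

```python
def build_insert(table, row):
    """Build a parameterized INSERT that names only the columns that are set.

    `table` and the column names are assumed to be trusted identifiers;
    only the values are passed as bind parameters.
    """
    present = {col: val for col, val in row.items() if val is not None}
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    statement = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return statement, list(present.values())


message = {
    "channel_id": 1,
    "bucket": 0,
    "message_id": 42,
    "author_id": 7,
    "content": "hello",
    "edited_timestamp": None,  # hypothetical nullable column, left unset
}
statement, params = build_insert("messages", message)
```

Here the unset column never appears in the statement, so no null is written for it and no tombstone is created.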

Performance Observations

Writes consistently stayed under 1 ms and reads under 5 ms, with stable latency over a week-long test. However, about six months later the cluster began experiencing 10-second stop-the-world GC pauses. They were traced to a channel in one public server where millions of messages had been deleted, leaving millions of tombstones that Cassandra had to scan whenever the channel was read.

Mitigations for the GC Issue

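Lowered the tombstone lifespan (gc_grace_seconds) from 10 days to 2 days, which is safe because repairs run nightly on the message cluster.

Modified query code to track empty buckets and skip them on later reads of the same channel, limiting scans to buckets that hold data.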
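The empty-bucket mitigation can be sketched as follows. This is a simplified, in-memory illustration: `fetch_recent`, `fetch_bucket`, and `empty_buckets` are hypothetical names, and a real service would need to persist or cache the set of empty buckets somewhere shared.

```python
def fetch_recent(channel_id, newest_bucket, oldest_bucket, limit,
                 fetch_bucket, empty_buckets, current_bucket):
    """Collect up to `limit` messages, newest bucket first.

    `fetch_bucket(channel_id, bucket, n)` is a caller-supplied function that
    returns at most n rows from one partition. `empty_buckets` is a set of
    (channel_id, bucket) pairs already known to hold no messages.
    """
    messages = []
    for bucket in range(newest_bucket, oldest_bucket - 1, -1):
        if (channel_id, bucket) in empty_buckets:
            continue  # known empty: skip the query and any tombstone scan
        rows = fetch_bucket(channel_id, bucket, limit - len(messages))
        if not rows:
            # Only remember buckets whose time window has closed; the
            # current bucket can still receive new messages.
            if bucket < current_bucket:
                empty_buckets.add((channel_id, bucket))
            continue
        messages.extend(rows)
        if len(messages) >= limit:
            break
    return messages
```

On the first read of a sparse channel every bucket is queried once; later reads go straight to the buckets that actually contain messages.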

Second Migration: Cassandra → ScyllaDB (2023)

By early 2022 the Cassandra cluster had grown to 177 nodes (up from 12 in 2017, with replication factor 3) and showed severe latency spikes. ScyllaDB, a Cassandra-compatible database written in C++, promised better performance, faster repairs, and an architecture with no garbage collector.

Evaluating ScyllaDB

Because ScyllaDB is written in C++, it has no Java GC pauses, which had been a major pain point for Discord. After internal testing showed faster repairs and lower latency, the team migrated its other databases first and left the cassandra-messages cluster for last.

Data Service Layer in Rust

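To reduce load on the database, Discord built a Rust-based data service that sits between the API and the database cluster. It exposes gRPC endpoints, merges concurrent requests for the same row into a single database query, and routes requests by channel ID so that requests for the same channel reach the same service instance and can be merged. This sharply reduces the traffic that hot channels send to the database.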
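Request merging (often called request coalescing) can be illustrated with a short asyncio sketch. Discord's service is written in Rust; the Python below only demonstrates the idea, and `Coalescer`, `slow_fetch`, and `demo` are names invented for this example.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch among concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch    # async function: key -> value
        self._in_flight = {}   # key -> asyncio.Task for the running fetch

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real fetch.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared fetch.
        return await asyncio.shield(task)


async def demo():
    calls = []

    async def slow_fetch(key):
        calls.append(key)          # record each real "database" query
        await asyncio.sleep(0.01)  # stand-in for database latency
        return f"row-{key}"

    coalescer = Coalescer(slow_fetch)
    results = await asyncio.gather(*(coalescer.get("msg-1") for _ in range(100)))
    return results, calls
```

In `demo`, one hundred concurrent callers ask for the same key, but `slow_fetch` runs only once and every caller receives its result. Routing by channel ID matters for this to work across machines: coalescing only helps if requests for the same row arrive at the same instance.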

Large‑Scale Data Migration

The migration plan required moving trillions of messages without downtime. Steps included:

Deploy a new ScyllaDB cluster on SSD‑backed storage with RAID for durability.

Run dual‑writes to Cassandra and ScyllaDB for new data.

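Use a custom migrator written in Rust, chosen over the Spark-based ScyllaDB migrator that was considered first, to stream token ranges from Cassandra to ScyllaDB. The last few token ranges held huge runs of tombstones and stalled the transfer until they were compacted in Cassandra.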

Validate the data by sending a small percentage of reads to both clusters and comparing the results.
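The token-range step above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the actual migration tool: it assumes the Murmur3 partitioner's signed 64-bit token space, uses a local SQLite table as one possible way to checkpoint progress, and treats `copy_range` as a hypothetical callback that would read one range from the source and write it to the target.

```python
import sqlite3

MIN_TOKEN = -(2**63)     # Murmur3 token space is a signed 64-bit integer
MAX_TOKEN = 2**63 - 1


def split_token_ring(pieces):
    """Split the full token ring into `pieces` contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // pieces for i in range(pieces)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def migrate(ranges, copy_range, db):
    """Copy each range once, recording finished ranges so a restart can resume."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS done ("
        "start_token INTEGER, end_token INTEGER, "
        "PRIMARY KEY (start_token, end_token))"
    )
    for start, end in ranges:
        seen = db.execute(
            "SELECT 1 FROM done WHERE start_token = ? AND end_token = ?",
            (start, end),
        ).fetchone()
        if seen:
            continue  # finished in an earlier run
        # Hypothetical callback; against Cassandra it might run a query like
        #   SELECT * FROM messages
        #   WHERE token(channel_id, bucket) > ? AND token(channel_id, bucket) <= ?
        copy_range(start, end)
        db.execute("INSERT INTO done VALUES (?, ?)", (start, end))
        db.commit()
```

Splitting by token range lets many workers copy disjoint slices in parallel, and the checkpoint table means a crash costs at most the range that was in progress.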

The migration completed in nine days, achieving a peak transfer rate of 3.2 million messages per second. After validation, ScyllaDB became the primary store.
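A quick back-of-the-envelope check shows how these two figures relate. The calculation below assumes, unrealistically, that the peak rate was sustained for the whole nine days, so the result is an upper bound rather than a measured total.

```python
PEAK_RATE = 3_200_000            # messages per second, from the text
DURATION_DAYS = 9                # from the text
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound: peak rate held for the entire migration window.
upper_bound = PEAK_RATE * DURATION_DAYS * SECONDS_PER_DAY
print(f"{upper_bound:,}")  # 2,488,320,000,000
```

The bound of roughly 2.5 trillion messages is consistent with the "trillions of messages" scale described for the migration.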

Post‑Migration Results

Node count dropped from 177 Cassandra nodes to 72 ScyllaDB nodes, each with 9 TB of disk space (vs. an average of 4 TB per Cassandra node). Latency improved markedly: p99 latency for fetching historical messages fell from 40-125 ms to about 15 ms, and p99 insert latency went from 5-70 ms to a steady 5 ms.

Future Plans

The plans Discord listed in its 2017 write-up predate the ScyllaDB move: upgrading the message cluster to Cassandra 3 (whose storage format could reduce storage size by more than 50 %), evaluating ScyllaDB, and possibly archiving unused channels to flat files on Google Cloud Storage.

Conclusion

Through careful requirement analysis, data modeling, and incremental migration tooling, Discord moved from MongoDB to Cassandra and finally to ScyllaDB, gaining linear scalability, predictable latency, and lower operational overhead while storing trillions of messages.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.