Databases 14 min read

Discord’s Migration from MongoDB to Cassandra: Architecture, Data Modeling, and Lessons Learned

This article details how Discord scaled from tens of millions to over a hundred million daily messages by replacing MongoDB with Cassandra, covering the motivations, data model design, handling of tombstones, performance results, unexpected issues, and future scalability plans.

High Availability Architecture
High Availability Architecture
High Availability Architecture
Discord’s Migration from MongoDB to Cassandra: Architecture, Data Modeling, and Lessons Learned

Discord’s voice‑chat service experienced explosive growth, reaching over 120 million messages per day by early 2017, which made the original MongoDB setup unsustainable due to memory pressure and latency.

To meet requirements such as linear scalability, automatic failover, low maintenance, proven technology, predictable performance, non‑binary storage, and open‑source control, the engineering team selected Apache Cassandra as the new primary datastore.

The migration began by analyzing read/write patterns: random reads, a 50/50 read‑write ratio, and different server workloads (voice, private, and large public chat servers). A data model was designed using a composite primary key of (channel_id, bucket, message_id) , where channel_id serves as the partition key and message_id (a Snowflake‑style ID) provides ordering within the partition.

Buckets of roughly ten‑day intervals were introduced to keep partition sizes below 100 MB, avoiding large partitions that cause GC pressure and hinder data distribution.

During the rollout, a dual‑write approach (MongoDB + Cassandra) was used. An early bug revealed null author_id values, leading to a policy of writing only non‑null columns to prevent unnecessary tombstones.

Cassandra’s eventual‑consistency model (AP) and “last write wins” upserts were leveraged, with tombstones expiring after a configurable period (reduced from 10 days to 2 days) to mitigate storage bloat.

Performance testing showed write latencies typically under 1 ms and reads under 5 ms, with stable metrics over a week of production traffic. The system handled rapid jumps to messages from a year ago without noticeable slowdown.

Six months after the migration, a “giant surprise” occurred when a public channel with only one visible message caused long GC pauses because millions of tombstones from mass deletions were still being scanned. The issue was resolved by shortening tombstone TTL and improving query logic to skip empty buckets.

Looking ahead, Discord runs a 12‑node cluster with a replication factor of three, plans to scale to billions of daily messages, and is evaluating upgrades to Cassandra 3 and alternative solutions like ScyllaDB for faster repairs.

In conclusion, the switch to Cassandra delivered the required high availability, scalability, and performance, enabling Discord to continue growing its chat history while maintaining a reliable user experience.

scalabilityHigh AvailabilityData Modelingdatabase migrationCassandraDiscord
High Availability Architecture
Written by

High Availability Architecture

Official account for High Availability Architecture.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.