
Why Roblox’s Three‑Day Outage Happened: Consul Streaming Bug and BoltDB Design Flaw

Roblox’s detailed post‑mortem attributes a three‑day outage to two problems: a contention bug in Consul’s streaming feature and a design flaw in BoltDB’s freelist. Together they produced CPU contention and latency spikes on the company’s large on‑premises infrastructure. In response, the team disabled streaming, added a second data center, and began redesigning its architecture.

21CTO

Roblox, which serves over 50 million teens and pre‑teens, published a lengthy post‑mortem describing a three‑day outage that occurred in late October 2021.

The company runs more than 18,000 servers and manages its own storage and networking, relying heavily on HashiCorp’s Nomad, Vault, and Consul. Consul, HashiCorp’s service‑discovery and service‑mesh tool, was at the center of the incident.

The outage began with symptoms that looked harmless, but it was later traced to a previously unknown bug deep in the software layer that runs Roblox’s infrastructure. Specifically, a recently introduced streaming feature in Consul caused severe CPU contention and latency under high read and write load because of the way it used Go channels.
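The general mechanism can be sketched in a few lines of Go. This is not Consul's code: `publish` and its parameters are hypothetical names chosen for illustration. The sketch only shows the pattern in which many goroutines send through one shared channel, so every send competes for that channel's internal lock, compared with spreading the same traffic over several channels.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// publish starts `producers` goroutines that each send `perProducer` messages.
// Traffic is spread over `shards` channels (shards == 1 means every producer
// shares a single channel). It returns the number of messages delivered.
func publish(producers, perProducer, shards int) int {
	chans := make([]chan int, shards)
	for i := range chans {
		chans[i] = make(chan int, 64)
	}

	// One consumer per channel; each consumer only touches its own counter.
	counts := make([]int, shards)
	var consumers sync.WaitGroup
	for i := range chans {
		consumers.Add(1)
		go func(i int) {
			defer consumers.Done()
			for range chans[i] {
				counts[i]++
			}
		}(i)
	}

	var senders sync.WaitGroup
	for p := 0; p < producers; p++ {
		senders.Add(1)
		go func(p int) {
			defer senders.Done()
			ch := chans[p%shards] // with one shard, all producers contend here
			for m := 0; m < perProducer; m++ {
				ch <- m
			}
		}(p)
	}

	senders.Wait()
	for _, ch := range chans {
		close(ch)
	}
	consumers.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	const producers, perProducer = 64, 20000
	for _, shards := range []int{1, 8} {
		start := time.Now()
		n := publish(producers, perProducer, shards)
		fmt.Printf("shards=%d delivered=%d elapsed=%v\n", shards, n, time.Since(start))
	}
}
```

Both runs deliver the same number of messages; only the elapsed time differs, and by how much depends on core count and scheduler behavior. On machines with many cores the single-channel case tends to suffer most, since more goroutines run in parallel and collide on the same lock.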

Roblox’s hardware, which consists of many‑core NUMA servers, amplified the contention. A second problem lay in BoltDB, the embedded store Consul uses to persist its Raft log. BoltDB tracks free pages in a structure called the “freelist”, and a design flaw makes updating that list costly when large amounts of data are written and then deleted.
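To see why a freelist can become expensive, consider a deliberately simplified model. It is an assumption-laden sketch rather than BoltDB's actual implementation: the `freelist` type, `allocate`, `serializedBytes`, and `fragmented` are hypothetical names. The model keeps free page IDs in a sorted slice, scans it linearly for a contiguous run, and counts the bytes needed to persist the whole list on each commit.

```go
package main

import "fmt"

// freelist is a simplified, hypothetical model of an array-backed freelist:
// a sorted slice of free page IDs. It is not BoltDB's actual code.
type freelist struct {
	ids []uint64
}

// allocate looks for n contiguous free pages by scanning the sorted list from
// the start. It returns the first page ID of the run and how many entries it
// had to inspect before succeeding or giving up.
func (f *freelist) allocate(n int) (start uint64, scanned int, ok bool) {
	if n <= 0 {
		return 0, 0, false
	}
	var runStart, prev uint64
	for i, id := range f.ids {
		scanned++
		if i == 0 || id != prev+1 {
			runStart = id // contiguity broken: a new run starts here
		}
		if id-runStart+1 == uint64(n) {
			// Remove the n entries that end at index i.
			f.ids = append(f.ids[:i-n+1], f.ids[i+1:]...)
			return runStart, scanned, true
		}
		prev = id
	}
	return 0, scanned, false
}

// serializedBytes models persisting the whole list on every commit:
// 8 bytes per free page ID, no matter how little data the commit wrote.
func (f *freelist) serializedBytes() int {
	return 8 * len(f.ids)
}

// fragmented builds a list of `holes` isolated free pages followed by a single
// two-page contiguous run at the very end.
func fragmented(holes int) *freelist {
	ids := make([]uint64, 0, holes+2)
	for i := 0; i < holes; i++ {
		ids = append(ids, uint64(2*i))
	}
	base := uint64(2 * holes)
	ids = append(ids, base, base+1)
	return &freelist{ids: ids}
}

func main() {
	f := fragmented(1000000)
	start, scanned, ok := f.allocate(2)
	fmt.Println("found:", ok, "start page:", start, "entries scanned:", scanned)
	fmt.Println("bytes rewritten per commit:", f.serializedBytes())
}
```

In this model a heavily fragmented list forces a full scan for even a two-page allocation, and the per-commit cost grows with the number of free pages rather than with the amount of data written. That is the shape of the problem the post-mortem describes, even though the real data structure and numbers differ.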

To mitigate the issue, the team disabled streaming on all Consul clusters, added a second data center, and began planning for availability zones. HashiCorp is developing a new Consul version that replaces BoltDB.

Further analysis showed intermittent leader elections in the Consul clusters, and some leaders were slow in a way that matched the latency patterns seen before streaming was disabled. The team prevented the problematic nodes from being elected leader, which kept the rest of the cluster healthy while services were restored.

Developer commentary confirmed that BoltDB was never intended for production use at this scale and that its freelist implementation leads to large write‑latency spikes under Roblox’s workload.

Performance graphs from the original post‑mortem illustrate CPU spin‑lock contention, high core utilization, and the impact of the freelist on latency.

