Scaling Coinbase’s Platform for Spikes in Customer Demand: Lessons, Monitoring, and Traffic Replay

Since 2017, cryptocurrency‑driven traffic growth has pushed Coinbase through a series of backend engineering improvements, including database upgrades, better monitoring, a refactored user‑device relationship, caching, and a custom traffic capture‑and‑replay system, to keep the platform reliable and scalable during demand spikes.

High Availability Architecture

Since 2017, global interest in cryptocurrency has caused Coinbase’s traffic to surge from a stable 25,000 API requests per minute to well beyond its 100,000‑request red line, exposing reliability and scalability challenges that the engineering team needed to address.

During the 2017 peak, service outages occurred when traffic exceeded the red line; the team responded by vertically scaling, upgrading MongoDB versions, optimizing indexes, and splitting hot collections, but continued growth required deeper investigation.

Enhanced monitoring revealed latency spikes of up to 100× and a mismatch between the processing times reported by the Ruby application servers and the times MongoDB reported for the same operations, a discrepancy the team referred to as “ghost” queries. To diagnose it, the team instrumented the MongoDB driver to log every query that exceeded a latency threshold, capturing request and response sizes, the originating source code line, and other metadata, which fed a detailed dashboard.
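
The kind of driver-level instrumentation described here can be sketched roughly as follows. This is a minimal illustration in Python rather than Coinbase's Ruby code: `InstrumentedDriver`, `SlowQueryRecord`, the `execute` callable, and the 100 ms default threshold are all hypothetical names and values, and `len(repr(...))` is only a stand-in for real wire sizes.

```python
import time
import traceback
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class SlowQueryRecord:
    operation: str
    elapsed_ms: float
    request_bytes: int   # approximated from repr(); not a real wire size
    response_bytes: int  # approximated from repr(); not a real wire size
    call_site: str       # "file:line" of the code that issued the query


@dataclass
class InstrumentedDriver:
    """Wraps a query function and records calls at or above threshold_ms."""
    execute: Callable[[str, dict], Any]  # hypothetical underlying driver call
    threshold_ms: float = 100.0          # assumed threshold, for illustration
    clock: Callable[[], float] = time.perf_counter
    slow_log: List[SlowQueryRecord] = field(default_factory=list)

    def run(self, operation: str, query: dict) -> Any:
        start = self.clock()
        result = self.execute(operation, query)
        elapsed_ms = (self.clock() - start) * 1000.0
        if elapsed_ms >= self.threshold_ms:
            # With limit=2 the stack is [caller of run, run]; take the caller.
            caller = traceback.extract_stack(limit=2)[0]
            self.slow_log.append(SlowQueryRecord(
                operation=operation,
                elapsed_ms=elapsed_ms,
                request_bytes=len(repr(query)),
                response_bytes=len(repr(result)),
                call_site=f"{caller.filename}:{caller.lineno}",
            ))
        return result
```

Measuring in the application process, next to the driver, is what makes the comparison with database-side timings possible: time lost between the two measurements shows up as a gap rather than disappearing.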

The dashboard showed a large number of queries generated during user login and dashboard view, caused by a many‑to‑many user‑device relationship and a poor device‑fingerprint algorithm that grouped many users under a single device. Refactoring this to a one‑to‑many relationship (each device maps to a single user) yielded a dramatic performance boost.
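
To see why the relationship shape matters, the sketch below counts documents touched when loading one user's devices under each model. It is a toy model under stated assumptions: the source does not describe the exact queries, so the read-counting rule (one read per device plus one per linked user), the function names, and the fingerprint strings are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Login = Tuple[str, str]  # (user id, device fingerprint)


def many_to_many_reads(logins: List[Login], user: str) -> int:
    """Documents touched when a device can be linked to many users."""
    users_by_device: Dict[str, Set[str]] = defaultdict(set)
    devices_by_user: Dict[str, Set[str]] = defaultdict(set)
    for u, fingerprint in logins:
        users_by_device[fingerprint].add(u)
        devices_by_user[u].add(fingerprint)
    reads = 0
    for fingerprint in devices_by_user[user]:
        reads += 1                                  # the device document
        reads += len(users_by_device[fingerprint])  # every user linked to it
    return reads


def one_to_many_reads(logins: List[Login], user: str) -> int:
    """Documents touched when each device record belongs to exactly one user."""
    devices = {(u, fingerprint) for u, fingerprint in logins}
    return sum(1 for (u, _fingerprint) in devices if u == user)


# A weak fingerprint that collides for 1,000 users, plus one ordinary device.
logins = [(f"user{i}", "fp-collision") for i in range(1000)]
logins.append(("user0", "fp-laptop"))
print(many_to_many_reads(logins, "user0"))  # 1003
print(one_to_many_reads(logins, "user0"))   # 2
```

In this model a single colliding fingerprint makes the cost of one user's login scale with the number of unrelated users sharing it, while the one-to-many shape keeps the cost proportional to that user's own devices.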

Another bottleneck was heavy reads from hot collections. The team introduced a memcached layer that answered queries from cache before they reached the database and invalidated cached entries on writes. The cache sat at the ORM/driver level, which kept it decoupled from business logic.
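
A read-through cache with invalidation on write, placed in the data-access layer, might look something like the sketch below. It assumes plain dictionaries standing in for both memcached and the database, and `CachedCollection` with its method names is hypothetical, not the API of any real ORM or driver.

```python
from typing import Dict, Optional


class CachedCollection:
    """Data-access wrapper: reads go through the cache, writes invalidate it."""

    def __init__(self, store: Dict[str, dict],
                 cache: Optional[Dict[str, dict]] = None) -> None:
        self._store = store                               # stands in for the database
        self._cache = cache if cache is not None else {}  # stands in for memcached
        self.db_reads = 0                                 # counts cache misses

    def find_by_id(self, doc_id: str) -> Optional[dict]:
        cached = self._cache.get(doc_id)
        if cached is not None:
            return dict(cached)          # cache hit: no database read
        self.db_reads += 1
        doc = self._store.get(doc_id)
        if doc is not None:
            self._cache[doc_id] = dict(doc)
            return dict(doc)
        return None

    def update(self, doc_id: str, fields: dict) -> None:
        self._store[doc_id] = {**self._store.get(doc_id, {}), **fields}
        self._cache.pop(doc_id, None)    # invalidate so the next read is fresh


store = {"u1": {"name": "Ada"}}
users = CachedCollection(store)
users.find_by_id("u1")
users.find_by_id("u1")                 # served from cache
users.update("u1", {"name": "Grace"})  # drops the cached entry
print(users.find_by_id("u1"), users.db_reads)  # {'name': 'Grace'} 2
```

Because callers only ever see `find_by_id` and `update`, business logic never has to know whether a value came from the cache or the database, which is the decoupling the paragraph above describes.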

These changes proved effective during the traffic spikes of December and January, allowing Coinbase to handle larger loads with improved stability.

To prepare for future spikes, Coinbase built a traffic capture and replay system. The capture tool (a wrapper around mongoreplay) snapshots traffic directed at a chosen MongoDB cluster, encrypts it, and stores it in S3. The cannon tool replays the captured traffic to a newly launched cluster, supporting configurable replay speed and a 10 MB buffer to merge streams from multiple application servers.
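
The two ideas behind the replay side, merging per-server captures into one timeline and pacing them at a configurable speed, can be sketched as below. This is not the cannon tool or mongoreplay: `merge_streams`, `replay`, the `(timestamp, payload)` tuple format, and the injectable `sleep` are assumptions made for illustration, and the sketch does not model the 10 MB byte-bounded buffer.

```python
import heapq
import time
from typing import Callable, Iterable, Iterator, Tuple

Op = Tuple[float, str]  # (capture timestamp in seconds, opaque operation payload)


def merge_streams(streams: Iterable[Iterable[Op]]) -> Iterator[Op]:
    """Lazily merge per-server streams, each already sorted by timestamp."""
    return heapq.merge(*streams, key=lambda op: op[0])


def replay(ops: Iterable[Op], send: Callable[[str], None],
           speed: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> None:
    """Send each operation, preserving captured gaps divided by `speed`."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    previous = None
    for timestamp, payload in ops:
        if previous is not None:
            sleep((timestamp - previous) / speed)  # speed=2.0 halves every gap
        send(payload)
        previous = timestamp


server_a = [(0.0, "a1"), (2.0, "a2")]
server_b = [(1.0, "b1"), (4.0, "b2")]
sent, waits = [], []
replay(merge_streams([server_a, server_b]), sent.append,
       speed=2.0, sleep=waits.append)  # record waits instead of sleeping
print(sent)   # ['a1', 'b1', 'a2', 'b2']
print(waits)  # [0.5, 0.5, 1.0]
```

Replaying faster than real time is what lets a captured workload stand in for a larger future spike: the same operations arrive in the same order, only compressed.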

Early use of capture and cannon uncovered that the Ruby MongoDB driver interleaved ping commands with find operations, violating driver specifications and contributing to the earlier ghost‑query behavior.

Overall, the engineering effort demonstrates that, alongside security, reliability is critical for Coinbase’s platform to support customers buying, selling, and managing cryptocurrency during periods of extreme demand.
