How Uber Built a Multi‑Region Kafka Architecture for Disaster Recovery
Uber operates one of the largest Apache Kafka deployments in the world, handling trillions of messages daily. It has engineered a multi‑region deployment with active/active and active/passive consumption modes, offset management, and uReplicator to ensure high availability and seamless disaster recovery across data centers.
Uber's Kafka Ecosystem
Uber runs one of the world’s largest Kafka deployments, processing trillions of messages and multiple petabytes of data each day. Kafka is the backbone of Uber’s tech stack, supporting a publish/subscribe bus for rider and driver events, streaming analytics platforms such as Apache Samza and Apache Flink, database change‑log streaming, and ingestion into Uber’s Hadoop data lake.
Multi‑Region Deployment
To achieve scalability, reliability, high performance, and ease of use, Uber built a multi‑region Kafka infrastructure. The design provides business resilience by deploying services and backups across distributed data centers, allowing traffic to continue in other regions if one region becomes unavailable.
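Two types of clusters are used: regional clusters, where producers publish locally, and aggregate clusters, which receive data replicated from the regional clusters to provide a unified global view. In a two‑region deployment, messages from both regional clusters are replicated into the aggregate cluster of each region, so every aggregate cluster ends up holding the same set of messages.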
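The topology can be illustrated with a small in-memory model. This is only a sketch: `Cluster` and `replicate` are hypothetical stand-ins invented for illustration, not Kafka or uReplicator APIs, and a real deployment replicates continuously per partition rather than in one pass.

```python
from collections import defaultdict


class Cluster:
    """Toy stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self, name):
        self.name = name
        self.logs = defaultdict(list)

    def append(self, topic, message):
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1  # offset of the appended message


def replicate(source, destination, topic, start=0):
    """Copy messages from `source` to `destination`, beginning at offset `start`.

    Returns the next source offset to replicate from.
    """
    for message in source.logs[topic][start:]:
        destination.append(topic, message)
    return len(source.logs[topic])


# Two regions, each with a regional cluster and an aggregate cluster.
regional = {"A": Cluster("regional-A"), "B": Cluster("regional-B")}
aggregate = {"A": Cluster("aggregate-A"), "B": Cluster("aggregate-B")}

# Producers publish only to their local regional cluster.
regional["A"].append("trips", "a1")
regional["A"].append("trips", "a2")
regional["B"].append("trips", "b1")

# Every regional cluster is replicated into every aggregate cluster.
for agg in aggregate.values():
    for reg in regional.values():
        replicate(reg, agg, "trips")
```

After replication, both aggregate clusters contain the messages produced in either region, which is what lets a consumer in one region take over for a consumer in the other.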
Active/Active Consumption Mode
In the active/active mode, consumers in each region read from that region’s aggregate cluster. When a region fails, consumers in the other region take over automatically, because the same data is available in both places. Uber’s dynamic pricing service uses this pattern to maintain price calculations across regions.
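A minimal sketch of this pattern follows, assuming each region keeps its own consumer running and a coordinator decides which region's result is served. The `RegionConsumer` class, the `serving_region` helper, and the demand-count state are hypothetical names chosen for illustration; the source does not describe the pricing service's internals.

```python
from dataclasses import dataclass, field


@dataclass
class RegionConsumer:
    """Consumes the same events in its own region and keeps derived state."""

    region: str
    healthy: bool = True
    demand: dict = field(default_factory=dict)

    def consume(self, events):
        for area, requests in events:
            self.demand[area] = self.demand.get(area, 0) + requests


def serving_region(consumers, primary):
    """Return the primary consumer if healthy, otherwise the first healthy one."""
    by_region = {c.region: c for c in consumers}
    if by_region[primary].healthy:
        return by_region[primary]
    for consumer in consumers:
        if consumer.healthy:
            return consumer
    raise RuntimeError("no healthy region available")


events = [("downtown", 3), ("airport", 5), ("downtown", 2)]

region_a = RegionConsumer("region-a")
region_b = RegionConsumer("region-b")
region_a.consume(events)
# The other region may see the same events in a different order.
region_b.consume(list(reversed(events)))

before = serving_region([region_a, region_b], primary="region-a")
region_a.healthy = False  # simulate a regional outage
after = serving_region([region_a, region_b], primary="region-a")
```

Because both consumers have already processed the same events, the surviving region can serve results immediately. This works when the computation does not depend on the exact arrival order, as with the aggregated counts here.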
Active/Passive (Master‑Slave) Consumption Mode
The active/passive mode allows only one consumer (identified by a unique name) to read from the aggregate cluster in the primary region. Offsets are tracked and replicated to other regions. If the primary region fails, the consumer fails over to the secondary region and resumes from the last synchronized offset, ensuring strong consistency for services such as payment processing.
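Offset synchronization is critical because replication latency differs between routes, so each aggregate cluster may interleave messages from the regional clusters in a different order. As a result, the same message can sit at a different offset in each region, and an offset committed against one aggregate cluster cannot be reused as-is against another.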
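A small worked example makes the problem concrete. The message names and arrival orders below are made up for illustration; the helper `build_aggregate` is hypothetical.

```python
def build_aggregate(arrival_order):
    """Return {message: offset} for messages appended in the given order."""
    return {message: offset for offset, message in enumerate(arrival_order)}


# The same three messages arrive in a different order in each region.
aggregate_a = build_aggregate(["a1", "a2", "b1"])
aggregate_b = build_aggregate(["b1", "a1", "a2"])

# A consumer in region A has processed a1 and a2, so its committed
# offset (the next offset to read) is 2.
committed_in_a = aggregate_a["a2"] + 1
processed = {m for m, offset in aggregate_a.items() if offset < committed_in_a}

# Naively reusing that offset in region B skips everything below offset 2 there.
skipped_in_b = {m for m, offset in aggregate_b.items() if offset < committed_in_a}
lost = skipped_in_b - processed          # skipped but never processed
reprocessed = processed - skipped_in_b   # processed once and read again
```

Here `lost` contains `b1`, which the consumer never saw, while `a2` would be processed twice. Copying raw offsets between regions can therefore both drop and duplicate messages.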
Offset Management Service
Uber built a dedicated offset‑management service to map checkpoints between regional and aggregate clusters. uReplicator periodically records offset mappings from source to destination clusters, which are stored in a highly available database. A synchronization job uses these mappings to keep consumer offsets consistent across regions, enabling passive consumers to resume quickly after a failover.
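One way such a translation could work is sketched below. It assumes checkpoints are stored per replication route as ascending `(source_offset, destination_offset)` pairs, and it conservatively picks the smallest candidate so that no message is skipped, at the cost of possibly reprocessing some. The function name `translate_offset`, the data layout, and the sample values are assumptions for illustration, not the actual service's schema or API.

```python
from bisect import bisect_right


def translate_offset(checkpoints, regional_clusters, active, passive, committed):
    """Translate an offset committed in `active` into a safe offset in `passive`.

    checkpoints[(source, destination)] is an ascending list of
    (source_offset, destination_offset) pairs for that replication route.
    Falls back to 0 (start of the log) when no usable checkpoint exists.
    """
    resume_candidates = []
    for regional in regional_clusters:
        # Latest checkpoint at or before the committed offset in the active aggregate.
        to_active = checkpoints[(regional, active)]
        i = bisect_right([dst for _, dst in to_active], committed)
        if i == 0:
            return 0
        source_offset = to_active[i - 1][0]

        # Where did that same source position land in the passive aggregate?
        to_passive = checkpoints[(regional, passive)]
        j = bisect_right([src for src, _ in to_passive], source_offset)
        if j == 0:
            return 0
        resume_candidates.append(to_passive[j - 1][1])

    # The smallest candidate guarantees nothing unprocessed is skipped.
    return min(resume_candidates)


# Hypothetical checkpoints for two regional and two aggregate clusters.
checkpoints = {
    ("regional-a", "agg-a"): [(0, 0), (2, 3), (4, 7)],
    ("regional-b", "agg-a"): [(0, 1), (2, 6)],
    ("regional-a", "agg-b"): [(0, 1), (2, 5), (4, 7)],
    ("regional-b", "agg-b"): [(0, 0), (2, 3)],
}

resume_at = translate_offset(
    checkpoints, ["regional-a", "regional-b"], "agg-a", "agg-b", committed=6
)
```

With these sample values, a consumer that committed offset 6 in `agg-a` resumes at offset 3 in `agg-b`. Some messages are read a second time after failover, so this approach gives at-least-once delivery and consumers need to tolerate duplicates.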
Conclusion
At Uber, continuous business operation depends on uninterrupted cross‑service data flow, with Kafka playing a pivotal role in disaster‑recovery strategies. The multi‑region architecture described here provides robust failover mechanisms, but further work remains to achieve fine‑grained recovery that tolerates individual cluster failures without full region failover.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
