How Uber Built a Multi‑Region Kafka Architecture for Disaster Recovery

Uber runs one of the world’s largest Kafka deployments, handling trillions of messages daily. Its multi‑region design combines active/active and active/passive consumption modes, an offset management service, and uReplicator to provide high availability and disaster recovery across data centers.

Programmer DD

Feb 20, 2021

Uber's Kafka Ecosystem

Uber runs one of the world’s largest Kafka deployments, processing trillions of messages and petabytes of data each day. Kafka is the backbone of Uber’s tech stack: it serves as the publish/subscribe bus for rider and driver events, feeds streaming analytics platforms such as Apache Samza and Apache Flink, streams database change logs, and ingests data into Uber’s Hadoop data lake.

Multi‑Region Deployment

To achieve scalability, reliability, high performance, and ease of use, Uber built a multi‑region Kafka infrastructure. The design provides business resilience by deploying services and backups across distributed data centers, allowing traffic to continue in other regions if one region becomes unavailable.

Two types of clusters are used. Regional clusters are where producers publish locally. Aggregate clusters receive data replicated from the regional clusters of every region, so that each one provides a unified, global view. The diagram below shows the replication topology between two regions.

Figure 2: Kafka replication topology between two regions
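
As a rough illustration of this topology, the following Python sketch enumerates replication routes under the assumption that every regional cluster is replicated into an aggregate cluster in each region. The region and cluster names are hypothetical, and the `replicate` helper only copies lists; it does not model uReplicator itself or the order in which messages arrive.

```python
from collections import defaultdict

REGIONS = ["region_a", "region_b"]  # hypothetical region names


def replication_routes(regions):
    """Return (source, destination) pairs: each regional cluster feeds every aggregate cluster."""
    return [
        (f"regional-{src}", f"aggregate-{dst}")
        for src in regions
        for dst in regions
    ]


def replicate(regional_logs, regions):
    """Copy each regional log into every aggregate cluster (arrival order is not modelled)."""
    aggregates = defaultdict(list)
    for src, dst in replication_routes(regions):
        aggregates[dst].extend(regional_logs.get(src, []))
    return dict(aggregates)


# Illustrative messages published locally in each region.
regional_logs = {
    "regional-region_a": ["a1", "a2"],
    "regional-region_b": ["b1"],
}
aggregates = replicate(regional_logs, REGIONS)

for cluster, log in sorted(aggregates.items()):
    print(cluster, log)
```

With two regions this yields four routes, and each aggregate cluster ends up holding the messages from both regional clusters.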

Active/Active Consumption Mode

In active/active mode, consumers in each region read from the aggregate cluster in their own region. Because every aggregate cluster holds the same data, consumers in a healthy region can take over when another region fails, without copying any state between regions. Uber’s dynamic pricing service uses this pattern to keep price calculations running across regions.

Figure 3: Active/Active consumption architecture
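
A minimal sketch of the idea, assuming each region keeps its own state up to date from its local aggregate cluster and that some coordinator (hypothetical here, reduced to a `health` dictionary) decides which region serves results. The `demand_by_area` aggregation is an invented stand‑in for a pricing computation, chosen because its result does not depend on message order.

```python
from collections import Counter


def demand_by_area(messages):
    """Order-insensitive aggregation: count events per area."""
    return Counter(msg["area"] for msg in messages)


def pick_serving_region(health, preferred):
    """Return the preferred region if it is healthy, else the first healthy alternative."""
    if health.get(preferred, False):
        return preferred
    for region, ok in health.items():
        if ok:
            return region
    raise RuntimeError("no healthy region available")


# Each aggregate cluster holds the same messages, possibly in a different order.
aggregate_logs = {
    "region_a": [{"area": "downtown"}, {"area": "airport"}, {"area": "downtown"}],
    "region_b": [{"area": "airport"}, {"area": "downtown"}, {"area": "downtown"}],
}

# Both regions compute continuously, so either one can serve at any time.
state = {region: demand_by_area(log) for region, log in aggregate_logs.items()}

# Simulate region_a becoming unavailable.
health = {"region_a": False, "region_b": True}
serving = pick_serving_region(health, preferred="region_a")
print(serving, dict(state[serving]))
```

The trade‑off in this sketch is redundant computation: every region does the full work so that failover needs no offset translation.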

Active/Passive (Master‑Slave) Consumption Mode

The active/passive mode allows only one consumer (identified by a unique name) to read from the aggregate cluster in the primary region at a time. Its offsets are tracked and replicated to the other regions. If the primary region fails, the consumer fails over to a secondary region and resumes from the last synchronized offset. This mode suits services that need strong consistency, such as payment processing.

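Offset synchronization is critical because replication latency lets messages from different regions interleave differently in each aggregate cluster. The same message can therefore sit at a different offset in each region, so an offset committed in one aggregate cluster cannot simply be reused in another. Figure 4 illustrates this divergent ordering after cross‑region replication.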

Figure 4: Cross‑region message ordering and checkpoint
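
To make the problem concrete, here is a small sketch with invented message IDs. Both aggregate logs contain the same five messages and preserve per‑region order, but they interleave the two regional streams differently. Reusing an offset committed in one cluster against the other then both skips and repeats messages.

```python
# Same messages, different interleaving (all values are illustrative).
aggregate_a = ["a1", "a2", "b1", "a3", "b2"]
aggregate_b = ["b1", "a1", "b2", "a2", "a3"]

# The consumer has processed offsets 0..2 in aggregate A and committed offset 3.
committed_in_a = 3
processed = set(aggregate_a[:committed_in_a])

# Naive failover: resume from the same numeric offset in aggregate B.
naive_resume = aggregate_b[committed_in_a:]

# Messages before the resume point in B that were never processed are lost.
skipped = [m for m in aggregate_b[:committed_in_a] if m not in processed]
# Messages after the resume point that were already processed are duplicated.
reprocessed = [m for m in naive_resume if m in processed]

print("skipped:", skipped)          # skipped: ['b2']
print("reprocessed:", reprocessed)  # reprocessed: ['a2']
```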

Offset Management Service

Uber built a dedicated offset management service to map checkpoints between regional and aggregate clusters. uReplicator, Uber’s cross‑cluster Kafka replicator, periodically records the mapping between source and destination offsets as it copies messages, and these mappings are stored in a highly available database. A synchronization job then uses the mappings to keep consumer offsets consistent across regions, so that a passive consumer can resume quickly after a failover.
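
One conservative way to use such mappings is sketched below. It is an illustration under stated assumptions, not Uber’s implementation: the `checkpoints` layout, the cluster names, and the `translate_offset` function are all hypothetical. For each regional source, the sketch finds the latest checkpoint at or before the committed offset in the active aggregate cluster, looks up where that regional offset landed in the passive aggregate cluster, and takes the minimum across sources. Taking the minimum means some messages may be read twice after failover, but none are skipped, so the consumer is assumed to tolerate duplicates.

```python
from bisect import bisect_right

# checkpoints[(source_regional, destination_aggregate)] is a list of
# (regional_offset, aggregate_offset) pairs, sorted ascending.
# All values are illustrative.
checkpoints = {
    ("regional-a", "aggregate-a"): [(0, 0), (2, 3), (4, 7)],
    ("regional-b", "aggregate-a"): [(0, 1), (2, 5), (4, 9)],
    ("regional-a", "aggregate-b"): [(0, 1), (2, 4), (4, 8)],
    ("regional-b", "aggregate-b"): [(0, 0), (2, 3), (4, 6)],
}


def latest_at_or_before(pairs, key_index, value):
    """Return the last pair whose field at key_index is <= value, or None."""
    keys = [pair[key_index] for pair in pairs]
    i = bisect_right(keys, value)
    return pairs[i - 1] if i else None


def translate_offset(committed, active, passive, sources, checkpoints):
    """Map an offset committed in the active aggregate cluster to a safe
    resume offset in the passive aggregate cluster."""
    candidates = []
    for source in sources:
        # Latest checkpoint the consumer has fully passed in the active cluster.
        hit = latest_at_or_before(checkpoints[(source, active)], 1, committed)
        if hit is None:
            return 0  # no usable checkpoint yet: fall back to the beginning
        regional_offset = hit[0]
        # Where that regional offset landed in the passive cluster.
        mapped = latest_at_or_before(checkpoints[(source, passive)], 0, regional_offset)
        if mapped is None:
            return 0
        candidates.append(mapped[1])
    # The smallest candidate guarantees no source has unread messages before it.
    return min(candidates)


resume_at = translate_offset(
    committed=6,
    active="aggregate-a",
    passive="aggregate-b",
    sources=["regional-a", "regional-b"],
    checkpoints=checkpoints,
)
print(resume_at)  # 3
```

The granularity of the checkpoints sets the cost of failover in this sketch: sparser checkpoints mean a larger replay window on the passive side.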

Conclusion

At Uber, continuous business operation depends on uninterrupted cross‑service data flow, with Kafka playing a pivotal role in disaster‑recovery strategies. The multi‑region architecture described here provides robust failover mechanisms, but further work remains to achieve fine‑grained recovery that tolerates individual cluster failures without full region failover.
