Designing Uber’s High‑Availability Messaging System: Fault Tolerance, Sharding, and Multi‑Data‑Center Strategies
The article details Uber senior engineer Zhao Lei’s presentation on building a highly available messaging platform, covering single‑point failure mitigation, sharding approaches, large‑scale outage handling, cross‑region failover, and the practical engineering practices and protocols used to keep billions of users online.
Zhao Lei, a senior engineer at Uber with prior experience at Microsoft and Facebook, presented "Uber High‑Availability Messaging System Construction" at the global architect summit in Shenzhen, sharing deep insights into distributed fault‑tolerance design.
Distributed System Single Point Failure Handling
Stateless, non‑sharded services can avoid single‑point failures by using load balancers that perform periodic health checks and remove unhealthy nodes from the pool. Simple TCP or HTTP health endpoints may miss business‑logic failures, so adding lightweight business checks to the health endpoint is recommended.
For transient network issues causing RPC call failures, clients typically implement exponential backoff retries. Retries are safe for TCP connection failures but can cause side effects (e.g., duplicate DB writes) on receive timeouts, requiring idempotent service design.
Sharded services need mechanisms to locate failures and propagate updated membership. Two common sharding schemes are described:
Hash the key space into many small shards (e.g., 4K). A single master, elected via ZooKeeper, assigns shards to nodes and coordinates health checks; on master failure a backup takes over.
Consistent hashing, often used for cache clusters, where each client independently health‑checks nodes. To maintain consistency under partitions, Uber employs a gossip protocol for collaborative health monitoring.
Handling Large‑Scale Failures
When a rack‑switch or similar failure reduces capacity, Uber aims to serve a fraction of traffic (e.g., 50%) with the remaining machines, ensuring the service does not completely collapse. Throughput should scale linearly with the number of machines, but CPU saturation above ~80% degrades latency and can trigger cascade failures.
Async I/O frameworks (Node.js, Java NIO) can suffer from event‑loop lag, causing connection buildup and file‑descriptor exhaustion. Uber mitigates this by limiting the number of concurrent connections per Thrift server, allowing the service to stay within a safe QPS range.
Entire Data‑Center Failure Handling
Uber uses "soft failover" for ongoing trips, waiting until completion before migrating users to another data center; hard failover involves DNS redirection and HTTP redirects to quickly route clients to a backup region.
To avoid the thundering‑herd problem during massive failovers, clients implement exponential backoff and load balancers apply rate‑limiting or global black‑hole switches.
Multi‑region deployments rely on independent power and cooling, with high‑throughput low‑latency links within a region and higher‑latency links between regions. Deploying across multiple availability zones further improves resilience.
Q&A Highlights
Health‑check endpoints can embed simple business logic without adding significant load.
State migration during region switches is handled by encrypted, real‑time state downloads on the client; non‑real‑time data uses master‑master SQL replication.
GET requests are safe to retry; POST retries are only safe on connection timeouts or when the service guarantees idempotency.
Rate‑limiting and circuit‑breaking libraries are used, though per‑user selective handling is not yet implemented.
Shard map look‑ups query the master; clients cache the map and refresh it periodically.
Uber stores full user data in each data center, while Facebook stores only a subset per data center.
Data replication strategies include PostgreSQL native replication for real‑time data, MySQL replication for trip data, and plans to adopt Riak for cross‑data‑center availability.
Messaging protocol is HTTPS‑based; Uber is exploring HTTP/2.0 and SPDY via Nginx extensions or Facebook’s Proxygen.
Uber’s RPC stack is transitioning from HTTP+JSON to TChannel+Thrift, with a custom Node.js Thrift implementation (thriftify) and fail‑fast logic placed in routing services.
Global kill‑switches and feature‑flagging are used to limit traffic during overload, allowing selective degradation of services.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
High Availability Architecture
Official account for High Availability Architecture.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
