HuoLala’s Cost‑Effective Multi‑Zone High Availability via Multi‑Lane Architecture
This article explains how HuoLala designed a cost‑effective multi‑zone high‑availability solution called the multi‑lane architecture, detailing its goals, deployment of services across availability zones, use of Consul for service discovery, Apollo for configuration, traffic scheduling strategies, and how it differs from traditional active‑active setups.
Background
In recent years, cloud provider data‑center incidents have become common, prompting internet companies to look for ways to keep services stable during outages. Common solutions include same‑city active‑active, two‑region three‑center, and multi‑region active‑active, but each step up adds complexity while delivering diminishing marginal returns. Companies should choose a solution that fits their scale and adapt it to their own needs rather than chase perfection.
HuoLala set two primary goals: support same‑city high availability across multiple availability zones, with recovery within 30 minutes of a zone failure, and keep the extra IT cost introduced by the architecture under 5% of total IT spending.
Based on these goals, HuoLala evolved a cross‑zone high‑availability design called the multi‑lane architecture.
Multi‑Lane Introduction
The multi‑lane architecture routes business traffic to target availability zones (AZs), with each AZ acting as a "lane" that hosts a complete set of services. The system consists of N lanes, each handling 1/N of the traffic. If an AZ fails, its traffic can be shifted to other lanes within minutes, achieving AZ‑level disaster isolation.
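As a minimal sketch of the 1/N split and of shifting a failed lane's share to the remaining lanes, the following assumes an even redistribution across healthy lanes. The lane names and the `lane_weights` helper are hypothetical and not taken from HuoLala's implementation.

```python
def lane_weights(lanes, failed=frozenset()):
    """Return the traffic share per lane, split evenly across healthy lanes."""
    healthy = [lane for lane in lanes if lane not in failed]
    if not healthy:
        raise RuntimeError("no healthy lane left to take traffic")
    share = 1.0 / len(healthy)
    # Failed lanes keep an explicit 0.0 so the result always lists every lane.
    return {lane: (share if lane in healthy else 0.0) for lane in lanes}


# Hypothetical three-lane setup: each lane normally carries 1/3 of the traffic.
normal = lane_weights(["lane-a", "lane-b", "lane-c"])

# If lane-b's AZ fails, its share is spread over the two surviving lanes.
degraded = lane_weights(["lane-a", "lane-b", "lane-c"], failed={"lane-b"})
```

In practice the shares would be applied by the traffic scheduling layer described in section 2.5; this sketch only shows the arithmetic.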
The following sections detail key aspects of the multi‑lane design.
2.1 Application Deployment
Each lane runs a full business chain. To stay within the 5% cost budget, HuoLala runs two lanes and deploys only 50% of the required instances in each. This keeps costs down, but it also means a single lane cannot absorb 100% of the traffic on its own: when one lane fails, the other must be scaled up first, which takes up to 30 minutes even with one‑click scaling.
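The capacity trade‑off can be made concrete with a little arithmetic. The sketch below assumes that capacity scales linearly with instance count and that instances are spread evenly across lanes; the function name and the instance counts are illustrative only.

```python
import math


def extra_instances_per_survivor(required_total, lane_count, failed_lanes=1):
    """How many instances each surviving lane must add after a lane failure.

    required_total: instances needed to serve 100% of traffic (assumed figure).
    """
    survivors = lane_count - failed_lanes
    if survivors <= 0:
        raise ValueError("at least one lane must survive")
    steady_state = math.ceil(required_total / lane_count)   # normal per-lane size
    after_failure = math.ceil(required_total / survivors)   # per-lane size needed
    return after_failure - steady_state


# Two lanes at 50% each: the surviving lane has to double in size.
two_lane_gap = extra_instances_per_survivor(required_total=100, lane_count=2)

# With three lanes the gap per surviving lane is smaller.
three_lane_gap = extra_instances_per_survivor(required_total=90, lane_count=3)
```

The gap returned here is what the one‑click scaling has to provision within the 30‑minute recovery target.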
2.2 Service Registration and Discovery
Consul is used as a cross‑zone high‑availability service registry. A single Consul cluster with five server nodes spans three AZs. As long as no AZ hosts more than two of the five servers, losing any one AZ still leaves a quorum of three, so the cluster elects a new leader if necessary and recovers automatically. All services register with Consul and attach AZ metadata that indicates their lane, which makes service state visible globally across lanes.
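The sketch below illustrates two points from this section: why five servers across three AZs tolerate the loss of one AZ, and what a registration carrying AZ metadata might look like. The 2+2+1 server layout, the `az` metadata key, and the service names and addresses are assumptions; the payload follows the field names of Consul's agent service registration endpoint (`PUT /v1/agent/service/register`), but nothing is sent over the network here.

```python
import json


def quorum(server_count):
    """Raft needs a strict majority of servers to elect a leader and commit."""
    return server_count // 2 + 1


def survives_single_az_loss(servers_per_az):
    """True if the cluster keeps quorum no matter which single AZ is lost."""
    total = sum(servers_per_az.values())
    return all(total - lost >= quorum(total) for lost in servers_per_az.values())


def registration_payload(name, instance_id, address, port, az):
    """Build a service registration body with the lane stored in Meta."""
    return {
        "ID": instance_id,
        "Name": name,
        "Address": address,
        "Port": port,
        "Meta": {"az": az},  # hypothetical key used to mark the lane
    }


layout = {"az-1": 2, "az-2": 2, "az-3": 1}  # assumed placement of 5 servers
payload = registration_payload("order-service", "order-service-1",
                               "10.0.1.12", 8080, "az-1")
body = json.dumps(payload)  # what a client would send to the local agent
```

Placing three of the five servers in one AZ would break the guarantee, since losing that AZ leaves only two servers, below the quorum of three.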
2.3 Configuration Center
Apollo serves as the configuration center. Each lane corresponds to an Apollo cluster that inherits from a default cluster. When a service starts, the Apollo SDK reads the lane information and requests configuration from the appropriate cluster, falling back to defaults for missing keys. HuoLala extended Apollo to support key‑level inheritance rather than whole‑namespace inheritance for finer‑grained control.
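Key‑level inheritance can be illustrated with plain dictionaries. This is not the Apollo SDK and not HuoLala's extension; it is a minimal sketch of the lookup rule described above, with hypothetical keys and values.

```python
def resolve_config(default_cluster, lane_cluster):
    """Resolve configuration key by key: lane value wins, default fills gaps."""
    merged = dict(default_cluster)   # start from every default key
    merged.update(lane_cluster)      # override only the keys the lane defines
    return merged


# Hypothetical namespace contents.
default_cluster = {"db.pool.size": "20", "rpc.timeout.ms": "500"}
lane_a_cluster = {"rpc.timeout.ms": "300"}  # lane overrides a single key

effective = resolve_config(default_cluster, lane_a_cluster)
```

With whole‑namespace inheritance, the lane cluster would have to redefine `db.pool.size` as well; key‑level inheritance lets it override only what differs.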
2.4 Infrastructure
Standard high‑availability solutions provided by the components themselves are used for MySQL, RabbitMQ, Elasticsearch, Redis, etc., without custom modifications. Cloud‑based services rely on the provider’s cross‑zone HA mechanisms (e.g., MySQL primary‑secondary, RabbitMQ mirrored queues).
2.5 Traffic Scheduling
Traffic is divided into external and internal flows.
2.5.1 External Traffic Scheduling
Two approaches exist: DNS‑based load balancing (simple but limited) and a gateway‑based solution (more complex but highly extensible). HuoLala chose the gateway approach because matching drivers and users within the same physical region requires routing both to the same lane, which DNS cannot guarantee.
The gateway, deployed across AZs, forwards traffic to a lane based on the originating city, achieving precise lane‑level routing.
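A city‑based routing rule might look like the sketch below. The city‑to‑lane table, the lane names, and the fallback rule for a failed lane are all assumptions made for illustration; the source only states that the gateway picks a lane from the originating city.

```python
import zlib

# Hypothetical routing table maintained by the gateway.
CITY_TO_LANE = {"shenzhen": "lane-a", "guangzhou": "lane-b"}


def route_city(city, healthy_lanes, table=CITY_TO_LANE):
    """Pick a lane for a city so drivers and users there share one lane."""
    if not healthy_lanes:
        raise RuntimeError("no healthy lane available")
    preferred = table.get(city)
    if preferred in healthy_lanes:
        return preferred
    # Unmapped city or failed lane: choose deterministically from the healthy
    # lanes so every request from the same city still lands in the same lane.
    candidates = sorted(healthy_lanes)
    return candidates[zlib.crc32(city.encode("utf-8")) % len(candidates)]


normal_route = route_city("guangzhou", {"lane-a", "lane-b"})
failover_route = route_city("guangzhou", {"lane-a"})  # lane-b's AZ is down
```

Because the decision depends only on the city and the set of healthy lanes, a driver and a user in the same city are always routed to the same lane, which is the property DNS‑based balancing cannot guarantee.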
2.5.2 Internal Traffic Scheduling
The goal is to keep a request’s entire processing within a single lane to reduce inter‑lane interference and latency. The request flow includes API Gateway, SOA services, RPC calls, message queues, and scheduled jobs. Each component propagates a lane identifier so that downstream services preferentially handle the request within the same AZ. Some cross‑lane calls still occur, especially for message consumption.
The gateway forwards traffic to the API Gateway together with a lane identifier.
API Gateway passes the identifier to downstream SOA services, which default to same‑AZ instances.
SOA services propagate the identifier through RPC calls.
When writing to message queues, the identifier is added to message headers; consumers read it and forward accordingly.
Jobs obtain the lane identifier from their host AZ and include it in RPC calls.
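The propagation steps above can be sketched as two small helpers: one that carries the lane identifier forward on outgoing calls or messages, and one that prefers same‑lane instances when choosing a target. The header name `x-lane`, the instance dictionary shape, and the fallback to cross‑lane instances are assumptions, not details from the source.

```python
LANE_HEADER = "x-lane"  # hypothetical header / message-header name


def outgoing_headers(incoming_headers, host_lane):
    """Carry the lane identifier forward; jobs fall back to their host AZ."""
    lane = incoming_headers.get(LANE_HEADER, host_lane)
    return {**incoming_headers, LANE_HEADER: lane}


def pick_instances(instances, lane):
    """Prefer instances registered in the same lane, else allow cross-lane."""
    same_lane = [inst for inst in instances if inst["meta"].get("az") == lane]
    return same_lane or list(instances)


# Instances as they might be returned by service discovery (assumed shape).
instances = [
    {"id": "pay-1", "meta": {"az": "lane-a"}},
    {"id": "pay-2", "meta": {"az": "lane-b"}},
]

# A request that arrived tagged for lane-b keeps that tag downstream.
rpc_headers = outgoing_headers({"x-lane": "lane-b"}, host_lane="lane-a")
targets = pick_instances(instances, rpc_headers[LANE_HEADER])

# A scheduled job has no incoming tag, so it uses the lane of its host.
job_headers = outgoing_headers({}, host_lane="lane-a")
```

The cross‑lane fallback in `pick_instances` reflects the point that some calls still leave the lane, for example when no same‑lane instance is available.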
Differences from Same‑City Active‑Active
Deployment: Active‑active runs 100% of instances in each AZ, while multi‑lane runs only 50% per AZ to cut costs.
Database: Active‑active replicates full databases across AZs; multi‑lane relies on a single primary DB with read replicas, simplifying setup but incurring cross‑AZ reads.
Scalability: Active‑active typically uses two AZs; multi‑lane supports N lanes, offering greater extensibility.
Conclusion
The multi‑lane architecture provides a cost‑effective, scalable high‑availability solution for HuoLala, balancing operational complexity with business requirements. While the design is straightforward, implementing and maintaining it involves many components and careful coordination to ensure seamless failover without hindering product iteration.
