How Alibaba Scaled Service Mesh for Double‑11: Architecture, Challenges & Performance
This article details Alibaba's large‑scale Service Mesh deployment for Double‑11 core applications. It covers the three‑plane architecture; key challenges such as meshing without an SDK upgrade, complex routing, rate limiting, and Envoy resource overhead; the measured impact on latency, CPU, and memory; and the future roadmap and open‑source collaboration.
Cloud native has become the foundational infrastructure for Alibaba's ecosystem, and Service Mesh, as a core cloud‑native technology, was successfully validated in the demanding Double‑11 core‑application scenarios. The author shares the challenges faced and solutions adopted during this rollout.
Deployment Architecture
The deployment architecture (see image below) focuses on RPC mesh between Service A and Service B. Service Mesh comprises three planes:
Data Plane – the open‑source Envoy sidecar (Sidecar and Envoy are interchangeable in this article).
Control Plane – the open‑source Istio, currently using only the Pilot component.
Operation Plane – a fully custom implementation.
Unlike the previous rollout, Pilot is now deployed as an independent cluster rather than co‑located with the Envoy sidecars. This is the intended end state for Service Mesh control‑plane deployment.
Challenges
1. Mesh without upgrading SDK
During Double‑11, the Java RPC SDK version was locked and there was no time to develop a mesh‑compatible SDK. Istio normally uses iptables NAT for transparent traffic interception, but the nf_conntrack kernel module had been removed from Alibaba's production machines, so the community solution could not be used. The Alibaba OS team co‑developed a custom transparent interception component based on userid and mark identifiers, implemented via the iptables mangle table.
This redirects traffic to Envoy without any application changes. However, the original SDK still performs its own service discovery and routing, so on the Consumer side discovery and routing happen twice (once in the SDK and once in Envoy), which adds latency.
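The source does not list the actual interception rules, so the following is only a minimal sketch of the decision logic that a userid‑and‑mark scheme implies: traffic from the sidecar's own user, or traffic already carrying the mark, must pass through untouched to avoid a redirect loop, and everything else goes to the sidecar. The uid and mark values, the `Packet` type, and the function name are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

SIDECAR_UID = 1337     # hypothetical uid that the sidecar process runs as
INTERCEPT_MARK = 0x1   # hypothetical mark meaning "already handled, do not redirect again"


@dataclass(frozen=True)
class Packet:
    owner_uid: int  # uid of the process that owns the sending socket
    mark: int       # packet mark (0 means unmarked)
    dst_ip: str
    dst_port: int


def classify_outbound(pkt: Packet) -> str:
    """Return 'pass' or 'to_sidecar' for an outbound packet."""
    if pkt.owner_uid == SIDECAR_UID:
        # The sidecar's own upstream traffic must not be redirected back to it.
        return "pass"
    if pkt.mark == INTERCEPT_MARK:
        # Already marked on an earlier pass; let it through.
        return "pass"
    # Application traffic: hand it to the sidecar.
    return "to_sidecar"


if __name__ == "__main__":
    print(classify_outbound(Packet(1000, 0, "10.0.0.8", 12200)))
    print(classify_outbound(Packet(SIDECAR_UID, 0, "10.0.0.8", 12200)))
```

In a real deployment these checks would be expressed as mangle‑table rules rather than application code; the sketch only shows why both identifiers are needed.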
2. Supporting complex service governance
Alibaba's Java RPC framework embeds Groovy scripts for routing based on method name, parameters, and application name. The plan is to replace Groovy with extensions to Istio's native CRDs (VirtualService and DestinationRule) that express RPC‑specific routing.
The current customizations in Istio and Envoy rely on hacks. Future work will design Wasm‑based routing plugins so that routing policies stay flexible yet maintainable.
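To illustrate what moving from scripts to declarative rules means, here is a small sketch of routing on method name, parameters, and calling application. It does not reproduce Istio's VirtualService schema or Alibaba's extensions; `RpcRequest`, `RouteRule`, `pick_subset`, and the sample service names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RpcRequest:
    method: str
    source_app: str
    args: dict = field(default_factory=dict)


@dataclass
class RouteRule:
    subset: str                        # destination subset to route to
    method: Optional[str] = None       # None means "any method"
    source_app: Optional[str] = None   # None means "any caller"
    arg_equals: dict = field(default_factory=dict)  # required argument values

    def matches(self, req: RpcRequest) -> bool:
        if self.method is not None and req.method != self.method:
            return False
        if self.source_app is not None and req.source_app != self.source_app:
            return False
        return all(req.args.get(k) == v for k, v in self.arg_equals.items())


def pick_subset(rules, req, default="default"):
    """Return the subset of the first matching rule, or the default subset."""
    for rule in rules:
        if rule.matches(req):
            return rule.subset
    return default


if __name__ == "__main__":
    rules = [
        RouteRule(subset="gray", method="queryOrder", arg_equals={"region": "hz"}),
        RouteRule(subset="internal", source_app="trade-center"),
    ]
    print(pick_subset(rules, RpcRequest("queryOrder", "cart", {"region": "hz"})))
```

Because the rules are plain data rather than code, a control plane can validate and distribute them, which is harder to do safely with embedded scripts.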
3. Rate limiting
Instead of Istio’s Mixer, Alibaba integrates the widely used Sentinel component as an Envoy filter for the Dubbo protocol. Pilot fetches the rate‑limiting configuration from Nacos and distributes it to the sidecars via xDS.
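As a rough picture of what a rate‑limiting filter decides per request, the sketch below implements a plain sliding‑window limiter. It is not Sentinel's algorithm or API, and the class name and parameters are hypothetical; in the setup described above, the limit would arrive as configuration pushed over xDS rather than as a constructor argument.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests within any window of window_seconds."""

    def __init__(self, max_requests, window_seconds=1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock           # injectable so the limiter can be tested
        self._timestamps = deque()   # admission times still inside the window

    def allow(self) -> bool:
        now = self.clock()
        # Drop admissions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=1.0)
    print([limiter.allow() for _ in range(3)])
```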
4. Envoy resource overhead
Envoy’s fine‑grained stats (down to IP level) caused large memory consumption, especially with hundreds of thousands of IPs in Alibaba’s e‑commerce services. A stats switch was added to disable IP‑level stats, reducing memory usage by about 30%.
Future work will adopt the community stats symbol‑table approach to eliminate duplicate metric strings and further cut memory.
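The idea behind a stats symbol table can be sketched in a few lines: metric names share most of their dot‑separated tokens, so each distinct token is stored once and every name becomes a short sequence of integer ids. Envoy's real implementation is in C++ and differs in detail; the class below and the sample metric names are illustrative assumptions only.

```python
class SymbolTable:
    """Store each distinct name token once; represent names as tuples of ids."""

    def __init__(self):
        self._ids = {}      # token -> id
        self._tokens = []   # id -> token

    def encode(self, stat_name: str) -> tuple:
        ids = []
        for token in stat_name.split("."):
            if token not in self._ids:
                self._ids[token] = len(self._tokens)
                self._tokens.append(token)
            ids.append(self._ids[token])
        return tuple(ids)

    def decode(self, ids) -> str:
        return ".".join(self._tokens[i] for i in ids)

    def unique_tokens(self) -> int:
        return len(self._tokens)


if __name__ == "__main__":
    table = SymbolTable()
    names = [
        "cluster.order_service.upstream_rq_total",
        "cluster.order_service.upstream_cx_total",
        "cluster.cart_service.upstream_rq_total",
    ]
    encoded = [table.encode(n) for n in names]
    # Nine tokens across the three names, but only five distinct strings are stored.
    print(encoded, table.unique_tokens())
```

The saving grows with the number of metrics, since prefixes such as the cluster name and suffixes such as the counter name repeat across almost every entry.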
5. Decoupling business and infrastructure
To achieve zero‑downtime upgrades, a dual‑process sidecar strategy is used: a new sidecar container is started, exchanges runtime data with the old sidecar, then takes over traffic while the old sidecar gracefully exits after a delay.
The hot‑upgrade relies on Unix Domain Socket and graceful shutdown mechanisms.
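The source does not say exactly what runtime data the two sidecars exchange. One common technique for this kind of handover is to pass an open listening socket to the new process over a Unix domain socket, so that the port never stops accepting connections. The sketch below shows that technique under this assumption, using Python 3.9+ on a Unix system; the function names are hypothetical, and both ends run in one process purely for demonstration.

```python
import socket


def hand_over_listener(channel: socket.socket, listener: socket.socket) -> None:
    """Old process: send the listening socket's descriptor over the channel."""
    socket.send_fds(channel, [b"listener"], [listener.fileno()])


def take_over_listener(channel: socket.socket) -> socket.socket:
    """New process: receive the descriptor and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, 1)
    return socket.socket(fileno=fds[0])


if __name__ == "__main__":
    # A socketpair stands in for the Unix domain socket between old and new sidecar.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()

    hand_over_listener(old_end, listener)
    inherited = take_over_listener(new_end)
    # Both objects now refer to the same listening socket.
    print(inherited.getsockname() == listener.getsockname())
```

After the handover, the old process can stop accepting, drain its in‑flight requests, and exit, which corresponds to the graceful shutdown step described above.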
Performance Data
During Double‑11, a machine with Service Mesh showed average Provider latency of 5.6 ms (vs 5.34 ms without Mesh, +0.26 ms) and Consumer latency of 10.36 ms (vs 9.31 ms, +1.05 ms). Across all core applications, Mesh added 0.52 ms on the Provider side and 1.63 ms on the Consumer side.
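For readers who want the single‑machine figures in relative terms, the arithmetic can be checked with a few lines. The helper name is hypothetical; the inputs are the latencies quoted above.

```python
def overhead(with_mesh_ms, without_mesh_ms):
    """Return (absolute overhead in ms, relative overhead in percent)."""
    delta = with_mesh_ms - without_mesh_ms
    percent = delta / without_mesh_ms * 100
    return round(delta, 2), round(percent, 1)


provider = overhead(5.6, 5.34)    # (0.26, 4.9)
consumer = overhead(10.36, 9.31)  # (1.05, 11.3)

if __name__ == "__main__":
    print("Provider:", provider)
    print("Consumer:", consumer)
```

On that machine the mesh therefore added roughly 5% to Provider latency and roughly 11% to Consumer latency.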
Envoy's CPU usage stayed at around 0.1 core for each core application, with occasional spikes when Pilot pushed data. Memory consumption varies with the number of services and the cluster size, which indicates significant room for optimization.
Outlook
Future focus includes:
Collaborating with the Istio community to enhance Pilot’s data‑push capabilities, integrating with Nacos via the MCP protocol.
Optimizing Istio and Envoy data structures to further reduce memory overhead.
Improving large‑scale sidecar operations: gray‑scale upgrades, monitoring, and rollback.
Realizing the full value of Service Mesh by enabling business and infrastructure to evolve independently.
Through open‑source contributions and large‑scale practice, Alibaba aims to advance cloud‑native technologies for broader adoption.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
