How Alibaba Scaled Service Mesh for Double‑11: Architecture, Challenges, and Performance
This article details Alibaba's large‑scale deployment of a Service Mesh for the Double‑11 shopping event, covering the underlying architecture, four key technical challenges, the solutions implemented with Envoy, Istio, and Sentinel, and the resulting performance impact on latency, CPU, and memory.
Background
Alibaba’s cloud‑native platform uses Service Mesh as a core component to support the massive traffic of the Double‑11 (Singles’ Day) e‑commerce applications. The goal was to validate a production‑grade Service Mesh deployment under the strict latency and reliability requirements of these core services.
Deployment Architecture
The mesh consists of three logical planes:
Data Plane: Envoy sidecars injected into each service instance.
Control Plane: Istio Pilot, deployed as an independent Kubernetes cluster rather than co‑located with the sidecars, representing the final intended architecture.
Operations Plane: A custom‑built management layer that handles sidecar lifecycle, configuration distribution, and integration with Alibaba‑specific components.
Technical Challenges and Solutions
1. Mesh‑ifying without upgrading the RPC SDK
The Java services were locked to a fixed RPC SDK version, and there was no time to develop a mesh‑compatible SDK. Istio normally redirects traffic with iptables NAT, which requires the nf_conntrack kernel module, and that module is disabled on Alibaba’s production kernels. Working with the OS team, the mesh team built a custom transparent interception component:
Uses the iptables mangle table to identify traffic by user ID and a custom MARK value.
Redirects marked packets to the Envoy sidecar, achieving zero‑code‑change mesh adoption.
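The article does not publish the rules used by the custom component. As a rough illustration of mark-based interception without the NAT table, the sketch below only builds command strings for one widely used generic pattern: mark in mangle OUTPUT, policy-route marked packets to loopback, then hand them to the proxy with TPROXY. TPROXY is a swapped-in technique here, not something the source confirms. The UID 1337 and port 15001 follow common Istio defaults, and the function name, mark value, and routing table number are assumptions made for illustration.

```python
from typing import List


def interception_rules(sidecar_uid: int = 1337, mark: int = 0x1,
                       envoy_port: int = 15001, table: int = 100) -> List[str]:
    """Build (but never execute) commands for mark-based interception.

    All defaults are illustrative assumptions, not values from the article.
    """
    m = hex(mark)
    return [
        # Mark outbound TCP from every user except the sidecar's own UID,
        # so traffic that Envoy itself sends upstream is not looped back.
        f"iptables -t mangle -A OUTPUT -p tcp -m owner ! --uid-owner {sidecar_uid} "
        f"-j MARK --set-mark {m}",
        # Policy routing: deliver marked packets locally via loopback.
        f"ip rule add fwmark {m} lookup {table}",
        f"ip route add local 0.0.0.0/0 dev lo table {table}",
        # Hand marked packets to the sidecar listener without using NAT.
        f"iptables -t mangle -A PREROUTING -p tcp -m mark --mark {m} "
        f"-j TPROXY --on-port {envoy_port} --tproxy-mark {m}",
    ]


if __name__ == "__main__":
    for rule in interception_rules():
        print(rule)
```

Excluding the sidecar's own user ID is what prevents a redirect loop: only application traffic is marked, while Envoy's upstream connections leave the host untouched.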
2. Supporting complex e‑commerce service governance
Alibaba’s internal Java RPC framework originally used Groovy scripts for routing, isolation, and other policies. To avoid retaining this tightly coupled mechanism, the team:
Extended Istio’s native CRDs (VirtualService and DestinationRule) with custom fields that express RPC‑specific routing criteria such as method name, request parameters, and application name.
Designed a WebAssembly (Wasm) based routing plugin that can replace Groovy scripts while preserving the required flexibility.
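To make the routing criteria concrete, here is a minimal sketch of first-match routing on application name, method name, and request parameters. It is not the extended CRD schema or the Wasm plugin itself; the class names, field names, and destination labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RpcRequest:
    app: str                      # calling application name
    method: str                   # RPC method name
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class RouteRule:
    destination: str              # hypothetical subset or cluster label
    app: Optional[str] = None     # None means "match any"
    method: Optional[str] = None
    params: Dict[str, str] = field(default_factory=dict)

    def matches(self, req: RpcRequest) -> bool:
        if self.app is not None and self.app != req.app:
            return False
        if self.method is not None and self.method != req.method:
            return False
        # Every parameter named in the rule must be present with the same value.
        return all(req.params.get(k) == v for k, v in self.params.items())


def route(rules: List[RouteRule], req: RpcRequest, default: str) -> str:
    """Return the destination of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(req):
            return rule.destination
    return default


if __name__ == "__main__":
    rules = [
        RouteRule("gray", method="queryOrder", params={"region": "hz"}),
        RouteRule("isolated", app="flash-sale"),
    ]
    print(route(rules, RpcRequest("cart", "queryOrder", {"region": "hz"}), "stable"))
```

Declarative rules of this shape cover the common cases; a scriptable plugin is only needed for policies that cannot be expressed as field matches.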
3. Reducing Envoy’s resource overhead
Envoy’s default fine‑grained statistics collection creates a large number of per‑IP counters, leading to high memory consumption (hundreds of thousands of IP‑level entries in large e‑commerce services). The team added a runtime switch to disable IP‑level stats, which cut memory usage by roughly 30 %. Future work includes adopting the community‑proposed stats symbol table to de‑duplicate metric strings and further lower memory pressure.
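The effect of such a switch on counter cardinality can be shown with a toy stats store. This is a simplified model and not Envoy's implementation: the class, the flag name, and the metric name format are hypothetical, and it counts entries rather than bytes, so it does not reproduce the 30 % figure.

```python
from collections import Counter


class StatsStore:
    """Toy metrics store with a switch for per-IP counters."""

    def __init__(self, per_ip_stats: bool = True) -> None:
        self.per_ip_stats = per_ip_stats
        self.counters: Counter = Counter()

    def record_request(self, cluster: str, upstream_ip: str) -> None:
        # The cluster-level counter is always kept.
        self.counters[f"cluster.{cluster}.rq_total"] += 1
        # The per-IP counter is created only when the switch is on.
        if self.per_ip_stats:
            self.counters[f"cluster.{cluster}.host.{upstream_ip}.rq_total"] += 1


def counter_entries(per_ip_stats: bool, host_count: int) -> int:
    """Number of distinct counters after one request to each host."""
    store = StatsStore(per_ip_stats)
    for i in range(host_count):
        store.record_request("orders", f"10.0.{i // 256}.{i % 256}")
    return len(store.counters)


if __name__ == "__main__":
    print(counter_entries(True, 1000), counter_entries(False, 1000))
```

With the switch on, the number of entries grows linearly with the number of upstream hosts; with it off, it stays constant per cluster.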
4. Decoupling business logic from infrastructure upgrades
To enable hot upgrades of sidecars without traffic disruption, a dual‑process scheme was implemented:
Launch a new sidecar instance.
Exchange runtime state (e.g., connection pools, in‑flight requests) with the old sidecar via a Unix Domain Socket.
When the new sidecar signals readiness, it takes over inbound traffic while the old sidecar gracefully drains and exits.
This approach ensures that infrastructure upgrades are invisible to the business layer.
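One building block commonly used for this kind of handover is passing an open listening socket between processes over a Unix domain socket. The sketch below shows only that piece, inside a single process for brevity; it does not cover connection pools or in-flight requests, and the source does not confirm that file descriptor passing is how state was exchanged. It assumes Python 3.9+ on a Unix-like system, and the function name is hypothetical.

```python
import socket


def hand_over_listener(listener: socket.socket) -> socket.socket:
    """Pass a listening socket's file descriptor across a Unix domain socket.

    In a real upgrade the two ends would belong to the old and new sidecar
    processes; a socketpair stands in for that channel here.
    """
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with old_end, new_end:
        # "Old sidecar" side: send the descriptor as ancillary data.
        socket.send_fds(old_end, [b"listener"], [listener.fileno()])
        # "New sidecar" side: receive a duplicate of the descriptor.
        _msg, fds, _flags, _addr = socket.recv_fds(new_end, 1024, 1)
    # Wrap the received descriptor; family and type are detected from it.
    return socket.socket(fileno=fds[0])


if __name__ == "__main__":
    old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    old.bind(("127.0.0.1", 0))
    old.listen()
    port = old.getsockname()[1]

    new = hand_over_listener(old)
    old.close()  # the old owner exits; the port stays open via the new one

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = new.accept()
    client.sendall(b"ping")
    print(conn.recv(4))
```

Because the new owner holds a duplicate of the same kernel socket, the listening port never closes during the switch, which is what keeps the upgrade invisible to clients.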
Performance Data
Latency measurements on a representative core service showed:
Provider side latency increased from 5.34 ms to 5.60 ms (Δ 0.26 ms).
Consumer side latency increased from 9.31 ms to 10.36 ms (Δ 1.05 ms).
Aggregated across all Double‑11 services, the average latency rise was 0.52 ms for providers and 1.63 ms for consumers. CPU usage remained around 0.1 core per machine. Memory consumption varied with service size; disabling IP‑level stats saved ~30 % memory, indicating further optimization potential.
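For the representative service, the absolute deltas can also be read as relative overhead. The helper below is simple arithmetic on the figures quoted above; the function name is just for illustration.

```python
from typing import Tuple


def overhead(before_ms: float, after_ms: float) -> Tuple[float, float]:
    """Return (absolute delta in ms, relative increase in percent)."""
    delta = after_ms - before_ms
    return round(delta, 2), round(delta / before_ms * 100, 1)


if __name__ == "__main__":
    print("provider:", overhead(5.34, 5.60))
    print("consumer:", overhead(9.31, 10.36))
```

This works out to about 4.9 % added latency on the provider side and about 11.3 % on the consumer side for that service.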
Outlook
Future work focuses on:
Collaborating with the Istio community to enhance Pilot’s data‑push capabilities, including integration with Nacos via the Mesh Configuration Protocol (MCP).
Adopting the Envoy stats symbol table to further reduce memory overhead.
Refining Istio/Envoy data structures for more compact representation.
Improving sidecar operational tooling to support gray‑scale upgrades, monitoring, and rollback.
These efforts aim to solidify Service Mesh as a production‑ready, independently evolvable layer that decouples business logic from infrastructure changes.
Alibaba Cloud Native
