
How Alibaba Scaled Service Mesh for Double‑11: Architecture, Challenges, and Performance

This article details Alibaba's large‑scale deployment of a Service Mesh for the Double‑11 shopping event, covering the underlying architecture, four key technical challenges, the solutions implemented with Envoy, Istio, and Sentinel, and the resulting performance impact on latency, CPU, and memory.

Alibaba Cloud Native

Background

Alibaba’s cloud‑native platform uses Service Mesh as a core component to support the massive traffic of the Double‑11 (Singles’ Day) e‑commerce applications. The goal was to validate a production‑grade Service Mesh deployment under the strict latency and reliability requirements of these core services.

Deployment Architecture

The mesh consists of three logical planes:

Data Plane: Envoy sidecars injected into each service instance.

Control Plane: Istio Pilot, deployed as an independent Kubernetes cluster rather than co‑located with the sidecars. This is the intended final architecture.

Operations Plane: A custom‑built management layer that handles sidecar lifecycle, configuration distribution, and integration with Alibaba‑specific components.

Deployment diagram

Technical Challenges and Solutions

1. Mesh‑ifying without upgrading the RPC SDK

The Java services were locked to a fixed RPC SDK version, and there was no time to develop a mesh‑compatible SDK. Istio normally redirects traffic with iptables NAT rules, which depend on the nf_conntrack kernel module; that module is disabled on Alibaba’s production kernels. Working with the OS team, the team built a custom transparent interception component:

Uses the mangle table to mark packets with a user‑ID and a custom MARK value.

Redirects marked packets to the Envoy sidecar, achieving zero‑code‑change mesh adoption.
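The two steps above can be pictured with a small user‑space model. This is only a sketch of the decision logic, not kernel code and not Alibaba's component: the UID, MARK value, and port are hypothetical, and it assumes the user ID serves to tell the sidecar's own traffic apart from application traffic so that the sidecar's outbound packets are not intercepted again.

```python
from dataclasses import dataclass
from typing import Optional

# All three constants are hypothetical values chosen for illustration.
SIDECAR_UID = 1337     # UID the sidecar process is assumed to run as
INTERCEPT_MARK = 0x1   # MARK value used to tag application traffic
SIDECAR_PORT = 15001   # port the sidecar is assumed to listen on


@dataclass
class Packet:
    owner_uid: int               # UID of the process that sent the packet
    dst_port: int                # original destination port
    mark: Optional[int] = None   # packet mark, unset until classified


def mark_stage(pkt: Packet) -> Packet:
    """Mangle-style step: tag application traffic, leave sidecar traffic alone."""
    if pkt.owner_uid != SIDECAR_UID:
        pkt.mark = INTERCEPT_MARK
    return pkt


def route_stage(pkt: Packet) -> int:
    """Return the port the packet is delivered to after interception."""
    if pkt.mark == INTERCEPT_MARK:
        return SIDECAR_PORT   # marked packets are steered to the sidecar
    return pkt.dst_port       # unmarked packets keep their destination


app_pkt = mark_stage(Packet(owner_uid=1000, dst_port=8080))
sidecar_pkt = mark_stage(Packet(owner_uid=SIDECAR_UID, dst_port=8080))
print(route_stage(app_pkt), route_stage(sidecar_pkt))
```

In this model the application needs no change at all: the classification depends only on which process owns the packet, which is what makes the adoption zero‑code‑change.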

Transparent interception flow

2. Supporting complex e‑commerce service governance

Alibaba’s internal Java RPC framework originally used Groovy scripts for routing, isolation, and other policies. To avoid retaining this tightly coupled mechanism, the team:

Extended Istio’s native CRDs (VirtualService and DestinationRule) with custom fields that express RPC‑specific routing criteria such as method name, request parameters, and application name.

Designed a WebAssembly (Wasm) based routing plugin that can replace Groovy scripts while preserving the required flexibility.
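To make the RPC‑specific criteria concrete, the following sketch matches a request on application name, method name, and request parameters and returns a destination subset. The class names, field names, and the first‑match‑wins policy are assumptions made for illustration; they are not the actual fields of the extended CRDs or the logic of the Wasm plugin.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RpcRequest:
    app_name: str                                   # calling application
    method: str                                     # RPC method being invoked
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class RouteRule:
    subset: str                                     # destination subset to route to
    app_name: Optional[str] = None                  # None means "match any"
    method: Optional[str] = None                    # None means "match any"
    params: Dict[str, str] = field(default_factory=dict)

    def matches(self, req: RpcRequest) -> bool:
        if self.app_name is not None and self.app_name != req.app_name:
            return False
        if self.method is not None and self.method != req.method:
            return False
        # Every parameter named in the rule must be present with the same value.
        return all(req.params.get(k) == v for k, v in self.params.items())


def pick_subset(rules: List[RouteRule], req: RpcRequest, default: str = "default") -> str:
    """Return the subset of the first matching rule, or the default subset."""
    for rule in rules:
        if rule.matches(req):
            return rule.subset
    return default


rules = [
    RouteRule(subset="gray", method="createOrder", params={"region": "hz"}),
    RouteRule(subset="isolated", app_name="promo-center"),
]
print(pick_subset(rules, RpcRequest("cart", "createOrder", {"region": "hz"})))
```

Expressing the criteria as declarative data rather than scripts is what lets the control plane validate and distribute them like any other routing configuration.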

Extended CRDs for RPC routing

3. Reducing Envoy’s resource overhead

Envoy’s default fine‑grained statistics collection creates a large number of per‑IP counters, leading to high memory consumption (hundreds of thousands of IP‑level entries in large e‑commerce services). The team added a runtime switch to disable IP‑level stats, which cut memory usage by roughly 30 %. Future work includes adopting the community‑proposed stats symbol table to de‑duplicate metric strings and further lower memory pressure.
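The de‑duplication idea behind a stats symbol table can be sketched in a few lines: split each metric name into tokens, store each distinct token once, and represent the name as a sequence of small integers. This is a simplified illustration and not Envoy's implementation; the metric names and the dot‑separated layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class SymbolTable:
    """Stores each distinct name token once and encodes names as integer ids."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._tokens: List[str] = []

    def encode(self, name: str) -> Tuple[int, ...]:
        encoded = []
        for token in name.split("."):
            if token not in self._ids:
                # First time this token is seen: assign the next free id.
                self._ids[token] = len(self._tokens)
                self._tokens.append(token)
            encoded.append(self._ids[token])
        return tuple(encoded)

    def decode(self, encoded: Iterable[int]) -> str:
        return ".".join(self._tokens[i] for i in encoded)

    def unique_tokens(self) -> int:
        return len(self._tokens)


table = SymbolTable()
hosts = ["10_0_0_1", "10_0_0_2", "10_0_0_3"]      # hypothetical per-IP labels
counters = ["rq_total", "rq_error"]               # hypothetical counter names
names = [f"cluster.orders.host.{h}.{c}" for h in hosts for c in counters]
encoded = [table.encode(n) for n in names]

# Six full names share three prefix tokens, three hosts, and two counters.
print(len(names), table.unique_tokens())
```

The saving grows with the number of per‑IP entries, because the shared prefix and counter tokens are stored once no matter how many hosts exist.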

4. Decoupling business logic from infrastructure upgrades

To enable hot upgrades of sidecars without traffic disruption, a dual‑process scheme was implemented:

Launch a new sidecar instance.

Exchange runtime state (e.g., connection pools, in‑flight requests) with the old sidecar via a Unix Domain Socket.

When the new sidecar signals readiness, it takes over inbound traffic while the old sidecar gracefully drains and exits.

This approach ensures that infrastructure upgrades are invisible to the business layer.
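The handover sequence can be sketched with a connected pair of Unix domain sockets. This single‑process model only illustrates the message ordering; the message names and the JSON state format are assumptions, and a real sidecar would run the two ends in separate processes and transfer far richer state.

```python
import json
import socket


def hot_upgrade(old_state: dict) -> dict:
    """Simulate the three handover steps between an old and a new sidecar."""
    # A connected datagram pair stands in for the Unix Domain Socket that the
    # two sidecar processes would share; datagrams keep message boundaries.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Step 1: the new sidecar has started and asks for the runtime state.
        new_end.send(b"REQUEST_STATE")

        # Step 2: the old sidecar answers with its state.
        if old_end.recv(65536) != b"REQUEST_STATE":
            raise RuntimeError("unexpected handover message")
        old_end.send(json.dumps(old_state).encode("utf-8"))
        new_state = json.loads(new_end.recv(65536).decode("utf-8"))

        # Step 3: the new sidecar signals readiness; the old one starts draining.
        new_end.send(b"READY")
        old_draining = old_end.recv(65536) == b"READY"

        return {"new_state": new_state, "old_draining": old_draining}
    finally:
        old_end.close()
        new_end.close()


result = hot_upgrade({"in_flight": 3, "pools": ["orders", "inventory"]})
print(result)
```

The old sidecar begins draining only after it sees the readiness message, so at every point in the sequence one of the two instances is able to serve traffic.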

Dual‑process sidecar upgrade

Performance Data

Latency measurements on a representative core service showed:

Provider‑side latency increased from 5.34 ms to 5.60 ms (Δ 0.26 ms).

Consumer‑side latency increased from 9.31 ms to 10.36 ms (Δ 1.05 ms).

Aggregated across all Double‑11 services, the average latency rise was 0.52 ms for providers and 1.63 ms for consumers. CPU usage remained around 0.1 core per machine. Memory consumption varied with service size; disabling IP‑level stats saved ~30 % memory, indicating further optimization potential.
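For a sense of scale, the representative‑service figures can be turned into relative overhead. The calculation below uses only the before and after latencies quoted above; the aggregated averages are left out because the article gives no baseline for them.

```python
def overhead(before_ms: float, after_ms: float):
    """Return (absolute delta in ms, relative increase in percent)."""
    delta = round(after_ms - before_ms, 2)
    percent = round(100 * delta / before_ms, 1)
    return delta, percent


provider = overhead(5.34, 5.60)
consumer = overhead(9.31, 10.36)

print(f"provider: +{provider[0]} ms ({provider[1]}%)")   # provider: +0.26 ms (4.9%)
print(f"consumer: +{consumer[0]} ms ({consumer[1]}%)")   # consumer: +1.05 ms (11.3%)
```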

Latency comparison chart
CPU & memory usage

Outlook

Future work focuses on:

Collaborating with the Istio community to enhance Pilot’s data‑push capabilities, including integration with Nacos via the MCP protocol.

Adopting the Envoy stats symbol table to further reduce memory overhead.

Refining Istio/Envoy data structures for more compact representation.

Improving sidecar operational tooling to support gray‑scale upgrades, monitoring, and rollback.

These efforts aim to solidify Service Mesh as a production‑ready, independently evolvable layer that decouples business logic from infrastructure changes.

