Service Mesh Capability Building: Link Encryption, Adaptive Rate Limiting, Fine‑Grained Traffic Steering, and Service Self‑Healing
This article details Ant Group's large‑scale Service Mesh rollout. It explains the design, implementation, and operational impact of four core capabilities (link encryption, adaptive rate limiting, fine‑grained traffic steering, and service self‑healing), and highlights performance considerations, deployment challenges, and the overall value of decoupling business logic from infrastructure.
Introduction
Building on the previously published "Ant Service Mesh Large‑Scale Practice and Outlook," this article shares the R&D team's insights on how Service Mesh has transformed infrastructure capabilities at Ant Group, focusing on four key capabilities that have been deployed at scale.
1. Link Encryption
To achieve 100% encrypted communication across the organization, Ant Group introduced a link‑encryption capability that must be transparent to business traffic, support gray‑scale rollout, and keep performance impact within acceptable limits.
The design challenges are threefold: reducing operational complexity for large‑scale deployments, enabling hot switching between plaintext and encrypted connections without request loss, and keeping the encryption overhead low.
The architecture uses a unified control plane to push configuration to services via xDS. MOSN retrieves certificates and private keys from a centralized secret service through SDS (Secret Discovery Service), registers its encryption capability with the service registry, and thereby triggers clients to switch to encrypted communication.
The hot‑switch mechanism relies on a connection‑elimination process: when the encryption state changes, MOSN creates new long‑lived TLS connections for new requests while gracefully draining the old connections, so that no in‑flight requests are dropped.
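The drain‑and‑replace idea described above can be sketched in Go as follows. This is a minimal illustration, not MOSN's implementation: the Conn and Pool types, their methods, and the use of a WaitGroup to track in‑flight requests are all assumptions made for the example, and no real sockets or TLS handshakes are involved.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical stand-in for one long-lived upstream connection.
type Conn struct {
	Encrypted bool
	Closed    bool           // set once drained; read it only after the drain channel closes
	inflight  sync.WaitGroup // requests currently using this connection
}

// Release marks one request on this connection as finished.
func (c *Conn) Release() { c.inflight.Done() }

// Pool hands out the connection that new requests should use.
type Pool struct {
	mu      sync.Mutex
	current *Conn
}

func NewPool(encrypted bool) *Pool {
	return &Pool{current: &Conn{Encrypted: encrypted}}
}

// Acquire pins the current connection for the duration of one request.
func (p *Pool) Acquire() *Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.current.inflight.Add(1)
	return p.current
}

// Switch installs a fresh connection for new requests and drains the old one
// in the background. The returned channel closes once the old connection has
// no in-flight requests left and has been marked closed.
func (p *Pool) Switch(encrypted bool) <-chan struct{} {
	p.mu.Lock()
	old := p.current
	p.current = &Conn{Encrypted: encrypted}
	p.mu.Unlock()

	drained := make(chan struct{})
	go func() {
		old.inflight.Wait() // in-flight requests finish on the old connection
		old.Closed = true   // a real proxy would close the socket here
		close(drained)
	}()
	return drained
}

func main() {
	pool := NewPool(false)
	inFlight := pool.Acquire() // request started on the plaintext connection

	drained := pool.Switch(true) // encryption is turned on
	next := pool.Acquire()       // new requests already use the encrypted connection
	fmt.Println("old encrypted:", inFlight.Encrypted, "new encrypted:", next.Encrypted)

	inFlight.Release() // the old request completes normally
	next.Release()
	<-drained
	fmt.Println("old connection closed:", inFlight.Closed)
}
```

The swap happens under the pool lock, so every request is counted on exactly one connection, and the old connection is closed only after its counter reaches zero.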
Performance tests show negligible impact on long‑lived connections. Memory usage did increase because of an issue in Go's TLS library; the issue was later fixed and the fix was contributed upstream.
2. Adaptive Rate Limiting
Adaptive rate limiting is a core Mesh traffic‑management feature that automatically adjusts flow control based on real‑time system resource usage, protecting services from overload without manual configuration.
The mechanism consists of four steps: (1) per‑second detection of system resource usage, (2) baseline calculation by aggregating per‑interface statistics, (3) a baseline regulator that adjusts limits in proportion to resource watermarks, and (4) decision logic that enforces the computed limits.
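The four steps above can be sketched as a small Go type. This is only an illustration under stated assumptions: the article does not give the actual formulas, so the smoothed baseline, the rule "limit = baseline × watermark / usage", and every name here (Limiter, Watermark, Alpha, Tick, Allow) are hypothetical. CPU utilisation is passed in by the caller rather than sampled from the operating system.

```go
package main

import "fmt"

// Limiter is an illustrative adaptive limiter; all names are hypothetical.
type Limiter struct {
	Watermark float64            // CPU utilisation above which limits apply (assumed)
	Alpha     float64            // smoothing factor for the baseline (assumed)
	baseline  map[string]float64 // per-interface requests per second under healthy load
	limit     map[string]float64 // per-interface cap; no entry means unlimited
	count     map[string]int     // requests admitted in the current second
}

func NewLimiter(watermark, alpha float64) *Limiter {
	return &Limiter{
		Watermark: watermark,
		Alpha:     alpha,
		baseline:  map[string]float64{},
		limit:     map[string]float64{},
		count:     map[string]int{},
	}
}

// Allow is step 4: admit the request unless its interface has reached its cap.
func (l *Limiter) Allow(iface string) bool {
	if lim, ok := l.limit[iface]; ok && float64(l.count[iface]) >= lim {
		return false
	}
	l.count[iface]++
	return true
}

// Tick runs once per second with the CPU utilisation sampled in step 1.
func (l *Limiter) Tick(cpu float64) {
	overloaded := cpu > l.Watermark
	for iface, n := range l.count {
		if !overloaded {
			// Step 2: fold traffic from a healthy second into the baseline.
			if b, ok := l.baseline[iface]; ok {
				l.baseline[iface] = (1-l.Alpha)*b + l.Alpha*float64(n)
			} else {
				l.baseline[iface] = float64(n)
			}
		}
		l.count[iface] = 0
	}
	for iface, b := range l.baseline {
		if overloaded {
			// Step 3: scale the baseline down by how far usage exceeds the watermark.
			l.limit[iface] = b * l.Watermark / cpu
		} else {
			delete(l.limit, iface)
		}
	}
}

func main() {
	l := NewLimiter(0.5, 0.5)
	send := func(n int) (admitted int) {
		for i := 0; i < n; i++ {
			if l.Allow("pay") {
				admitted++
			}
		}
		return admitted
	}
	send(100)
	l.Tick(0.3) // healthy second: baseline learned, no cap
	send(100)
	l.Tick(1.0) // overloaded second: cap derived from the baseline
	fmt.Println("admitted under overload:", send(100))
}
```

With these numbers the baseline is 100 requests per second and the cap during overload is 100 × 0.5 / 1.0 = 50. The sketch is single‑threaded; a real proxy would need synchronisation around the counters.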
This capability has been deployed across the entire platform, successfully preventing multiple incidents during high‑traffic events such as the Spring Festival promotion.
3. Fine‑Grained Traffic Steering
Fine‑grained steering exposes atomic traffic‑routing abilities to the control plane, enabling use cases such as gray release, disaster recovery, data‑center migration, and capacity testing. Single‑application steering can redirect traffic between deployment units, and the system also supports interface‑level steering and many‑to‑many routing patterns.
Examples include routing high‑priority transfer flows to a dedicated group of instances, isolating critical business paths from noisy traffic, and diverting online traffic for performance testing.
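A routing decision of this kind can be sketched as a first‑match rule table. The sketch below is an assumption‑laden illustration, not the control‑plane schema used at Ant Group: the Rule and Request fields, the percentage‑by‑hash split, and the application, interface, and unit names are all hypothetical.

```go
package main

import "fmt"

// Rule is a hypothetical steering rule pushed by the control plane.
type Rule struct {
	App       string // application the rule applies to
	Interface string // empty means every interface of the application
	Target    string // destination deployment unit or instance group
	Percent   uint32 // share of matching traffic to steer, 0 to 100
}

// Request carries the attributes a rule can match on.
type Request struct {
	App       string
	Interface string
	Hash      uint32 // stable hash, e.g. of a user ID, so one caller is steered consistently
}

// Route returns the target of the first rule that matches, or fallback.
// Interface-level rules should be listed before application-level ones.
func Route(rules []Rule, req Request, fallback string) string {
	for _, r := range rules {
		if r.App != req.App {
			continue
		}
		if r.Interface != "" && r.Interface != req.Interface {
			continue
		}
		if req.Hash%100 < r.Percent {
			return r.Target
		}
	}
	return fallback
}

func main() {
	rules := []Rule{
		// Interface-level: all payment calls go to a dedicated instance group.
		{App: "transfer", Interface: "TransferService.pay", Target: "vip-group", Percent: 100},
		// Application-level: 10% of the remaining traffic moves to another unit.
		{App: "transfer", Target: "unit-b", Percent: 10},
	}
	reqs := []Request{
		{App: "transfer", Interface: "TransferService.pay", Hash: 57},
		{App: "transfer", Interface: "TransferService.query", Hash: 5},
		{App: "transfer", Interface: "TransferService.query", Hash: 42},
	}
	for _, req := range reqs {
		fmt.Println(req.Interface, "->", Route(rules, req, "unit-a"))
	}
}
```

Keeping each rule atomic (one match, one target, one percentage) is what lets higher‑level scenarios such as gray release or capacity testing be composed from the same primitive.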
4. Service Self‑Healing
Traditional self‑healing relies on external probes, which suffer from detection latency and limited accuracy. MOSN instead implements an internal exception counter that tracks abnormal nodes, temporarily blacklists them, and reports them to a central self‑healing service for further action such as a restart or taking the node offline.
The approach enables detection and remediation within seconds, reducing downtime and improving overall service reliability.
Value of Service Mesh
These capabilities illustrate how Service Mesh decouples business logic from infrastructure, accelerating feature rollout, improving security, performance, and stability, and reducing the operational burden of large‑scale system upgrades.
Future work will explore remaining challenges such as resource utilization and performance overhead.
Conclusion
Over the past year, Ant Group’s Service Mesh implementation has delivered significant improvements in security, efficiency, and reliability, demonstrating the transformative power of decoupling services from underlying infrastructure.
AntTech
Technology is the core driver of Ant's future creation.