Large-Scale Service Mesh Deployment at Ant Group: Practices, Challenges, and Future Outlook
This article describes Ant Group's two‑year journey of adopting Service Mesh at massive scale. It explains why Service Mesh is needed for microservice governance, unified control of heterogeneous systems, and financial‑grade security, then covers the architecture, migration strategy, stability mechanisms, operational results, and the road toward a full mesh and a serverless era.
Cloud‑native concepts are booming, yet only a few companies have adopted them at large scale. Ant Group, an early adopter in China, spent over two years exploring Service Mesh and successfully passed the Double‑11 traffic test.
Why Service Mesh? It addresses three core needs: (1) decoupling microservice governance from business logic, (2) providing unified control over heterogeneous systems written in Java, Node.js, Go, Python, C++, and other languages, and (3) delivering financial‑grade network security.
Before Service Mesh, traditional microservice governance relied on SDKs embedded in applications, leading to high upgrade costs, severe version fragmentation, and difficulty evolving middleware because each upgrade required code changes and coordinated releases.
Service Mesh solves these problems by extracting most governance capabilities into an independent sidecar process. Applications focus solely on business logic while middleware teams evolve infrastructure capabilities transparently, enabling independent evolution and faster iteration.
For heterogeneous environments, a single lightweight SDK (or even no SDK) can interact with the sidecar, eliminating the need to maintain multiple language‑specific SDKs and simplifying multi‑protocol traffic control and monitoring.
Financial‑grade security is achieved through identity authentication, access control, and data encryption, allowing services to operate in a zero‑trust network.
Ant's practice began in early 2018. Business teams were concerned about code changes, impact on stability, and migration cost. Ant solved this by enhancing the SOFA SDK to auto‑detect whether Service Mesh is enabled and automatically connect to the sidecar, requiring only an SDK upgrade without any application code modifications.
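The auto‑detection idea can be illustrated with a minimal sketch. This is not the SOFA SDK's actual logic: the `MESH_ENABLED` variable, the ports, and the registry host name below are hypothetical, chosen only to show how an SDK might switch its target from the central registry to a local sidecar without any change in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    via_sidecar: bool


def resolve_endpoint(env,
                     sidecar_port=13330,                         # hypothetical local sidecar port
                     registry_host="registry.example.internal",  # hypothetical registry address
                     registry_port=9600):
    """Decide where the SDK should connect, based on the process environment.

    `env` is any mapping of environment variables (for example os.environ).
    MESH_ENABLED is an assumed flag name, not a documented SOFA setting.
    """
    enabled = env.get("MESH_ENABLED", "").strip().lower() in ("1", "true", "yes")
    if enabled:
        # Mesh mode: talk to the sidecar on the loopback interface.
        return Endpoint("127.0.0.1", sidecar_port, via_sidecar=True)
    # Classic mode: talk to the central registry directly, as before.
    return Endpoint(registry_host, registry_port, via_sidecar=False)
```

Because the decision is made inside the SDK at startup, the application calls the same SDK entry points in both modes; only the SDK version and the deployment environment change.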
The service registration and communication flow works as follows. The service registers its IP and port with its local sidecar, and the sidecar in turn registers with the central registry using a sidecar‑specific port. On the calling side, the caller queries its own sidecar for service addresses, and that sidecar receives the real service endpoints from the registry. Traffic is then proxied through the sidecars to the actual service instances.
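The flow can be traced with a small in‑memory model. This is a sketch under stated assumptions, not Ant's implementation: the class names, port numbers, and the `call_path` helper are hypothetical, and the model assumes that the address the registry hands back to the caller's sidecar is the provider sidecar's address, which then forwards to the local application.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Central registry: service name -> list of (host, port) addresses."""
    entries: dict = field(default_factory=dict)

    def publish(self, service, address):
        self.entries.setdefault(service, []).append(address)

    def lookup(self, service):
        return list(self.entries.get(service, []))


@dataclass
class Sidecar:
    host: str
    registry: Registry
    inbound_port: int = 12200    # hypothetical sidecar-specific port
    outbound_port: int = 12220   # hypothetical port local callers connect to
    local_apps: dict = field(default_factory=dict)
    routes: dict = field(default_factory=dict)

    def register(self, service, app_port):
        # The application registers its own IP and port with the sidecar...
        self.local_apps[service] = (self.host, app_port)
        # ...and the sidecar registers itself with the central registry.
        self.registry.publish(service, (self.host, self.inbound_port))

    def subscribe(self, service):
        # The sidecar pulls the endpoints from the registry, while the
        # caller is only told to send traffic to its local sidecar.
        self.routes[service] = self.registry.lookup(service)
        return ("127.0.0.1", self.outbound_port)

    def route(self, service):
        endpoints = self.routes.get(service)
        if not endpoints:
            raise LookupError(f"no endpoint known for {service}")
        return endpoints[0]  # a real proxy would load-balance here

    def deliver(self, service):
        return self.local_apps[service]


def call_path(caller, sidecars_by_address, service):
    """Return the hops of one request: local sidecar, remote sidecar, app."""
    local_hop = caller.subscribe(service)
    remote_hop = caller.route(service)
    app_hop = sidecars_by_address[remote_hop].deliver(service)
    return [local_hop, remote_hop, app_hop]
```

In this model the caller never sees a remote address, which is what lets the sidecar take over routing, load balancing, and protocol handling on its behalf.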
Smooth migration is enabled by sidecar injection on either the caller or provider side. Either side can be migrated first, with sidecars handling registration and subscription transparently, allowing gray‑scale roll‑out and instant rollback if issues arise.
To ensure stability at Ant’s massive scale, an unattended change framework was introduced, modeled after autonomous driving levels L0‑L5. Ant has reached L3, providing automated batch orchestration, mandatory gray‑scale, pre‑change validation (e.g., peak‑time checks), and post‑change verification (e.g., monitoring, error‑rate checks), with the ability to halt or roll back changes based on results.
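The control loop behind such a framework can be sketched in a few lines. This is an illustrative model only: the function name, its callbacks, and the rule that at least two batches are required (a small gray batch first) are assumptions made for the example, not details of Ant's framework.

```python
def run_change(batches, apply, rollback, pre_check, post_check):
    """Apply a change batch by batch, halting and rolling back on failure.

    batches    : list of lists of targets; the first batch is the gray batch
    apply      : callable(target) that performs the change
    rollback   : callable(target) that reverts the change
    pre_check  : callable() -> bool, e.g. "we are outside peak hours"
    post_check : callable(batch) -> bool, e.g. "error rate is still normal"

    Returns (status, touched_targets).
    """
    if len(batches) < 2:
        # Mandatory gray-scale: never change everything in a single step.
        raise ValueError("at least two batches are required")
    if not pre_check():
        return ("blocked", [])

    touched = []
    for batch in batches:
        for target in batch:
            apply(target)
            touched.append(target)
        if not post_check(batch):
            # Verification failed: revert everything changed so far,
            # most recent first, and stop the rollout.
            for target in reversed(touched):
                rollback(target)
            return ("rolled_back", touched)
    return ("completed", touched)
```

Keeping the checks as injected callbacks means the same loop can serve different kinds of change, with monitoring and error‑rate verification supplied by the platform rather than by each operator.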
The overall architecture runs traditional SDK‑based microservices and Service Mesh side by side (dual‑mode). Pilot handles configuration distribution and Mosn serves as the data plane, supporting SOFA, Dubbo, and Spring Cloud. The architecture supports both container/K8s and virtual machine deployments.
At present, Service Mesh covers thousands of Ant applications and tens of thousands of pods, and handles tens of millions of QPS during peak events. Infrastructure upgrade frequency has risen from 1‑2 times per year to 1‑2 times per month, saving thousands of person‑days annually. The mesh has also improved security and enabled fine‑grained traffic control, adaptive rate limiting, and service isolation.
Future outlook includes extending more infrastructure capabilities (transactions, caching, configuration, scheduling) into Mosn, moving towards a pure “Micrologic + Sidecar” model, and eventually integrating ordinary business services into a serverless paradigm, thereby achieving true decoupling of business logic from infrastructure.
