Large-Scale Service Mesh Deployment at Ant Group: Practices, Challenges, and Future Outlook

This article details Ant Group's two-year journey of adopting Service Mesh at massive scale. It explains why Service Mesh is needed for microservice governance, heterogeneous system unification, and financial-grade security, then describes the architecture, migration strategies, stability mechanisms, operational results, and future directions toward a full mesh and a serverless era.

AntTech

Jan 14, 2021

Cloud-native concepts are booming, yet only a few companies have adopted them at large scale. Ant Group, an early adopter in China, spent over two years exploring Service Mesh, and its deployment passed the test of Double 11 peak traffic.

Why Service Mesh? It addresses three core needs: (1) decoupling microservice governance from business logic, (2) providing unified control over heterogeneous systems written in Java, NodeJS, Go, Python, C++, etc., and (3) delivering financial‑grade network security.

Before Service Mesh, traditional microservice governance relied on SDKs embedded in applications, leading to high upgrade costs, severe version fragmentation, and difficulty evolving middleware because each upgrade required code changes and coordinated releases.

Service Mesh solves these problems by extracting most governance capabilities into an independent sidecar process. Applications focus solely on business logic while middleware teams evolve infrastructure capabilities transparently, enabling independent evolution and faster iteration.

For heterogeneous environments, a single lightweight SDK (or even no SDK) can interact with the sidecar, eliminating the need to maintain multiple language‑specific SDKs and simplifying multi‑protocol traffic control and monitoring.

Financial‑grade security is achieved through identity authentication, access control, and data encryption, allowing services to operate in a zero‑trust network.

Ant's practice began in early 2018. Business teams were concerned about code changes, impact on stability, and migration cost. Ant solved this by enhancing the SOFA SDK to auto‑detect whether Service Mesh is enabled and automatically connect to the sidecar, requiring only an SDK upgrade without any application code modifications.
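The article does not say how the SOFA SDK detects the mesh, so the following is only a minimal sketch of the idea, assuming detection through an environment flag. The variable names `MESH_ENABLED` and `SIDECAR_ADDR`, the default sidecar address, and the `select_transport` helper are all hypothetical and not part of any real SOFA API.

```python
import os
from dataclasses import dataclass

# Hypothetical names: the article does not describe the real detection mechanism.
MESH_ENABLED_ENV = "MESH_ENABLED"
SIDECAR_ADDR_ENV = "SIDECAR_ADDR"
DEFAULT_SIDECAR_ADDR = "127.0.0.1:12200"  # placeholder local sidecar address


@dataclass(frozen=True)
class Transport:
    mode: str    # "mesh" or "direct"
    target: str  # address the SDK sends its traffic to


def select_transport(env, registry_addr):
    """Use the local sidecar when the mesh flag is set, else go direct."""
    flag = env.get(MESH_ENABLED_ENV, "").strip().lower()
    if flag in ("1", "true", "on"):
        return Transport("mesh", env.get(SIDECAR_ADDR_ENV, DEFAULT_SIDECAR_ADDR))
    return Transport("direct", registry_addr)


if __name__ == "__main__":
    # The application code is identical in both modes; only the environment differs.
    print(select_transport(os.environ, "registry.example.internal:9600"))
```

Because the decision lives inside the SDK, the same application binary can run with or without a sidecar, which is what lets business teams adopt the mesh through an SDK upgrade alone.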

The service registration and communication flow works in five steps: (1) the service registers its IP and port with its local sidecar; (2) the sidecar registers the service with the central registry, advertising a sidecar-specific port; (3) a caller asks its own sidecar for the addresses of a service; (4) the caller's sidecar obtains the real service endpoints from the registry; and (5) traffic is proxied through the sidecars to the actual service instances.
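The flow above can be modeled with a small in-memory simulation. This is a sketch under assumptions, not Ant's implementation: the `Registry` and `Sidecar` classes, the port numbers, and the service name are invented for illustration, and "real service endpoints" is read here as the addresses that providers advertised to the registry.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Central registry: service name -> advertised addresses."""
    entries: dict = field(default_factory=dict)

    def register(self, service, addr):
        self.entries.setdefault(service, []).append(addr)

    def lookup(self, service):
        return list(self.entries.get(service, []))


@dataclass
class Sidecar:
    host: str
    registry: Registry
    inbound_port: int = 12200  # hypothetical sidecar-specific port
    local_apps: dict = field(default_factory=dict)  # service -> local app port

    def publish(self, service, app_port):
        # The app registers its own port with the local sidecar ...
        self.local_apps[service] = app_port
        # ... and the sidecar advertises its own port to the registry.
        self.registry.register(service, f"{self.host}:{self.inbound_port}")

    def resolve(self, service):
        # The caller asks its sidecar, which asks the registry.
        return self.registry.lookup(service)

    def deliver(self, service):
        # Provider side: forward inbound traffic to the local app instance.
        return f"{self.host}:{self.local_apps[service]}"


def call(service, caller, mesh):
    """Return the hops a request takes from the caller's sidecar to the app."""
    advertised = caller.resolve(service)[0]  # first endpoint, no load balancing
    provider = mesh[advertised]
    return [f"{caller.host}:sidecar", advertised, provider.deliver(service)]


if __name__ == "__main__":
    registry = Registry()
    provider = Sidecar("10.0.0.2", registry)
    caller = Sidecar("10.0.0.1", registry)
    provider.publish("order-service", 8080)
    mesh = {"10.0.0.2:12200": provider}  # advertised address -> sidecar
    print(call("order-service", caller, mesh))
```

The application only ever talks to its local sidecar, so the registry protocol and routing logic can change without touching application code.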

Smooth migration is enabled by sidecar injection on either the caller or provider side. Either side can be migrated first, with sidecars handling registration and subscription transparently, allowing gray‑scale roll‑out and instant rollback if issues arise.

To ensure stability at Ant’s massive scale, an unattended change framework was introduced, modeled after autonomous driving levels L0‑L5. Ant has reached L3, providing automated batch orchestration, mandatory gray‑scale, pre‑change validation (e.g., peak‑time checks), and post‑change verification (e.g., monitoring, error‑rate checks), with the ability to halt or roll back changes based on results.
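To make the shape of such a change pipeline concrete, here is a minimal sketch of batch orchestration with a mandatory gray-scale batch, a pre-change gate, and post-change verification that triggers rollback. It is an illustration only: the function names, the batch sizes, and the callback-based checks are assumptions, and the article does not describe the real framework's interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeResult:
    status: str  # "done", "blocked", or "rolled_back"
    applied: list = field(default_factory=list)


def make_batches(targets, gray_size, batch_size):
    """Mandatory gray-scale: a small first batch, then fixed-size batches."""
    batches = [targets[:gray_size]]
    rest = targets[gray_size:]
    batches += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return [b for b in batches if b]


def run_change(targets, apply, rollback, pre_check, post_check,
               gray_size=1, batch_size=2):
    """Apply a change batch by batch, rolling everything back on a failed check."""
    # Pre-change validation, e.g. refuse to start during peak hours.
    if not pre_check():
        return ChangeResult("blocked")
    applied = []
    for batch in make_batches(targets, gray_size, batch_size):
        for target in batch:
            apply(target)
            applied.append(target)
        # Post-change verification, e.g. error-rate or monitoring checks.
        if not post_check(batch):
            for target in reversed(applied):
                rollback(target)
            return ChangeResult("rolled_back")
    return ChangeResult("done", applied)


if __name__ == "__main__":
    versions = {}
    result = run_change(
        ["pod-a", "pod-b", "pod-c", "pod-d"],
        apply=lambda t: versions.__setitem__(t, "v2"),
        rollback=lambda t: versions.pop(t),
        pre_check=lambda: True,
        post_check=lambda batch: "pod-c" not in batch,  # simulate a bad batch
    )
    print(result.status, versions)
```

In this sketch a failed check rolls back every target touched so far; a real system could instead halt and wait for an operator, which matches the "halt or roll back" choice the article mentions.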

The overall architecture is dual-mode: traditional SDK-based microservices run alongside Service Mesh. Pilot distributes configuration, Mosn serves as the data plane (supporting SOFA, Dubbo, and Spring Cloud), and both container/K8s and virtual machine deployments are supported.

At present, Service Mesh covers thousands of Ant applications and tens of thousands of pods, and handles tens of millions of QPS during peak events. Infrastructure upgrade frequency has risen from 1-2 times per year to 1-2 times per month, saving thousands of person-days annually. The mesh has also improved security and enabled fine-grained traffic control, adaptive rate limiting, and service isolation.

Future outlook includes extending more infrastructure capabilities (transactions, caching, configuration, scheduling) into Mosn, moving towards a pure “Micrologic + Sidecar” model, and eventually integrating ordinary business services into a serverless paradigm, thereby achieving true decoupling of business logic from infrastructure.
