
Deep Practice of Service Mesh at Ant Financial: Architecture, Scale, Performance, and Recommendations

This article presents Ant Financial’s Service Mesh deployment: its evolution from research to a production mesh of more than 100,000 pods, the measured performance overhead, optimization techniques, operational practices, migration strategies, and practical recommendations for organizations considering Service Mesh adoption.

Architecture Digest

In a talk at QCon 2019, Ant Financial senior technical expert Ao Xiaojian described the company’s in‑depth Service Mesh practice, covering the journey from early research to large‑scale production deployment.

The deployment progressed through several stages: technical research (2017), exploration with the SOFAMosn sidecar (2018), a small‑scale internal rollout, a large‑scale rollout in early 2019 that supported the 618 promotion, and full‑scale deployment in the second half of 2019. The mesh now exceeds 100,000 pods, which the talk describes as the world’s largest Service Mesh cluster.

Key application scenarios include multi‑language support (Java, Go, Python, C++, NodeJS), transparent application upgrades, fine‑grained traffic control, RPC protocol handling, and observability.

Performance tests comparing workloads with and without SOFAMosn show modest overhead: CPU usage rises by about 2% on average, memory usage increases by about 15 MB per node, and latency grows by about 0.2 ms. In some special cases latency even drops, because of routing‑cache optimizations.

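The performance impact is limited for several reasons: business‑logic processing dominates request time, so sidecar overhead is a small fraction of the total; moving communication logic into the sidecar eliminates the corresponding SDK overhead in the application; traffic between the application and the sidecar stays on localhost; and the sidecar caches results and processes only message headers.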
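The caching point can be illustrated with a small sketch. The article does not show SOFAMosn's route cache, so the types and names here (`Route`, `RouteCache`, `Lookup`, `Invalidate`) are hypothetical. The idea is only that the full rule match runs once per key and later requests reuse the result.

```go
package main

import (
	"fmt"
	"sync"
)

// Route is a hypothetical routing decision: the upstream cluster for a request.
type Route struct{ Cluster string }

// RouteCache memoizes routing decisions keyed by a header-derived value
// (here, the target service name). All names are illustrative assumptions.
type RouteCache struct {
	mu      sync.RWMutex
	entries map[string]Route
	match   func(service string) Route // slow path: full rule matching
	Misses  int                        // number of slow-path evaluations
}

func NewRouteCache(match func(service string) Route) *RouteCache {
	return &RouteCache{entries: make(map[string]Route), match: match}
}

// Lookup returns the cached route, running the slow path only on a miss.
func (c *RouteCache) Lookup(service string) Route {
	c.mu.RLock()
	r, ok := c.entries[service]
	c.mu.RUnlock()
	if ok {
		return r
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have filled the entry meanwhile.
	if r, ok := c.entries[service]; ok {
		return r
	}
	c.Misses++
	r = c.match(service)
	c.entries[service] = r
	return r
}

// Invalidate drops all cached results, for example when routing rules change.
func (c *RouteCache) Invalidate() {
	c.mu.Lock()
	c.entries = make(map[string]Route)
	c.mu.Unlock()
}

func main() {
	cache := NewRouteCache(func(service string) Route {
		return Route{Cluster: service + "-cluster"}
	})
	for i := 0; i < 3; i++ {
		fmt.Println(cache.Lookup("pay").Cluster)
	}
	fmt.Println("slow-path evaluations:", cache.Misses)
}
```

A real sidecar would also need to bound the cache size and invalidate entries whenever the control plane pushes new routing rules. The sketch reduces that to a single `Invalidate` call.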

SOFAMosn optimizations include Golang writev batching, memory reuse via unsafe pointers, protocol upgrades (e.g., switching to Bolt for faster header parsing), and route‑result caching, which together significantly improve data‑plane performance.
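As a rough illustration of write batching and buffer reuse in Go, the sketch below queues several encoded frames and flushes them with a single `net.Buffers.WriteTo` call. The Go standard library can turn this call into a `writev` system call on supported connection types such as TCP. It is not SOFAMosn's code: the one‑byte length‑prefix framing and the helper names are hypothetical, and `sync.Pool` stands in for the unsafe‑pointer memory reuse the talk mentions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
	"sync"
)

// bufPool recycles scratch buffers so the hot path avoids a fresh
// allocation per frame (a stand-in for MOSN's own memory reuse).
var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

// encodeFrame uses a hypothetical framing: a 1-byte length prefix followed
// by the payload, so payloads must be shorter than 256 bytes.
func encodeFrame(payload string) *bytes.Buffer {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	b.WriteByte(byte(len(payload)))
	b.WriteString(payload)
	return b
}

// flushBatch hands all pending frames to the writer in one WriteTo call,
// then returns the buffers to the pool. For writers without vectored-write
// support, net.Buffers falls back to sequential Write calls.
func flushBatch(w io.Writer, frames []*bytes.Buffer) (int64, error) {
	batch := make(net.Buffers, 0, len(frames))
	for _, f := range frames {
		batch = append(batch, f.Bytes())
	}
	n, err := batch.WriteTo(w)
	for _, f := range frames {
		bufPool.Put(f)
	}
	return n, err
}

// runBatch encodes the payloads, flushes them into an in-memory writer,
// and returns the bytes that would have gone on the wire.
func runBatch(payloads ...string) (string, int64, error) {
	frames := make([]*bytes.Buffer, 0, len(payloads))
	for _, p := range payloads {
		frames = append(frames, encodeFrame(p))
	}
	var out bytes.Buffer
	n, err := flushBatch(&out, frames)
	return out.String(), n, err
}

func main() {
	wire, n, err := runBatch("ab", "cde")
	fmt.Printf("%q %d %v\n", wire, n, err)
}
```

The in‑memory writer keeps the example self‑contained. Against a real `*net.TCPConn` the same `flushBatch` call is where the batching pays off, because several small frames cost one system call instead of one each.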

Additional performance work addresses Mixer and Pilot components: serialization improvements (using types.Any), pre‑computing xDS resources, and push‑optimizations that reduce CPU usage and configuration propagation latency.
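Pre‑computing resources can be sketched as follows: serialize a configuration version once and reuse the resulting bytes for every connected sidecar, rather than marshaling per push. This is a simplified sketch, not Pilot's implementation. `Snapshot` and `PushCache` are hypothetical names, and `encoding/json` stands in for the protobuf serialization a real xDS server would use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Snapshot is a hypothetical stand-in for a versioned set of xDS resources.
type Snapshot struct {
	Version  string   `json:"version"`
	Clusters []string `json:"clusters"`
}

// PushCache serializes each configuration version once and hands the same
// bytes to every sidecar that needs that version.
type PushCache struct {
	mu       sync.Mutex
	version  string
	payload  []byte
	Marshals int // how many times serialization actually ran
}

// Payload returns the serialized snapshot, marshaling only when the
// version differs from the cached one.
func (p *PushCache) Payload(s Snapshot) ([]byte, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.payload != nil && p.version == s.Version {
		return p.payload, nil
	}
	b, err := json.Marshal(s)
	if err != nil {
		return nil, err
	}
	p.Marshals++
	p.version, p.payload = s.Version, b
	return b, nil
}

func main() {
	cache := &PushCache{}
	snap := Snapshot{Version: "v1", Clusters: []string{"pay", "trade"}}
	// Simulate pushing the same version to 1000 sidecars.
	for i := 0; i < 1000; i++ {
		if _, err := cache.Payload(snap); err != nil {
			panic(err)
		}
	}
	fmt.Println("serializations for 1000 pushes:", cache.Marshals)
}
```

With many sidecars subscribed to the same configuration, serialization cost becomes proportional to the number of versions rather than the number of pushes. That is the kind of control‑plane CPU saving the paragraph above describes.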

Operationally, Ant Financial requires three capabilities for every online change: gray release, monitoring, and rollback. A ScopeConfig mechanism extends the same discipline to configuration, so that configuration changes can be rolled out gradually to a limited scope before they take effect everywhere.
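The article does not describe how ScopeConfig is implemented, so the following is only a guess at what scoped, gradual configuration rollout can look like. A change carries an application allow‑list and a percentage, each pod is hashed into a stable bucket, and rolling back means setting the percentage to zero. Every name and field here is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ScopedConfig is a hypothetical configuration change with a rollout scope.
type ScopedConfig struct {
	OldValue string
	NewValue string
	Apps     map[string]bool // apps allowed to see NewValue; empty means all
	Percent  uint32          // share of pods (0-100) that see NewValue
}

// bucket maps a pod name to a stable value in [0, 100), so a pod's
// membership never flips back and forth as Percent grows.
func bucket(pod string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(pod))
	return h.Sum32() % 100
}

// ValueFor returns the value a given pod should use right now.
func (c ScopedConfig) ValueFor(app, pod string) string {
	if len(c.Apps) > 0 && !c.Apps[app] {
		return c.OldValue // app is outside the rollout scope
	}
	if bucket(pod) < c.Percent {
		return c.NewValue
	}
	return c.OldValue
}

func main() {
	cfg := ScopedConfig{
		OldValue: "timeout=3s",
		NewValue: "timeout=1s",
		Apps:     map[string]bool{"pay": true},
		Percent:  10,
	}
	for _, pod := range []string{"pay-0", "pay-1", "pay-2"} {
		fmt.Println(pod, cfg.ValueFor("pay", pod))
	}
	cfg.Percent = 0 // rollback: every pod returns to the old value
	fmt.Println("after rollback:", cfg.ValueFor("pay", "pay-0"))
}
```

Because the bucket depends only on the pod name, raising `Percent` from 10 to 50 only adds pods to the new configuration and never removes any. This keeps monitoring comparisons between the two groups meaningful during the rollout.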

To enable smooth migration from traditional SDK‑based microservices to Service Mesh, a “dual‑mode microservice” approach is proposed, allowing both SDK and sidecar‑based services to interoperate and migrate transparently.

The ultimate solution combines a control plane (Pilot) with traditional registries/configuration centers (e.g., SOFARegistry, Nacos) via MCP and xDS/UDPA protocols, creating a unified, standards‑based registration and control platform.

Practical adoption advice includes assessing immediate pain points (multi‑language needs, library upgrade difficulty), leveraging Service Mesh for legacy app modernization, consolidating technology stacks, and aligning with cloud‑native strategies such as Kubernetes and serverless.

Ant Financial is preparing for the upcoming Double‑11 event, which will serve as a historic stress test for its massive Service Mesh deployment, and plans to share further insights, open‑source contributions, and commercial offerings in the future.
