Ant Group's Service Mesh Exploration and Cloud‑Native Application Runtime

This article details Ant Group's evolution from monolithic applications to microservices and its large‑scale adoption of Service Mesh, describing the motivations, challenges, architecture, multi‑mesh capabilities, and the design of a cloud‑native application runtime built on MOSN to decouple applications from infrastructure.

AntTech

At ArchSummit Shanghai, Ant Group presented its journey from early monolithic payment systems to a large-scale microservice architecture, highlighting key milestones from 2006 to 2020 that led to the adoption of cloud-native approaches such as Service Mesh and Serverless.

The introduction of Service Mesh is portrayed as a critical path for achieving cloud‑native applications, enabling the decoupling of business logic from infrastructure and providing unified governance for RPC, routing, circuit breaking, and rate limiting across heterogeneous language stacks.

Challenges with the existing SOFARPC SDK—high upgrade cost, version fragmentation, and lack of cross‑language support—motivated the shift to a data‑plane approach where a sidecar (MOSN) handles these concerns, allowing independent evolution of the underlying capabilities.

Since late 2017, Ant has pursued Service Mesh, with milestones including the open-sourcing of its sidecar MOSN (written in Go), the rollout of Message Mesh and DB Mesh in 2019, and over 80% mesh adoption across online services by the 2020 Double-11 promotion.

The Mesh architecture consists of a control plane (service governance, PaaS, monitoring) and a data plane (MOSN) that manages RPC, messaging, MVC, and task traffic, while exposing health checks, monitoring, configuration, security, and risk mitigation services.

Building on the Mesh experience, Ant defined a "cloud‑native application runtime" that abstracts distributed capabilities into APIs, decouples applications from specific implementations, and uses gRPC for communication, enabling interchangeable components such as SOFARegistry, Nacos, or Zookeeper.
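The decoupling described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Registry` interface, both backend classes, and the `build_registry` and `first_address` helpers are hypothetical names, the backends are in-memory stand-ins that do not talk to a real SOFARegistry, Nacos, or Zookeeper server, and in the actual runtime the boundary would be a gRPC call rather than an in-process interface.

```python
from typing import Dict, List, Protocol, Set


class Registry(Protocol):
    """Hypothetical capability API: service discovery, independent of backend."""

    def register(self, service: str, address: str) -> None: ...

    def lookup(self, service: str) -> List[str]: ...


class DictRegistry:
    """In-memory stand-in for a backend such as SOFARegistry or Nacos."""

    def __init__(self) -> None:
        self._table: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._table.setdefault(service, []).append(address)

    def lookup(self, service: str) -> List[str]:
        return list(self._table.get(service, []))


class PathRegistry:
    """In-memory stand-in that stores entries as Zookeeper-style paths."""

    def __init__(self) -> None:
        self._paths: Set[str] = set()

    def register(self, service: str, address: str) -> None:
        self._paths.add(f"/services/{service}/{address}")

    def lookup(self, service: str) -> List[str]:
        prefix = f"/services/{service}/"
        return sorted(p[len(prefix):] for p in self._paths if p.startswith(prefix))


# Configuration key -> implementation (keys are illustrative).
BACKENDS = {"dict": DictRegistry, "path": PathRegistry}


def build_registry(backend: str) -> Registry:
    """Pick the implementation from configuration, not from application code."""
    return BACKENDS[backend]()


def first_address(registry: Registry, service: str) -> str:
    """Application logic, written once against the interface."""
    addresses = registry.lookup(service)
    if not addresses:
        raise LookupError(f"no instance registered for {service}")
    return addresses[0]
```

Because `first_address` only sees the interface, switching the configured backend from `"dict"` to `"path"` requires no change to application code, which is the property the runtime aims for.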

Design principles focus on capability‑first APIs, intuitive defaults, and implementation‑agnostic interfaces, resulting in three proto definitions: mosn.proto (app‑to‑runtime), appcallback.proto (runtime‑to‑app), and actuator.proto (runtime operations).
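To make the three directions concrete, here is a toy in-process sketch. It does not reproduce the contents of the actual proto files: `RuntimeSketch`, its method names, and the `MessageHandler` callback shape are all assumptions chosen for illustration, and a real deployment would cross a gRPC boundary instead of making direct Python calls.

```python
from typing import Callable, Dict, List

# Runtime-to-app direction: a hypothetical callback shape standing in for
# what a definition like appcallback.proto would describe.
MessageHandler = Callable[[str, bytes], None]


class RuntimeSketch:
    """Toy in-process runtime; the real boundary would be gRPC."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[MessageHandler]] = {}

    # App-to-runtime direction (hypothetical method names).
    def subscribe(self, topic: str, handler: MessageHandler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver payload to every handler on topic; return the delivery count."""
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)  # the runtime calls back into the app
        return len(handlers)

    # Operations direction: a health probe of the kind an actuator exposes.
    def health(self) -> Dict[str, object]:
        return {"status": "UP", "topics": sorted(self._handlers)}
```

Keeping the callback direction separate from the request direction means an application can implement only the callbacks it needs, while operational probes stay out of the business-facing API.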

Component management treats each distributed capability as a Service with multiple interchangeable Components (e.g., MQ-pub implemented by SOFAMQ or Kafka), allowing the runtime to route requests to the appropriate implementation based on configuration.
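A minimal sketch of this Service-to-Component mapping follows. The `ComponentManager` class, the `"mq-pub"` service key, and the `RecordingPub` stand-in are hypothetical; a real component would wrap an actual SOFAMQ or Kafka client, and the real runtime's configuration format is not shown here.

```python
from typing import Dict, List, Protocol, Tuple


class PubComponent(Protocol):
    """What every MQ-pub component must provide (hypothetical shape)."""

    def publish(self, topic: str, payload: bytes) -> None: ...


class RecordingPub:
    """Stand-in component; a real one would wrap a SOFAMQ or Kafka client."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Tuple[str, bytes]] = []

    def publish(self, topic: str, payload: bytes) -> None:
        self.sent.append((topic, payload))


class ComponentManager:
    """Maps each service (capability) to its named components."""

    def __init__(self, active: Dict[str, str]) -> None:
        # active: service name -> name of the component selected by config
        self._active = dict(active)
        self._components: Dict[str, Dict[str, PubComponent]] = {}

    def register(self, service: str, name: str, component: PubComponent) -> None:
        self._components.setdefault(service, {})[name] = component

    def resolve(self, service: str) -> PubComponent:
        """Return the component that configuration selects for this service."""
        try:
            return self._components[service][self._active[service]]
        except KeyError:
            raise LookupError(f"no active component for service {service!r}") from None

    def publish(self, topic: str, payload: bytes) -> None:
        # The caller names the capability, never the implementation.
        self.resolve("mq-pub").publish(topic, payload)
```

With `ComponentManager({"mq-pub": "kafka"})`, a call to `publish` reaches only the component registered under the name `"kafka"`; changing the configured name redirects traffic without touching the caller.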

A comparison between traditional Mesh and the new runtime highlights the runtime's broader scope, API‑centric model, and support for multi‑mesh scenarios such as Message, Cache, and Config Mesh.

Real‑world use cases include heterogeneous language integration (Java, Node.js, others via gRPC), vendor‑agnostic deployments across private, public, and hybrid clouds, and a FaaS cold‑start pre‑warming pool that reduces function startup latency by up to 80%.
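The pre-warming idea can be illustrated with a small pool that hands out already-started instances and falls back to a cold start only when it runs dry. This is a sketch of the general technique, not Ant's implementation: `Instance`, `WarmPool`, and the `warm` flag are hypothetical, and startup cost is represented only by that flag rather than measured.

```python
import itertools
from collections import deque
from typing import Deque


class Instance:
    """Hypothetical function instance; `warm` marks one started ahead of time."""

    _ids = itertools.count(1)

    def __init__(self, warm: bool) -> None:
        self.id = next(Instance._ids)
        self.warm = warm


class WarmPool:
    """Keeps up to `size` pre-started instances ready for incoming requests."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._idle: Deque[Instance] = deque()
        self.refill()

    def refill(self) -> None:
        # In a real system this would run in the background, off the request path.
        while len(self._idle) < self._size:
            self._idle.append(Instance(warm=True))

    def acquire(self) -> Instance:
        if self._idle:
            return self._idle.popleft()  # fast path: startup already paid
        return Instance(warm=False)  # pool exhausted: pay the cold start now

    def idle_count(self) -> int:
        return len(self._idle)
```

The pool size trades idle resource cost against the share of requests that hit the fast path, so it would normally be tuned to expected burst size.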

Future plans involve community co‑creation of standard cloud‑native APIs, continued open‑source releases of the runtime (targeting version 1.0 by year‑end), and ongoing exploration of multi‑mesh and runtime enhancements.
