
Meituan's Practice and Challenges in Deploying Service Mesh in Private Cloud Clusters

Meituan rolled out a service mesh in its private cloud, built on an Envoy data plane, a self‑developed control plane, and lightweight SDKs for each supported language. Isolation, a standardized governance runtime, data sharding, and centralized health checks helped it overcome compatibility, scaling, and heterogeneity hurdles. The mesh now supports more than 600 online services with sub‑millisecond one‑hop latency and unified multi‑language management.

Meituan Technology Team

In private cloud clusters, building a Service Mesh often requires large‑scale changes to the existing technical architecture and runs into compatibility problems, obstacles to supporting large scale, and difficulty driving adoption. This article systematically describes the challenges Meituan encountered and the practical experience it gained during its Service Mesh rollout, in the hope that it will be useful to teams facing similar problems.

Meituan’s service governance platform, OCTO, has evolved through four stages: (1) unifying basic governance capabilities (a unified communication framework and registry center); (2) improving performance and usability (QPS raised from 20k to nearly 100k, a 99th‑percentile latency of 1 ms, distributed tracing, and fine‑grained timing); (3) enriching governance across the board (full‑link pressure testing, performance diagnosis, stability assurance, authentication and encryption, and link‑level traffic governance); (4) cross‑region disaster recovery and expansion (unitization at a scale of tens of millions of daily orders, with all PaaS components and core storage systems interconnected).

The current pain points are strong coupling between business code and middleware, severe middleware version fragmentation, difficulty integrating heterogeneous systems, and weak governance for non‑Java languages, which lack official SDKs.

The proposed optimization follows three steps: isolation and decoupling, building a standardized governance runtime on the isolated infrastructure, and constructing the governance system on top of that standard.

Meituan’s Service Mesh architecture (internal project OCTO Mesh) adopts an Envoy‑based data plane and a self‑developed control plane. Lightweight SDKs in each language communicate with the local proxy over Unix Domain Sockets (UDS). The control plane consists of Pilot (core governance), Dispatcher (shields the differences among heterogeneous subsystems), a centralized health‑check manager, Config Server (policy management), a monitoring and inspection system, and Meta Server (node registration, addressing, isolation, and horizontal scaling).
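The SDK‑to‑proxy hop over UDS can be illustrated with a minimal sketch. It uses only the Python standard library; the socket path, the `proxied:` reply prefix, and the function names are hypothetical stand‑ins, not OCTO Mesh's actual protocol or API.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Stand-in for the local proxy: accept one connection and answer it."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(4096)
        # A real proxy would route the request upstream; here we only tag it.
        conn.sendall(b"proxied:" + data)


def call_via_uds(path: str, payload: bytes) -> bytes:
    """Stand-in for a lightweight SDK: one request over the local socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(payload)
        return client.recv(4096)


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mesh-proxy.sock")  # hypothetical socket path
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)
        worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
        worker.start()
        try:
            return call_via_uds(path, b"ping")
        finally:
            worker.join(timeout=5)
            server.close()


if __name__ == "__main__":
    print(demo())
```

Because both ends live on the same host, the socket is addressed by a filesystem path rather than an IP and port, which avoids the TCP/IP stack for the local hop.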

To make the mesh compatible with existing systems and transparent to the business, Meituan deeply integrates it with OCTO, supports containers, VMs, and physical machines, aligns with the existing operations systems, ensures protocol compatibility (the SDK and proxy agree on the semantic content), and lets services switch between Mesh and non‑Mesh modes through a visual toggle that takes effect in real time.

To address heterogeneity, the solution treats the data plane and control plane together as a standardized service governance runtime, builds a unified access center (Dispatcher) to shield the differences among governance subsystems, rebuilds the SDKs for six languages, and uses an enhanced unified protocol for cross‑language interoperability.
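The idea of a unified access center can be sketched as an adapter registry: each subsystem keeps its native data shape, and an adapter normalizes it before anything reaches the rest of the control plane. The class, the adapter functions, and the sample data below are all made up for illustration; the article does not describe Dispatcher's real interface.

```python
from typing import Callable, Dict, List

Adapter = Callable[[str], List[str]]


class Dispatcher:
    """Hypothetical unified access point in front of governance subsystems."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, kind: str, adapter: Adapter) -> None:
        self._adapters[kind] = adapter

    def fetch(self, kind: str, service: str) -> List[str]:
        if kind not in self._adapters:
            raise KeyError(f"no adapter registered for {kind!r}")
        return self._adapters[kind](service)


def registry_adapter(service: str) -> List[str]:
    # Pretend the registry answers with a host -> port mapping (made-up data).
    raw = {"10.0.0.1": 8080, "10.0.0.2": 8080}
    return [f"{host}:{port}" for host, port in sorted(raw.items())]


def config_adapter(service: str) -> List[str]:
    # Pretend the config store answers with a single delimited string.
    raw = "timeout_ms=50;retries=2"
    return raw.split(";")


dispatcher = Dispatcher()
dispatcher.register("endpoints", registry_adapter)
dispatcher.register("policy", config_adapter)

if __name__ == "__main__":
    print(dispatcher.fetch("endpoints", "demo-service"))
    print(dispatcher.fetch("policy", "demo-service"))
```

With this shape, integrating another subsystem means registering one more adapter, and callers of `fetch` do not change.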

The scalability challenges with Istio stem from three things: each control plane instance holds the full etcd data set, so the control plane cannot scale horizontally; each proxy interacts with etcd independently, which causes redundant I/O; and each node probes all the others, which leads to a network storm. Meituan’s countermeasures are: (1) horizontal data sharding via Meta Server, so that each proxy connects to the appropriate control plane instance and that instance loads only the data it needs; (2) vertical layered subscription, which uses snapshot caching and indexing to reduce the number of ZooKeeper (ZK) watchers, plus I/O multiplexing at the governance layer to boost throughput; (3) centralized health checking in place of P2P probes, which cuts the number of checks per cycle from tens of billions to hundreds of thousands and allows an immediate re‑check when an anomaly is detected.
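Two of these countermeasures lend themselves to a short sketch. The first function shows one possible way a Meta Server could map a service to a control plane instance; hashing the service name is an assumption, since the article does not say how shards are keyed. The other two functions compare probe counts per cycle under a worst case where every node probes every other node; the node count of 200,000 is an assumed figure chosen only because it matches the orders of magnitude quoted above.

```python
import hashlib
from typing import List


def pick_control_plane(service: str, instances: List[str]) -> str:
    """Map a service to one control plane instance with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(service.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


def p2p_probes(nodes: int) -> int:
    """Worst case: every node probes every other node once per cycle."""
    return nodes * (nodes - 1)


def centralized_probes(nodes: int) -> int:
    """A central checker probes each node once per cycle."""
    return nodes


if __name__ == "__main__":
    shards = ["cp-0", "cp-1", "cp-2"]  # hypothetical instance names
    print(pick_control_plane("demo-service", shards))
    n = 200_000  # assumed node count, not a figure from the article
    print(p2p_probes(n), centralized_probes(n))
```

At 200,000 nodes the all‑pairs count is 39,999,800,000 probes per cycle versus 200,000 for the centralized checker. Note that plain modulo hashing reshuffles most services when the instance list changes; a production design would more likely use consistent hashing or an explicit assignment table.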

For transaction‑oriented businesses with low fault tolerance, Meituan emphasizes a rollout driven by business value: pre‑validation on non‑core services, automated SDK version checks, one‑click Mesh activation with traffic‑ratio control through the platform, and automatic rollback to non‑Mesh mode when an anomaly is detected. Performance optimizations (UDS, incremental aggregation, and serialization improvements) achieve more than 34k QPS, an average one‑hop latency of 0.207 ms, and a 99th‑percentile latency of around 0.4 ms in an echo test with 2 cores and 4 GB of memory.
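Traffic‑ratio control with automatic rollback can be sketched as follows. This is a simplified illustration, not Meituan's platform logic: the class name, the error‑rate trigger, and the `min_samples` guard are assumptions, and real anomaly detection would use richer signals than a single error ratio.

```python
import zlib


class MeshSwitch:
    """Hypothetical per-service switch: route a share of traffic through the mesh."""

    def __init__(self, mesh_percent: int, error_threshold: float, min_samples: int = 20) -> None:
        self.mesh_percent = mesh_percent        # 0..100, share of requests sent via the mesh
        self.error_threshold = error_threshold  # error ratio that triggers rollback
        self.min_samples = min_samples          # avoid reacting to a handful of requests
        self._requests = 0
        self._errors = 0

    def use_mesh(self, request_id: str) -> bool:
        # A stable hash keeps a given request id on the same path across retries.
        bucket = zlib.crc32(request_id.encode("utf-8")) % 100
        return bucket < self.mesh_percent

    def record(self, ok: bool) -> None:
        """Record the outcome of a mesh request and roll back if errors pile up."""
        self._requests += 1
        if not ok:
            self._errors += 1
        enough = self._requests >= self.min_samples
        if enough and self._errors / self._requests > self.error_threshold:
            self.mesh_percent = 0  # automatic rollback to non-Mesh mode


if __name__ == "__main__":
    switch = MeshSwitch(mesh_percent=100, error_threshold=0.5)
    print(switch.use_mesh("req-1"))
    for _ in range(20):
        switch.record(ok=False)
    print(switch.mesh_percent, switch.use_mesh("req-1"))
```

Raising `mesh_percent` in small steps gives the gradual activation described above, and setting it to 0 sends all traffic back over the non‑Mesh path.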

Results show over 600 online services and 3,500+ offline services using Mesh, successful rapid integration of heterogeneous systems such as Mobike’s gRPC framework, unified multi‑language governance capabilities, and a foundation for further enrichment of governance features.

The practice validates the feasibility of the model, delivering short‑term value in heterogeneous system integration and multi‑language unification, while long‑term expectations center on richer governance outputs under a standardized runtime and business‑decoupled, centrally controlled architecture.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
