
How Meituan Scaled Service Governance with OCTO Mesh: Architecture & Lessons

Meituan’s OCTO Mesh rebuilds the company’s large-scale service governance on a Service Mesh architecture: sidecar proxies, a custom control plane, and Meta Server driven routing. This summary covers how that design addresses multi-language support, middleware coupling, heterogeneous integration, and scalability, and walks through the design choices, health-check strategy, and operational tooling.

Meituan Technology Team

Background and Motivation

Meituan’s OCTO is a standardized service governance platform that now covers about 90% of the company’s applications and handles over a trillion daily calls. While the system is mature, it faces several challenges: limited multi‑language support, tight coupling between middleware and business code, high integration cost for heterogeneous technologies, and decentralized governance decisions.

Why Service Mesh?

Adopting a Service Mesh allows each service instance to run a sidecar proxy that handles all inbound and outbound traffic. Core governance functions such as routing and rate limiting move to the sidecar and a centralized control plane, providing a language‑agnostic solution, decoupling middleware upgrades from business deployments, and simplifying integration of heterogeneous subsystems.

Technical Selection and Architecture Design

The design emphasizes four criteria: preserving existing standards, maintaining governance capabilities, supporting ultra‑large scale, and staying close to the open‑source community.

Data plane built on Envoy, heavily customized through filters.

Control plane developed in‑house (named Adcore), consisting of Pilot, Dispatcher, health‑check, node management, monitoring, and a Meta Server for service registration and discovery.

Sidecar (OCTO Proxy) deployed 1:1 with business processes, communicating with the local business process over UNIX domain sockets and with other nodes over TCP.

LEGO Agent manages proxy lifecycle and hot upgrades, reducing manual intervention.

Control Plane Details

Adcore Pilot acts as the brain of the control plane: it handles most governance logic and interacts directly with sidecars. Dispatcher serves as the access hub for auxiliary subsystems. Health checks are run centrally rather than as full-mesh P2P probes, which avoids the O(N²) probe load that a full mesh generates. The Meta Server uses consistent hashing to shard routing data across pilots, enabling efficient failover and load balancing.
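The sharding idea can be illustrated with a minimal consistent-hash ring. This is only a sketch under stated assumptions: the class name `HashRing`, the pilot names, the use of service identifiers as keys, and the virtual-node count are all hypothetical, and the source does not describe how the Meta Server actually implements its hashing.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not OCTO code)."""

    def __init__(self, pilots, vnodes=100):
        self._vnodes = vnodes
        self._keys = []    # sorted hashes of all virtual nodes
        self._owners = {}  # virtual-node hash -> pilot name
        for pilot in pilots:
            self.add(pilot)

    def add(self, pilot: str) -> None:
        for i in range(self._vnodes):
            h = _hash(f"{pilot}#{i}")
            bisect.insort(self._keys, h)
            self._owners[h] = pilot

    def remove(self, pilot: str) -> None:
        dead = {h for h, owner in self._owners.items() if owner == pilot}
        self._keys = [h for h in self._keys if h not in dead]
        for h in dead:
            del self._owners[h]

    def owner(self, service: str) -> str:
        if not self._keys:
            raise LookupError("no pilots registered")
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(service)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

The property that matters for failover is that removing one pilot only reassigns the keys that pilot owned; every other key keeps its owner, so the surviving pilots do not need to reload their shards.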

Data Flow

Sidecars and pilots communicate via bidirectional streaming using an enhanced xDS protocol. Custom protocols deliver governance commands beyond routing, such as authentication and circuit breaking.
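For readers unfamiliar with xDS, the sketch below shows the ACK/NACK handshake of the stock protocol using plain dictionaries in place of the protobuf messages. The field names (`version_info`, `nonce`, `response_nonce`, `type_url`, `error_detail`) follow the open-source xDS protocol; the `SidecarState` class and the validation rule are hypothetical, and none of this represents Meituan's enhanced protocol or its custom governance commands.

```python
def validate(resources: dict) -> None:
    # Hypothetical validation rule: every cluster must list an endpoint.
    for name, endpoints in resources.items():
        if not endpoints:
            raise ValueError(f"cluster {name} has no endpoints")


class SidecarState:
    """Tracks the last accepted config version on the sidecar side."""

    def __init__(self):
        self.version = ""    # last accepted version_info
        self.resources = {}

    def on_response(self, response: dict) -> dict:
        """Apply a DiscoveryResponse-like dict; return the ACK or NACK request."""
        request = {
            "type_url": response["type_url"],
            "response_nonce": response["nonce"],
        }
        try:
            validate(response["resources"])
        except ValueError as exc:
            # NACK: keep the old config and echo the last accepted version.
            request["version_info"] = self.version
            request["error_detail"] = str(exc)
            return request
        # ACK: adopt the new config and echo its version.
        self.resources = response["resources"]
        self.version = response["version_info"]
        request["version_info"] = self.version
        return request
```

Because the sidecar echoes the nonce of the response it is answering, the control plane can tell which push an ACK or NACK refers to even when several pushes are in flight on the same stream.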

Key Design Analyses

Large‑Scale Mesh Capabilities

Control plane nodes are horizontally scalable; each pilot only holds data for its managed sidecars.

On network partitions, the system can absorb traffic spikes.

A hybrid health-check scheme combines centralized monitoring with selective P2P checks.
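A quick back-of-the-envelope calculation shows why full-mesh probing does not scale. The functions below are an illustrative model, assuming every instance probes every other instance once per round in the full-mesh case and a central checker probes each instance once per round in the centralized case; the instance count used in the example is made up, not a Meituan figure.

```python
def full_mesh_probes(instances: int) -> int:
    # Each instance probes every other instance once per round.
    return instances * (instances - 1)


def centralized_probes(instances: int) -> int:
    # A central checker probes each instance once per round.
    return instances


if __name__ == "__main__":
    n = 1000
    print(full_mesh_probes(n), centralized_probes(n))
```

Under this model the full-mesh cost grows quadratically while the centralized cost grows linearly, which is the motivation for keeping P2P checks selective.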

Heterogeneous System Integration

A unified access center (Dispatcher) abstracts away diverse storage and pub/sub mechanisms of existing subsystems. Changes are pushed as lightweight notifications; pilots fetch full data on demand, keeping message queues small and avoiding version conflicts.
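The notify-then-fetch pattern described above can be sketched as follows. The `Subsystem` and `Pilot` classes, the per-key version counter, and the `rate-limit` key are assumptions made for illustration; the source does not specify the notification format or how versions are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """Hypothetical source of truth: key -> (version, payload)."""
    data: dict = field(default_factory=dict)

    def write(self, key, payload) -> dict:
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, payload)
        # The notification carries only the key and version, never the payload.
        return {"key": key, "version": version}

    def fetch(self, key):
        return self.data[key]


@dataclass
class Pilot:
    source: Subsystem
    cache: dict = field(default_factory=dict)
    fetches: int = 0

    def on_notify(self, note: dict) -> None:
        cached_version = self.cache.get(note["key"], (0, None))[0]
        if note["version"] <= cached_version:
            return  # stale or duplicate notification, nothing to do
        # Always pull the latest full data instead of trusting the message.
        self.cache[note["key"]] = self.source.fetch(note["key"])
        self.fetches += 1


if __name__ == "__main__":
    subsystem = Subsystem()
    pilot = Pilot(source=subsystem)
    first = subsystem.write("rate-limit", {"qps": 100})
    second = subsystem.write("rate-limit", {"qps": 200})
    # Deliver the notifications out of order.
    pilot.on_notify(second)
    pilot.on_notify(first)
    print(pilot.cache["rate-limit"], pilot.fetches)
```

Because the pilot fetches the current state rather than applying the contents of each message, late or reordered notifications cannot roll the cache back to an older version.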

Stability Guarantees

To mitigate the inherent complexity of a new mesh, Meituan built extensive fault isolation, automatic rollback, flexible availability controls, observability, and regression testing. A Mock‑Sidecar framework simulates sidecar behavior for control‑plane testing, allowing step‑wise YAML‑defined scenarios and parallel stress tests.
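A step-wise scenario runner in the spirit of the Mock-Sidecar framework might look like the sketch below. Everything here is hypothetical: the `FakePilot` stand-in, the request and response shapes, and the scenario layout are invented for illustration, and the steps are written as a Python list (roughly what a YAML file would look like after loading) to keep the example free of third-party dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


class FakePilot:
    """Stand-in for the control plane under test (hypothetical)."""

    def __init__(self):
        self.routes = {"svc-a": ["10.0.0.1:8080"]}

    def handle(self, request: dict) -> dict:
        if request["type"] == "subscribe":
            routes = self.routes.get(request["service"], [])
            return {"status": "ok", "routes": routes}
        if request["type"] == "ack":
            return {"status": "ok"}
        return {"status": "error"}


# Each step sends one message and states the expected reply.
SCENARIO = [
    {"send": {"type": "subscribe", "service": "svc-a"},
     "expect": {"status": "ok", "routes": ["10.0.0.1:8080"]}},
    {"send": {"type": "ack"},
     "expect": {"status": "ok"}},
]


def run_scenario(pilot, steps) -> str:
    """Play the steps in order as one mock sidecar; stop at the first mismatch."""
    for index, step in enumerate(steps):
        response = pilot.handle(step["send"])
        if response != step["expect"]:
            return f"step {index} failed: got {response!r}"
    return "passed"


def run_parallel(pilot, steps, sidecars: int) -> list:
    """Run the same scenario from many mock sidecars concurrently."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda _: run_scenario(pilot, steps), range(sidecars)))
```

Calling `run_scenario(FakePilot(), SCENARIO)` exercises a single mock sidecar, while `run_parallel` reuses the same steps to approximate a stress test against one control-plane instance.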

Operations System

The LEGO platform orchestrates proxy upgrades: operators specify target versions and scopes, resources are stored in a repository, and LEGO agents pull and launch new proxies, with built‑in polling to ensure successful rollouts.
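The upgrade-then-poll loop can be sketched roughly as follows. The `Agent` class, its `upgrade` and `report` methods, and the `boots_after` parameter are hypothetical stand-ins that simulate a proxy needing a few polls before the new version is running; the real LEGO agent interface is not described in the source.

```python
import time


class Agent:
    """Hypothetical agent stand-in that simulates a delayed proxy restart."""

    def __init__(self, host: str, version: str, boots_after: int = 1):
        self.host = host
        self.running = version
        self._boots_after = boots_after  # polls needed before the new proxy is up
        self._pending = None
        self._polls_left = 0

    def upgrade(self, version: str) -> None:
        self._pending = version
        self._polls_left = self._boots_after

    def report(self) -> str:
        if self._pending is not None:
            if self._polls_left == 0:
                self.running = self._pending
                self._pending = None
            else:
                self._polls_left -= 1
        return self.running


def rollout(agents, target: str, max_polls: int = 5, interval: float = 0.0) -> list:
    """Request the upgrade, then poll; return hosts that never reached target."""
    for agent in agents:
        agent.upgrade(target)
    pending = list(agents)
    for _ in range(max_polls):
        pending = [a for a in pending if a.report() != target]
        if not pending:
            return []
        time.sleep(interval)
    return [a.host for a in pending]
```

Returning the list of hosts that did not converge gives the operator a concrete set to retry or roll back, rather than a single pass or fail flag for the whole scope.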

Conclusion and Outlook

Key takeaways include the importance of standardization, performance, and ease of use; the need to align mesh adoption with the existing containerization and governance stacks; and the value of a robust stability and operations framework. Future work will expand OCTO Mesh capabilities, broaden the traffic types it handles, and explore centralized governance for globally optimal decisions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
