Big Data 17 min read

Design and Evolution of Meituan's OCTO Data Center (Watt) for Trillion‑Scale Real‑Time Analytics

Meituan’s self‑built OCTO data center, codenamed Watt, transforms over ten trillion daily records into multi‑dimensional, real‑time metrics by using stateless, horizontally‑scalable compute nodes, hierarchical aggregation, and lock‑free processing, achieving sub‑second latency, five‑nine availability, and reducing weekly operations from twenty hours to ten minutes.

Meituan Technology Team

Apr 23, 2020

Design and Evolution of Meituan's OCTO Data Center (Watt) for Trillion‑Scale Real‑Time Analytics

Meituan's self‑developed OCTO data center (codenamed Watt) processes trillions of data items per day, offering strong scalability and real‑time capabilities while keeping weekly operational costs under ten minutes for thousands of instances.

This article details the evolution and architectural design of the Watt compute engine, describing the technologies used to boost computing power, throughput, and reduce operational overhead.

1. OCTO Data Center Overview

1.1 System Introduction

OCTO is Meituan's standardized service‑governance infrastructure, covering registration, discovery, data governance, load balancing, fault tolerance, and gray‑release. It aims to improve development efficiency, lower operation costs, and increase application stability.

1.2 Business Introduction

The OCTO data center provides multi‑dimensional, precise latency (Top Percentile), QPS, success‑rate metrics with granularity ranging from seconds to days, supporting complex queries such as "client IP + service interface" or "host + interface".

Watt currently handles over ten trillion records daily across more than ten dimensions. Rapid data growth has driven a full technical evolution of the system.

1.3 Original Architecture

The original OCTO compute engine suffered from several issues:

Stateful compute nodes that could not be automatically migrated when failures occurred, leading to data loss and requiring ~20 hours of weekly manual O&M.

Lack of horizontal scalability; large‑scale services required dedicated high‑spec machines.

Overall system stability was low; a slow node could cause cascade failures.

Hot‑spot and data‑skew problems, with some services generating far more data than others, making routing policies fragile.

The original architecture processed about 300 billion records per day and exhibited frequent alert loss, inaccurate data, multi‑hour delays, and limited support for secondary dimensions.

1.4 New Architecture Goals

High throughput and extreme horizontal scalability (20×+ scaling, supporting >10 trillion daily records).

Precise data with >5 nines availability, self‑healing nodes, no data loss.

High reliability and availability (stateless compute nodes, 1/3 node failure tolerance, peak‑shaving).

Low latency (TP99 < 10 s for second‑level metrics, < 2 min for minute‑level metrics).

1.5 New Architecture Challenges

Massive data skew and multi‑dimensional metrics across services with volume differences of up to a million‑fold.

Accurate percentile (TP) calculation without sampling for trillion‑scale data.

Ensuring high stability with 1/3 node failures requiring no manual intervention.

Supporting 2.4× yearly data growth with horizontal scaling.

Achieving precise, real‑time, multi‑dimensional TP computation at trillion scale.

2. Compute Engine Technical Design

2.1 Solution Selection

Traditional big‑data stacks (Flink, Spark, OLAP) consume excessive resources for trillion‑scale, fine‑grained percentile calculations. Therefore a custom solution was designed.

2.2 System Design Principles

Stateless compute nodes with heartbeat‑based automatic exclusion and task reassignment.

Separation of online and offline computation; shared pre‑computation performed once.

Load‑balancing via hash‑based distribution to mitigate hot‑spots.

Distributed sub‑computations that preserve exact percentile accuracy despite partitioning.

Multi‑stage dimensionality reduction to reuse intermediate results and lower overall compute load.

2.3 Detailed Technical Breakdown

2.3.1 Data Flow

The system builds a recursive topology tree for each statistical dimension. Yellow nodes represent message‑queue topics, green nodes represent compute sub‑clusters that consume topics, aggregate metrics, and forward compressed results downstream. Red arrows indicate data flow. Each child cluster processes a strict subset of its parent’s dimensions, enabling hierarchical aggregation.

2.3.2 Compute Model

Data with identical dimension values are routed to the same compute node within a cluster. Nodes consume messages, aggregate them into "count‑card tables" (maps of metric → count). These tables are merged with same‑dimension tables from other nodes, stored in time windows, and, if further sub‑clusters exist, re‑hashed and forwarded downstream. Offline daily calculations reuse intermediate results, ensuring no loss of precision for TP metrics.

2.3.3 Key Technologies

Hot‑spot mitigation via multi‑level hash distribution and hierarchical dimensionality reduction.

Computation reduction through pre‑computation reuse and distributed multi‑path merging.

IO optimization with custom protobuf serialization and batch MQ handling.

HBase row‑key hotspot elimination.

Lock‑free processing using a self‑developed thread‑bucket streaming batch model.

Full‑stack horizontal scalability across compute, transport, and storage layers.

2.4 Reliability, Low‑Ops, and Accuracy Design

Stateless nodes with rapid self‑healing (seconds to detect and recover).

Real‑time anomaly detection via heartbeat removal and data back‑fill.

Fault isolation per dimension and multi‑datacenter disaster recovery.

Exact TP calculation preserved through hierarchical aggregation without data distortion.

2.5 Real‑Time Enhancements

Streaming topology with distributed sub‑cluster result reuse, reducing compute load.

Lock‑free processing model.

Instant anomaly monitoring and rebalance.

Second‑level computation isolated in memory.

3. Optimization Results

Daily processed data exceeds 1 trillion records; the system now supports >10 trillion daily volume, with peak second‑level throughput >5 hundred million records.

Single‑service daily throughput increased >1 000×, supporting up to 500 billion records per service.

Maximum latency reduced from >4 hours to <2 minutes; TP99 for second‑level metrics is ~6 seconds.

Average latency dropped from 4.7 minutes to ~1.5 minutes.

Cluster stability improved: no snow‑balling failures; 1/3 node failures auto‑recover within 2 seconds.

Supports multi‑dimensional, real‑time precise TP calculation up to six‑nines precision across all services.

Operational effort reduced from >20 hours/week to <10 minutes per week for a thousand‑node cluster.

4. Conclusion

The paper presents a trillion‑scale, multi‑dimensional, precise TP compute engine solution suitable for ultra‑large digital governance systems, addressing scalability, real‑time performance, accuracy, stability, and operational cost. Meituan's foundational R&D platform invites industry peers to discuss and collaborate.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

real-time analytics Meituan

Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.