MTrace: Meituan‑Dianping Distributed Session Tracing System Design and Practice

The article introduces MTrace, Meituan‑Dianping’s large‑scale distributed session tracing system, explaining its call‑chain concept, architecture, data‑embedding SDK, trace and span identifiers, APIs for transparent data propagation, and how it enables bottleneck detection, performance optimization, and comprehensive monitoring across heterogeneous backend services.

Architecture Digest

The article, derived from Meituan‑Dianping’s Tech Salon session 08, presents the design and practice of MTrace, an internal distributed session tracing system that reconstructs call chains across services using a global traceId.

Core Concepts : A traceId (a 64-bit global identifier) marks the whole request, and a spanId (a hierarchical identifier such as 0.2) marks each RPC within it, so the pair uniquely identifies every call in a distributed request. Annotations let the business side attach custom data (e.g., user ID) to the trace.
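To make the hierarchical spanId concrete, here is a minimal sketch of how child identifiers could be derived from a parent. The class name TraceContext, the root value "0", and numbering children from 1 are assumptions for illustration; the article only gives 0.2 as an example and does not describe the SDK's actual generation scheme.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical trace context; names are illustrative, not MTrace's API. */
final class TraceContext {
    final long traceId;   // 64-bit id shared by every span of one request
    final String spanId;  // hierarchical position in the call tree, e.g. "0.2"
    private final AtomicInteger childCounter = new AtomicInteger(0);

    TraceContext(long traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    /** Starts a trace at the entry point; the root spanId "0" is an assumption. */
    static TraceContext newRoot(long traceId) {
        return new TraceContext(traceId, "0");
    }

    /** Context for the next outgoing RPC: same traceId, spanId = parent + "." + n. */
    TraceContext nextChild() {
        int n = childCounter.incrementAndGet();
        return new TraceContext(traceId, spanId + "." + n);
    }
}
```

Under these assumptions, the second RPC issued by the root gets spanId 0.2, and the first RPC issued while serving it gets 0.2.1, so the identifier alone encodes the call's position in the tree.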

Data Embedding SDK : A unified SDK instruments the various middleware (Thrift, HTTP, MySQL, Tair, MQ). It generates the trace context, stores it in a ThreadLocal for synchronous calls, and passes it explicitly for asynchronous calls.
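The difference between the synchronous and asynchronous cases can be sketched as follows. A ThreadLocal is only visible on the thread that set it, so a task handed to another thread has to carry the context with it. TraceHolder and wrap are hypothetical names, and the context is reduced to a String for brevity; this is not the SDK's actual class.

```java
/** Hypothetical holder for the current trace context; not the MTrace SDK's class. */
final class TraceHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String ctx) { CURRENT.set(ctx); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    /**
     * Captures the caller's context at wrap time and reinstalls it on whichever
     * thread later runs the task, restoring that thread's previous value afterwards.
     */
    static Runnable wrap(Runnable task) {
        final String captured = CURRENT.get();
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                task.run();
            } finally {
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }
}
```

With a helper like this, a synchronous call simply reads TraceHolder.get(), while an asynchronous one would submit TraceHolder.wrap(task) to its executor so the worker thread sees the submitting thread's context.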

Agent Layer : Acts as a data forwarder, enabling traffic control, data routing, and strategy changes without modifying business code.

APIs for Transparent Data Transmission :

put(map<String, String> data)
putOnce(map<String, String> data)

The put API propagates data through the entire request chain, while putOnce limits propagation to the next hop only.
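The two propagation scopes can be modelled with a small sketch. Only the names put and putOnce come from the article; the PassThrough class, its get and forNextHop methods, and the internal maps are assumptions used to illustrate the semantics, not the SDK's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of pass-through data; only put/putOnce are named in the article. */
final class PassThrough {
    private final Map<String, String> chain = new HashMap<>();        // forwarded on every hop
    private final Map<String, String> nextHopOnly = new HashMap<>();  // set here, sent one hop
    private final Map<String, String> receivedOnce = new HashMap<>(); // readable here, not forwarded

    void put(Map<String, String> data) { chain.putAll(data); }

    void putOnce(Map<String, String> data) { nextHopOnly.putAll(data); }

    /** Value visible to the current service, or null if the key did not reach it. */
    String get(String key) {
        String once = receivedOnce.get(key);
        return once != null ? once : chain.get(key);
    }

    /** Builds the data that the callee of the next RPC would receive. */
    PassThrough forNextHop() {
        PassThrough next = new PassThrough();
        next.chain.putAll(chain);              // put data keeps travelling
        next.receivedOnce.putAll(nextHopOnly); // putOnce data stops at the callee
        return next;
    }
}
```

In a chain A → B → C, a value that A stores with put is readable in both B and C, whereas a value stored with putOnce is readable in B only.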

Instrumentation Points (four stages):

Client Send – Span span = Tracer.clientSend(param);
Server Receive – Tracer.serverRecv(param);
Server Send – Tracer.serverSend();
Client Receive – Tracer.clientRecv();

These stages create and archive trace context, which is asynchronously uploaded via a Kafka layer to reduce impact on business services.
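The lifecycle behind the four stages can be sketched with a simplified stand-in. SketchTracer is hypothetical: it borrows the four method names from the article but takes a bare spanId instead of the real param object, and it replaces the Kafka upload path with an in-memory queue that a background thread could drain.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for the SDK's Tracer; only the four method names are from the article. */
final class SketchTracer {
    static final class Span {
        final String spanId;
        final String side;      // "client" or "server"
        final long startNanos;
        long endNanos;

        Span(String spanId, String side, long startNanos) {
            this.spanId = spanId;
            this.side = side;
            this.startNanos = startNanos;
        }
    }

    // Spans still open on this thread; a stack so nested calls close in reverse order.
    private static final ThreadLocal<Deque<Span>> OPEN = ThreadLocal.withInitial(ArrayDeque::new);

    // Finished spans awaiting asynchronous upload (the article uses a Kafka layer here).
    static final Queue<Span> ARCHIVE = new ConcurrentLinkedQueue<>();

    static Span clientSend(String spanId) { return open(spanId, "client"); }
    static Span serverRecv(String spanId) { return open(spanId, "server"); }
    static void serverSend() { close(); }
    static void clientRecv() { close(); }

    private static Span open(String spanId, String side) {
        Span span = new Span(spanId, side, System.nanoTime());
        OPEN.get().push(span);
        return span;
    }

    private static void close() {
        Span span = OPEN.get().pop();   // most recently opened span on this thread
        span.endNanos = System.nanoTime();
        ARCHIVE.add(span);              // hand off; the business thread never blocks on I/O
    }
}
```

Archiving to a queue rather than writing directly is the point of the design: the instrumented call only pays for a timestamp and an enqueue, and the upload happens elsewhere.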

Storage and Query : Real‑time trace data are stored in HBase using traceId as the row key for fast retrieval; offline analytics are performed in Hive for metrics such as service in‑degree/out‑degree.
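As an illustration of the in-degree/out-degree metric, the sketch below counts distinct callers and callees per service from caller → callee pairs extracted from spans. The article runs this kind of analysis offline in Hive; the Java class ServiceDegrees and its method names are assumptions that only show the computation itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that derives service in-degree and out-degree from call edges. */
final class ServiceDegrees {
    private final Map<String, Set<String>> callers = new HashMap<>(); // callee -> distinct callers
    private final Map<String, Set<String>> callees = new HashMap<>(); // caller -> distinct callees

    /** Records one observed RPC; repeated calls between the same pair count once. */
    void addCall(String caller, String callee) {
        callees.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
        callers.computeIfAbsent(callee, k -> new HashSet<>()).add(caller);
    }

    /** Number of distinct services that call the given service. */
    int inDegree(String service) {
        return callers.getOrDefault(service, Collections.emptySet()).size();
    }

    /** Number of distinct services that the given service calls. */
    int outDegree(String service) {
        return callees.getOrDefault(service, Collections.emptySet()).size();
    }
}
```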

Frontend Visualization : Because clocks on different machines can drift even with NTP, timestamps from different hosts are not directly comparable. The UI therefore orders spans primarily by spanId rather than by time, which sidesteps the clock inconsistencies.
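Ordering by spanId needs a numeric, segment-by-segment comparison, since a plain string sort would place 0.10 before 0.2. The comparator below is a minimal sketch that assumes dot-separated integer segments, as in the 0.2 example; the class name SpanIdOrder is hypothetical.

```java
import java.util.Comparator;

/** Orders hierarchical span ids such as "0.2" and "0.10" numerically, segment by segment. */
final class SpanIdOrder implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int shared = Math.min(x.length, y.length);
        for (int i = 0; i < shared; i++) {
            int c = Integer.compare(Integer.parseInt(x[i]), Integer.parseInt(y[i]));
            if (c != 0) {
                return c;
            }
        }
        // All shared segments are equal: the shorter id is the ancestor and sorts first.
        return Integer.compare(x.length, y.length);
    }
}
```

Sorting with this comparator yields a depth-first listing of the call tree, with each parent immediately followed by its children in the order the calls were issued, without relying on any host's clock.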

Benefits : Enables rapid bottleneck identification, service‑level performance statistics, and systematic optimization of call patterns (e.g., batch calls, reducing redundant invocations).

Summary : MTrace combines call‑chain tracing, data embedding, agent‑based routing, and scalable storage to provide a comprehensive observability platform for large‑scale microservice architectures.
