MTrace: Meituan‑Dianping Distributed Session Tracing System Design and Practice
The article introduces MTrace, Meituan‑Dianping’s large‑scale distributed session tracing system, explaining its call‑chain concept, architecture, data‑embedding SDK, trace and span identifiers, APIs for transparent data propagation, and how it enables bottleneck detection, performance optimization, and comprehensive monitoring across heterogeneous backend services.
The article, derived from Meituan‑Dianping’s Tech Salon session 08, presents the design and practice of MTrace, an internal distributed session tracing system that reconstructs call chains across services using a global traceId.
Core Concepts : traceId (64‑bit global identifier) and spanId (hierarchical identifier such as 0.2) uniquely mark each RPC in a distributed request. Annotations allow business‑side custom data (e.g., user ID) to be attached to the trace.
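The hierarchical spanId scheme can be sketched as follows. This is a minimal illustration, not MTrace's actual code: the class name `SpanIdSketch`, the root value `0`, child numbering starting at 1, and the random 64-bit traceId generator are all assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of hierarchical span identifiers; not MTrace's real classes. */
final class SpanIdSketch {
    private final String spanId;
    // Counts the outgoing calls made from this span so each child gets a unique index.
    private final AtomicInteger childCounter = new AtomicInteger(0);

    SpanIdSketch(String spanId) {
        this.spanId = spanId;
    }

    /** Assumption: the entry point of a request is labelled "0". */
    static SpanIdSketch root() {
        return new SpanIdSketch("0");
    }

    /** Each outgoing RPC appends the next child index: 0 -> 0.1, 0.2, ... */
    SpanIdSketch nextChild() {
        return new SpanIdSketch(spanId + "." + childCounter.incrementAndGet());
    }

    String value() {
        return spanId;
    }

    /** Assumption: a random 64-bit value stands in for the real traceId generator. */
    static long newTraceId() {
        return ThreadLocalRandom.current().nextLong();
    }
}
```

Under these assumptions, the second RPC issued by the root span gets spanId 0.2, and the first RPC issued from inside that call gets 0.2.1, so the identifier alone encodes the caller/callee relationship.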
Data Embedding SDK : Provides a unified SDK for various middleware (Thrift, HTTP, MySQL, Tair, MQ) to generate trace context, store it in ThreadLocal for synchronous calls, and explicitly pass it for asynchronous calls.
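The difference between ThreadLocal storage and explicit passing can be sketched as below. The names `TraceContextSketch`, `ContextHolderSketch`, and `wrap` are hypothetical and are not the SDK's API; the sketch only shows why a context held in a ThreadLocal is invisible to another thread unless it is captured and handed over.

```java
/** Hypothetical trace context; field names are assumptions for illustration. */
final class TraceContextSketch {
    final long traceId;
    final String spanId;

    TraceContextSketch(long traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }
}

/** Hypothetical holder showing ThreadLocal storage plus explicit hand-off to other threads. */
final class ContextHolderSketch {
    private static final ThreadLocal<TraceContextSketch> CURRENT = new ThreadLocal<>();

    static void set(TraceContextSketch context) {
        CURRENT.set(context);
    }

    static TraceContextSketch get() {
        return CURRENT.get();
    }

    static void clear() {
        CURRENT.remove();
    }

    /** Captures the caller's context now and installs it on whichever thread runs the task. */
    static Runnable wrap(Runnable task) {
        final TraceContextSketch captured = CURRENT.get();
        return () -> {
            TraceContextSketch previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker thread's earlier state so pooled threads do not leak context.
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }
}
```

A task submitted to another thread without `wrap` sees no context at all, while a wrapped task sees the traceId of the thread that created it.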
Agent Layer : Acts as a data forwarder, enabling traffic control, data routing, and strategy changes without modifying business code.
APIs for Transparent Data Transmission :
put(map<String, String> data)
putOnce(map<String, String> data)
The put API propagates data through the entire request chain, while putOnce limits propagation to the next hop only.
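The two propagation scopes can be modelled with a small sketch. `PropagationSketch`, `onRemoteCall`, and `get` are hypothetical names invented for this example; only `put` and `putOnce` come from the article, and their internals here are an assumption about how the described behavior could be achieved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of chain-wide versus one-hop propagation; not MTrace's real classes. */
final class PropagationSketch {
    private final Map<String, String> chainWide = new HashMap<>();    // forwarded on every hop
    private final Map<String, String> nextHopOnly = new HashMap<>();  // sent to the next hop only
    private final Map<String, String> receivedOnce = new HashMap<>(); // visible here, never forwarded

    void put(Map<String, String> data) {
        chainWide.putAll(data);
    }

    void putOnce(Map<String, String> data) {
        nextHopOnly.putAll(data);
    }

    /** Looks up a value visible to code running in the current service. */
    String get(String key) {
        String oneHop = receivedOnce.get(key);
        return oneHop != null ? oneHop : chainWide.get(key);
    }

    /** Builds the context that the downstream service receives on an outgoing call. */
    PropagationSketch onRemoteCall() {
        PropagationSketch downstream = new PropagationSketch();
        downstream.chainWide.putAll(chainWide);
        // One-hop data arrives downstream as "received" and is therefore not forwarded again.
        downstream.receivedOnce.putAll(nextHopOnly);
        return downstream;
    }
}
```

In this model, if service A calls `put` with a user ID and `putOnce` with a debug flag, service B sees both values, while service C (called by B) sees only the user ID.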
Instrumentation Points (four stages):
Client Send: Span span = Tracer.clientSend(param);
Server Receive: Tracer.serverRecv(param);
Server Send: Tracer.serverSend();
Client Receive: Tracer.clientRecv();
These stages create and archive trace context, which is asynchronously uploaded via a Kafka layer to reduce impact on business services.
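One way to see what the four stages make measurable is the sketch below. It is not the `Tracer` API: `FourStageSpanSketch` and its methods are hypothetical, and timestamps are passed in explicitly so the arithmetic is easy to follow. Each duration is computed from two timestamps taken on the same machine, so the result does not depend on the client and server clocks agreeing.

```java
/** Hypothetical record of the four instrumentation timestamps for one RPC. */
final class FourStageSpanSketch {
    private long clientSendAt;   // client clock
    private long serverRecvAt;   // server clock
    private long serverSendAt;   // server clock
    private long clientRecvAt;   // client clock

    void clientSend(long clientClockMillis) {
        clientSendAt = clientClockMillis;
    }

    void serverRecv(long serverClockMillis) {
        serverRecvAt = serverClockMillis;
    }

    void serverSend(long serverClockMillis) {
        serverSendAt = serverClockMillis;
    }

    void clientRecv(long clientClockMillis) {
        clientRecvAt = clientClockMillis;
    }

    /** Total latency as observed by the caller. */
    long clientDurationMillis() {
        return clientRecvAt - clientSendAt;
    }

    /** Time spent inside the callee. */
    long serverDurationMillis() {
        return serverSendAt - serverRecvAt;
    }

    /** Remainder attributed to network transfer and queueing outside the handler. */
    long transportMillis() {
        return clientDurationMillis() - serverDurationMillis();
    }
}
```

For example, with client timestamps 1000 and 1050 and server timestamps 5000 and 5030 (a deliberately skewed server clock), the caller observes 50 ms, the callee accounts for 30 ms, and the remaining 20 ms falls outside the server-side handler.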
Storage and Query : Real‑time trace data are stored in HBase using traceId as the row key for fast retrieval; offline analytics are performed in Hive for metrics such as service in‑degree/out‑degree.
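The in-degree/out-degree metric mentioned above can be illustrated in memory. The article performs this analysis in Hive; the Java sketch below only shows the computation itself, assuming each span has already been reduced to a caller/callee pair. `ServiceDegreeSketch` and `Call` are hypothetical names, and counting distinct services (rather than call volume) is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical offline metric: how many distinct services each service calls or is called by. */
final class ServiceDegreeSketch {
    static final class Call {
        final String caller;
        final String callee;

        Call(String caller, String callee) {
            this.caller = caller;
            this.callee = callee;
        }
    }

    /** Out-degree: number of distinct downstream services per caller. */
    static Map<String, Integer> outDegree(List<Call> calls) {
        Map<String, Set<String>> callees = new TreeMap<>();
        for (Call c : calls) {
            callees.computeIfAbsent(c.caller, k -> new TreeSet<>()).add(c.callee);
        }
        return sizes(callees);
    }

    /** In-degree: number of distinct upstream services per callee. */
    static Map<String, Integer> inDegree(List<Call> calls) {
        Map<String, Set<String>> callers = new TreeMap<>();
        for (Call c : calls) {
            callers.computeIfAbsent(c.callee, k -> new TreeSet<>()).add(c.caller);
        }
        return sizes(callers);
    }

    private static Map<String, Integer> sizes(Map<String, Set<String>> groups) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            result.put(e.getKey(), e.getValue().size());
        }
        return result;
    }
}
```

A service with a high in-degree is a shared dependency whose latency affects many callers, while a high out-degree marks a service whose own latency depends on many others.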
Frontend Visualization : Because clocks on different machines can drift even with NTP synchronization, the UI orders spans primarily by spanId rather than by timestamp, so clock skew between hosts does not distort the displayed call order.
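Ordering by spanId requires comparing the dot-separated segments numerically rather than as plain strings, otherwise 0.10 would sort before 0.2. A minimal sketch, assuming spanIds consist only of non-negative integers separated by dots (`SpanIdOrderSketch` is a hypothetical name, not part of MTrace):

```java
import java.util.Comparator;

/** Hypothetical comparator that orders hierarchical spanIds segment by segment. */
final class SpanIdOrderSketch {
    static final Comparator<String> BY_SPAN_ID = (a, b) -> {
        String[] left = a.split("\\.");
        String[] right = b.split("\\.");
        int shared = Math.min(left.length, right.length);
        for (int i = 0; i < shared; i++) {
            // Compare numerically so that segment 10 sorts after segment 2.
            int cmp = Integer.compare(Integer.parseInt(left[i]), Integer.parseInt(right[i]));
            if (cmp != 0) {
                return cmp;
            }
        }
        // A parent (shorter prefix) sorts before its children.
        return Integer.compare(left.length, right.length);
    };
}
```

Sorting the list 0.10, 0.2, 0, 0.2.1, 0.1 with this comparator yields 0, 0.1, 0.2, 0.2.1, 0.10, which places every span directly before its own children regardless of the timestamps recorded on each host.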
Benefits : Enables rapid bottleneck identification, service‑level performance statistics, and systematic optimization of call patterns (e.g., batch calls, reducing redundant invocations).
Summary : MTrace combines call‑chain tracing, data embedding, agent‑based routing, and scalable storage to provide a comprehensive observability platform for large‑scale microservice architectures.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
