How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations
CallGraph, JD.com’s in‑house distributed tracing platform, provides low‑intrusion, high‑performance monitoring for micro‑service ecosystems, enabling real‑time call‑graph analysis, TP metrics, flexible configuration, and future extensions such as deep‑learning‑driven insights.
Background and Motivation
With JD’s rapid business growth, the adoption of SOA and micro‑service architectures created a massive, dynamic web of inter‑dependent services. Traditional monitoring could not keep pace, prompting the need for a tool that could automatically capture, reconstruct, and analyze end‑to‑end call flows to support optimization, scaling, throttling, and degradation decisions.
CallGraph Overview
Inspired by Google’s Dapper paper, CallGraph is a JD‑developed, production‑grade distributed tracing system. It shares core capabilities with similar tools (e.g., Taobao Eagle Eye, Sina WatchMan) but adds JD‑specific features such as ultra‑low‑intrusion instrumentation and real‑time analytics.
Core Concept: Call Chain
A call chain records every hop from the initial request (web page, mobile client, etc.) to the final backend component (database, cache). Each request generates a globally unique TraceId that is propagated transparently across services, allowing scattered logs to be correlated and reassembled.
Key Features and Use Cases
CallGraph combines traditional monitoring with advanced capabilities:
Low‑intrusion instrumentation via bytecode enhancement.
Configurable sampling rates to balance data volume and performance.
Separate handling of TP (transaction‑per‑second) metrics and link logs for accurate latency reporting.
Real‑time dashboards (seconds‑level monitoring) powered by Jimdb.
Dynamic, multi‑dimensional configuration (by application, group, IP, etc.) for rapid enable/disable during peak events.
Design Goals
Low Intrusion : No code changes required for business services.
Minimal Performance Impact : In‑memory log buffers and asynchronous batch writes.
Flexible Policy : Per‑application sampling, TP tracking, and on‑the‑fly configuration.
Time‑Efficiency : End‑to‑end latency from log generation to visualization is kept as short as possible.
Architecture
The system consists of several layers:
CallGraph Core Package : Embedded in middleware JARs; generates logs and exposes APIs (startTrace, endTrace, clientSend, serverRecv, etc.).
JMQ : JD’s high‑throughput distributed message queue that transports logs to downstream processors.
Storm : Real‑time stream processing that parses logs, computes metrics, and routes results to storage.
Storage : Real‑time data stored in Jimdb, HBase, Elasticsearch; offline data kept in HDFS and Spark, with summary tables in MySQL.
CallGraph UI : Web interface for visualizing call graphs, dependency maps, entry/exit analysis, and per‑trace details.
UCC : JD’s distributed configuration service that holds all runtime settings.
Metadata Service : Stores mappings between trace signatures, applications, and service methods.
Technical Implementation
Instrumentation and Context Propagation
Instrumentation is performed by enhancing middleware JARs. For HTTP requests, a servlet filter invokes startTrace and endTrace. For RPC frameworks, the core package provides primitives (clientSend, serverRecv, serverSend, clientRecv) that automatically create and transmit a TraceId via ThreadLocal and network headers.
Cross‑process propagation uses the same TraceId embedded in request metadata. Asynchronous calls are handled by bytecode‑enhanced Thread and thread‑pool classes, ensuring child threads inherit the parent context without developer intervention.
Log Format
Each log entry contains a fixed section (TraceId, RpcId, timestamp, call type, peer IP, latency, result, middleware‑specific fields, payload size) and a variable section for custom user data. Encoders specific to each middleware serialize the fields.
High‑Performance Log Output
Logs are written to an in‑memory virtual disk, avoiding physical I/O. A dedicated log module batches writes asynchronously and can drop logs under overload to protect the business workload.
TP vs. Link Logs
TP (transaction‑per‑second) metrics are sampled at a lower rate (e.g., 1/1000) but always recorded for every call to preserve accurate latency percentiles. Link logs are sampled more aggressively. Separate pipelines process each type.
Real‑Time Configuration
During high‑traffic events (e.g., Double‑11), operators can adjust sampling rates, enable/disable tracing, or toggle TP collection via the UI. Settings are stored in UCC and synchronized to local config files; a daemon thread in the core package reloads them periodically.
Storm Stream Processing
All logs flow through Storm bolts. Link‑log bolts aggregate call counts, average latency, error rates, and store results in HBase/HDFS for offline analysis. TP‑log bolts compute per‑second metrics and write them to Jimdb for the UI’s second‑level dashboards.
Future Roadmap
Reduce end‑to‑end latency to true real‑time by optimizing the log‑collection‑Storm path.
Introduce automated error detection and alerting.
Support additional middleware to extend trace coverage.
Expose a public API for other teams to consume trace data.
Apply deep‑learning models to historical trace data for predictive insights.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
