CallGraph: JD.com's Distributed Tracing and Service Governance Platform
CallGraph is JD.com's internally developed distributed tracing and service governance platform, built to address the difficulty of monitoring complex microservice architectures. It provides low‑intrusion, low‑latency tracing, real‑time analytics, and configurable sampling, and it integrates with JMQ, Storm, Spark, HBase, and JimDB to deliver both operational insight and performance optimization.
Background – With JD.com’s rapid business growth and the adoption of SOA and micro‑service strategies, the number and complexity of distributed applications exceeded manual monitoring capabilities, creating a need for a tool that could visualize system behavior and support process, architecture, and performance optimizations.
Core Concept – CallGraph is based on Google’s Dapper paper and implements a trace‑centric model where each request generates a globally unique TraceId that is propagated transparently across services, allowing isolated logs to be linked into complete call chains.
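The trace model above can be sketched in a few lines. This is an illustrative Dapper-style context, not CallGraph's actual code: the field names `traceId` and `rpcId` follow the log format described later in this article, but the ID-generation scheme (UUID trace IDs, dotted hierarchical RPC IDs) is an assumption for illustration.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class TraceContext {
    private final String traceId;  // globally unique per request
    private final String rpcId;    // position within the call tree, e.g. "0.1.2"
    private final AtomicInteger childSeq = new AtomicInteger(0);

    private TraceContext(String traceId, String rpcId) {
        this.traceId = traceId;
        this.rpcId = rpcId;
    }

    // Start a new trace at the entry point of a request.
    public static TraceContext newTrace() {
        return new TraceContext(UUID.randomUUID().toString(), "0");
    }

    // Derive the context propagated to a downstream service call;
    // the TraceId is carried over unchanged, linking the logs together.
    public TraceContext childContext() {
        return new TraceContext(traceId, rpcId + "." + childSeq.incrementAndGet());
    }

    public String traceId() { return traceId; }
    public String rpcId()   { return rpcId; }
}
```

Because every downstream context shares the root's TraceId, logs emitted anywhere in the call tree can later be joined into one complete chain.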
Features & Use Cases – Beyond standard monitoring, CallGraph offers low‑intrusion instrumentation, configurable sampling, real‑time TP metrics, and visualizations such as dependency graphs, entry analysis, and detailed path inspection, enabling rapid troubleshooting and data‑driven decision making.
Design Goals – The system is designed for low invasiveness, minimal performance impact, flexible configuration, and time‑efficiency from data collection to presentation.
Architecture – The architecture consists of a core instrumentation package, JMQ for log transport, Storm for stream processing, and storage layers (real‑time: JimDB, HBase, ES; offline: HDFS, Spark, MySQL). A UI provides interactive exploration, while UCC stores configuration metadata.
Technical Implementation – Instrumentation – Front‑end applications and middleware jars embed the core package; the package exposes APIs (clientSend, serverRecv, etc.) and uses ThreadLocal for context propagation. Byte‑code enhancement enables transparent context transfer across threads and thread pools.
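A minimal sketch of ThreadLocal-based context propagation follows. The article says CallGraph carries context across threads and thread pools via byte-code enhancement; the `wrap` method here is the hand-written equivalent of what such enhancement injects automatically (capture the submitting thread's context, restore it inside the task). Names and structure are assumptions for illustration.

```java
public class TraceHolder {
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static void set(String traceId) { CONTEXT.set(traceId); }
    public static String get()             { return CONTEXT.get(); }
    public static void clear()             { CONTEXT.remove(); }

    // Capture the submitting thread's context and restore it inside the task,
    // so the context survives the hop into another thread or a thread pool.
    public static Runnable wrap(Runnable task) {
        final String captured = CONTEXT.get();
        return () -> {
            String previous = CONTEXT.get();
            CONTEXT.set(captured);
            try {
                task.run();
            } finally {
                CONTEXT.set(previous);  // don't leak context into pooled threads
            }
        };
    }
}
```

The `finally` restore matters in thread pools: worker threads are reused, so a context left behind would bleed into unrelated requests.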
Technical Implementation – Log Format – Logs contain a fixed part (TraceId, RpcId, timestamps, type, IP, latency, result, middleware‑specific data, payload size) and a variable part for custom fields, with dedicated encoders for different scenarios.
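The fixed-plus-variable split can be pictured as a simple record. The fixed fields below mirror the ones listed above; the key names and the pipe-delimited encoding are assumptions standing in for CallGraph's dedicated encoders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TraceLogEntry {
    // Fixed part: present in every log line.
    public final String traceId;
    public final String rpcId;
    public final long timestampMs;
    public final String type;    // e.g. RPC client/server, MQ, cache
    public final String ip;
    public final long latencyMs;
    public final boolean success;

    // Variable part: middleware- or business-specific key/value fields.
    public final Map<String, String> extra = new LinkedHashMap<>();

    public TraceLogEntry(String traceId, String rpcId, long timestampMs,
                         String type, String ip, long latencyMs, boolean success) {
        this.traceId = traceId;
        this.rpcId = rpcId;
        this.timestampMs = timestampMs;
        this.type = type;
        this.ip = ip;
        this.latencyMs = latencyMs;
        this.success = success;
    }

    // One possible flat encoding: fixed fields first, then k=v pairs.
    public String encode() {
        StringBuilder sb = new StringBuilder();
        sb.append(traceId).append('|').append(rpcId).append('|')
          .append(timestampMs).append('|').append(type).append('|')
          .append(ip).append('|').append(latencyMs).append('|')
          .append(success ? 1 : 0);
        for (Map.Entry<String, String> e : extra.entrySet()) {
            sb.append('|').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```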
Technical Implementation – High‑Performance Log Output – Logs are written to an in‑memory disk to avoid I/O contention, batched asynchronously, and can be dropped under overload to protect business workloads.
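The "drop under overload" policy can be sketched with a bounded queue: logs are enqueued without blocking and a background flusher drains them in batches, so a full queue costs the business thread nothing but a discarded log line. Queue capacity and batch size here are arbitrary illustrative choices, not CallGraph's actual tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncLogWriter {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public AsyncLogWriter(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller: on overflow the entry is dropped and counted,
    // protecting the business workload at the cost of a lost trace log.
    public boolean write(String line) {
        boolean ok = queue.offer(line);
        if (!ok) dropped.incrementAndGet();
        return ok;
    }

    // Called by a background flusher thread; drains up to batchSize entries
    // for a single batched write to the (in-memory) log destination.
    public List<String> drainBatch(int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public long droppedCount() { return dropped.get(); }
}
```

Counting drops instead of silently discarding keeps the degradation observable, which matters when deciding whether sampling rates need adjusting.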
Technical Implementation – TP Log Separation – TP metrics are collected for every request, stored separately from link logs, and processed by dedicated bolts to ensure accurate latency statistics.
Technical Implementation – Real‑Time Configuration – Configuration (sampling rates, trace enablement, TP tracking) is managed via UCC and synchronized to local files; the core daemon reloads changes instantly, with fallback to manual file edits if UCC fails.
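A hot-reloadable sampler gives a feel for how such configuration changes take effect instantly. The "1-in-N" interpretation of the sampling rate is an assumption; the article only states that rates are pushed via UCC, mirrored to a local file, and reloaded by the core daemon.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class Sampler {
    private final AtomicInteger sampleOneInN;
    private final AtomicLong counter = new AtomicLong();

    public Sampler(int initialRate) {
        this.sampleOneInN = new AtomicInteger(initialRate);
    }

    // Invoked when the daemon detects an updated local config file;
    // in-flight requests see the new rate on their very next decision.
    public void reload(int newRate) { sampleOneInN.set(newRate); }

    // Decide whether this request's trace should be recorded.
    public boolean shouldSample() {
        int n = sampleOneInN.get();
        if (n <= 1) return true;  // rate 1 (or less) means trace everything
        return counter.incrementAndGet() % n == 0;
    }
}
```

Keeping the rate in an `AtomicInteger` means no locks on the request path, so reconfiguration stays invisible to business latency.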
Storm Stream Processing – All logs pass through Storm bolts for real‑time and offline analysis; real‑time bolts aggregate TP data into JimDB for sub‑hour dashboards, while offline bolts feed Spark/HBase pipelines for long‑term metrics.
Real‑Time Monitoring (Second‑Level Monitoring) – By storing aggregated TP and link metrics in JimDB with expiration, CallGraph delivers near‑real‑time insights (within seconds) for call volume, error rates, and latency distribution.
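Second-granularity TP aggregation can be sketched as bucketing latencies by epoch second and computing a top percentile per bucket. In CallGraph the aggregates land in JimDB with an expiration; a plain in-process map stands in for that store here, and the exact percentile formula is one common convention, not necessarily CallGraph's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TpAggregator {
    // One latency bucket per epoch second (JimDB with TTL in the real system).
    private final Map<Long, List<Long>> bucketsBySecond = new HashMap<>();

    public void record(long epochMillis, long latencyMs) {
        bucketsBySecond.computeIfAbsent(epochMillis / 1000, k -> new ArrayList<>())
                       .add(latencyMs);
    }

    // TP (top percentile) latency for one second, e.g. p = 99 for TP99.
    // Returns -1 when no data was recorded in that second.
    public long tp(long epochSecond, int p) {
        List<Long> latencies = bucketsBySecond.get(epochSecond);
        if (latencies == null || latencies.isEmpty()) return -1;
        List<Long> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```

Bucketing by second is what makes "within seconds" dashboards possible: each bucket is complete almost as soon as its second ends, so the metric can be published immediately.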
Future Roadmap – Planned improvements include reducing end‑to‑end latency to true real‑time, enhancing error detection and alerting, expanding middleware support, exposing full APIs for external consumption, and applying deep‑learning techniques to extract richer insights from historical trace data.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.