Operations 15 min read

How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations

CallGraph, JD.com’s in‑house distributed tracing platform, provides low‑intrusion, high‑performance monitoring for micro‑service ecosystems, enabling real‑time call‑graph analysis, TP metrics, flexible configuration, and future extensions such as deep‑learning‑driven insights.

dbaplus Community
dbaplus Community
dbaplus Community
How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations

Background and Motivation

With JD’s rapid business growth, the adoption of SOA and micro‑service architectures created a massive, dynamic web of inter‑dependent services. Traditional monitoring could not keep pace, prompting the need for a tool that could automatically capture, reconstruct, and analyze end‑to‑end call flows to support optimization, scaling, throttling, and degradation decisions.

CallGraph Overview

Inspired by Google’s Dapper paper, CallGraph is a JD‑developed, production‑grade distributed tracing system. It shares core capabilities with similar tools (e.g., Taobao Eagle Eye, Sina WatchMan) but adds JD‑specific features such as ultra‑low‑intrusion instrumentation and real‑time analytics.

Core Concept: Call Chain

A call chain records every hop from the initial request (web page, mobile client, etc.) to the final backend component (database, cache). Each request generates a globally unique TraceId that is propagated transparently across services, allowing scattered logs to be correlated and reassembled.

Key Features and Use Cases

CallGraph combines traditional monitoring with advanced capabilities:

Low‑intrusion instrumentation via bytecode enhancement.

Configurable sampling rates to balance data volume and performance.

Separate handling of TP (transaction‑per‑second) metrics and link logs for accurate latency reporting.

Real‑time dashboards (seconds‑level monitoring) powered by Jimdb.

Dynamic, multi‑dimensional configuration (by application, group, IP, etc.) for rapid enable/disable during peak events.

Design Goals

Low Intrusion : No code changes required for business services.

Minimal Performance Impact : In‑memory log buffers and asynchronous batch writes.

Flexible Policy : Per‑application sampling, TP tracking, and on‑the‑fly configuration.

Time‑Efficiency : End‑to‑end latency from log generation to visualization is kept as short as possible.

Architecture

The system consists of several layers:

CallGraph Core Package : Embedded in middleware JARs; generates logs and exposes APIs (startTrace, endTrace, clientSend, serverRecv, etc.).

JMQ : JD’s high‑throughput distributed message queue that transports logs to downstream processors.

Storm : Real‑time stream processing that parses logs, computes metrics, and routes results to storage.

Storage : Real‑time data stored in Jimdb, HBase, Elasticsearch; offline data kept in HDFS and Spark, with summary tables in MySQL.

CallGraph UI : Web interface for visualizing call graphs, dependency maps, entry/exit analysis, and per‑trace details.

UCC : JD’s distributed configuration service that holds all runtime settings.

Metadata Service : Stores mappings between trace signatures, applications, and service methods.

Technical Implementation

Instrumentation and Context Propagation

Instrumentation is performed by enhancing middleware JARs. For HTTP requests, a servlet filter invokes startTrace and endTrace. For RPC frameworks, the core package provides primitives (clientSend, serverRecv, serverSend, clientRecv) that automatically create and transmit a TraceId via ThreadLocal and network headers.

Cross‑process propagation uses the same TraceId embedded in request metadata. Asynchronous calls are handled by bytecode‑enhanced Thread and thread‑pool classes, ensuring child threads inherit the parent context without developer intervention.

Log Format

Each log entry contains a fixed section (TraceId, RpcId, timestamp, call type, peer IP, latency, result, middleware‑specific fields, payload size) and a variable section for custom user data. Encoders specific to each middleware serialize the fields.

High‑Performance Log Output

Logs are written to an in‑memory virtual disk, avoiding physical I/O. A dedicated log module batches writes asynchronously and can drop logs under overload to protect the business workload.

TP vs. Link Logs

TP (transaction‑per‑second) metrics are sampled at a lower rate (e.g., 1/1000) but always recorded for every call to preserve accurate latency percentiles. Link logs are sampled more aggressively. Separate pipelines process each type.

Real‑Time Configuration

During high‑traffic events (e.g., Double‑11), operators can adjust sampling rates, enable/disable tracing, or toggle TP collection via the UI. Settings are stored in UCC and synchronized to local config files; a daemon thread in the core package reloads them periodically.

Storm Stream Processing

All logs flow through Storm bolts. Link‑log bolts aggregate call counts, average latency, error rates, and store results in HBase/HDFS for offline analysis. TP‑log bolts compute per‑second metrics and write them to Jimdb for the UI’s second‑level dashboards.

Future Roadmap

Reduce end‑to‑end latency to true real‑time by optimizing the log‑collection‑Storm path.

Introduce automated error detection and alerting.

Support additional middleware to extend trace coverage.

Expose a public API for other teams to consume trace data.

Apply deep‑learning models to historical trace data for predictive insights.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringDistributed TracingLog Processing
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.