How Distributed Tracing Powers Modern Microservices: From Zipkin to EagleEye
This article explains why distributed systems need tracing, outlines design goals, compares major implementations like Zipkin, EagleEye, and Hydra, and details the data collection, storage, and analysis pipelines that enable end‑to‑end visibility and performance optimization in large‑scale services.
Distributed Tracing Systems
Why do distributed systems need tracing?
In an e‑commerce platform composed of hundreds of distributed services, each request leaves footprints across multiple business systems and accesses various caches or databases. Collecting and analyzing these scattered logs is essential for troubleshooting and process optimization. The goal of distributed tracing is to track the complete call chain of each request, gather performance data from every service, compute metrics against SLAs, and eventually feed this information back into service governance.
Industry examples include Twitter's Zipkin, Alibaba's EagleEye, JD.com's Hydra, and eBay's CAL. Most of them draw on Google’s Dapper paper, much as Hadoop grew out of Google’s MapReduce paper.
Typical design goals for a tracing system are:
Low invasiveness – the tracing component should be transparent to business code and impose minimal burden on developers.
Flexible collection policy – the scope and granularity of data can be adjusted at any time.
Timeliness – data collection, processing, and visualization must happen as quickly as possible.
Decision support – the data should aid DevOps decisions.
Effective visualization.
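A flexible collection policy is often realized as a sampling rate that can be changed while the service is running. The sketch below is one hypothetical way to do that, not the mechanism used by any of the systems named here. `AdjustableSampler` and its method names are illustrative, and the keep-or-drop decision is derived from a hash of the trace ID on the assumption that every service should reach the same decision for a given trace.

```python
import hashlib
import threading


class AdjustableSampler:
    """Hypothetical sampler whose rate can be changed at runtime."""

    def __init__(self, rate: float = 1.0) -> None:
        self._lock = threading.Lock()
        self._rate = 0.0
        self.set_rate(rate)

    def set_rate(self, rate: float) -> None:
        """Change the fraction of traces to keep (0.0 = none, 1.0 = all)."""
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        with self._lock:
            self._rate = rate

    def should_sample(self, trace_id: str) -> bool:
        # Map the trace ID to a stable bucket in [0, 10000) so that every
        # service seeing the same trace ID makes the same decision.
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 10_000
        with self._lock:
            return bucket < self._rate * 10_000
```

An operator-facing control could call `set_rate(0.01)` during peak traffic and `set_rate(1.0)` while investigating an incident, without restarting the service.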
Visual examples
Below are screenshots of the call‑chain UI for Zipkin, EagleEye, and Hydra.
Hovering over each layer of the call chain reveals execution time, host IP, database operations, input parameters, and even error stacks.
How Alibaba implements tracing (EagleEye)
In EagleEye, a request’s entire call chain is called a "trace". When a request arrives, a filter‑like component assigns a globally unique TraceId and stores it in a ThreadLocal context. An additional identifier, RpcId, records the order and nesting of RPC calls within the same trace. The front‑end request always starts with RpcId = 0.
When the front‑end service makes an RPC call, the HSF client retrieves the current context from ThreadLocal, derives the next child RpcId (e.g., 0.1 under the root 0), and attaches both TraceId and RpcId to the outgoing request. The receiving HSF server extracts the context, places it in its own ThreadLocal, and repeats the process for its downstream calls, generating hierarchical RpcIds such as 0.1.1. After processing, the server logs TraceId, RpcId, and other metadata, then clears the ThreadLocal.
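The propagation steps above can be sketched in a few lines. This is an illustration only, not HSF or EagleEye code: the function names and header keys are hypothetical, Python's `contextvars` stands in for the Java ThreadLocal, and the sibling numbering (0.1, 0.2, ...) is an assumption extrapolated from the 0.1 and 0.1.1 examples in the text.

```python
import contextvars
import uuid
from dataclasses import dataclass
from typing import Dict


@dataclass
class TraceContext:
    trace_id: str
    rpc_id: str
    child_count: int = 0  # outgoing calls made so far under this rpc_id


# Python stand-in for the ThreadLocal slot described above.
_current = contextvars.ContextVar("trace_context", default=None)


def start_trace() -> TraceContext:
    """Entry filter: create a new trace whose root RpcId is 0."""
    ctx = TraceContext(trace_id=uuid.uuid4().hex, rpc_id="0")
    _current.set(ctx)
    return ctx


def client_send() -> Dict[str, str]:
    """Client side: derive the next child RpcId and return headers to attach."""
    ctx = _current.get()
    if ctx is None:
        raise RuntimeError("no active trace context")
    ctx.child_count += 1
    return {"trace_id": ctx.trace_id, "rpc_id": f"{ctx.rpc_id}.{ctx.child_count}"}


def server_receive(headers: Dict[str, str]) -> TraceContext:
    """Server side: rebuild the context from the incoming request headers."""
    ctx = TraceContext(trace_id=headers["trace_id"], rpc_id=headers["rpc_id"])
    _current.set(ctx)
    return ctx


def server_finish() -> Dict[str, str]:
    """Server side: produce the log record, then clear the context."""
    ctx = _current.get()
    if ctx is None:
        raise RuntimeError("no active trace context")
    record = {"trace_id": ctx.trace_id, "rpc_id": ctx.rpc_id}
    _current.set(None)
    return record
```

After `start_trace()`, two `client_send()` calls produce RpcIds 0.1 and 0.2. A server that receives 0.1 and calls downstream produces 0.1.1, always with the same TraceId.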
This mechanism enables accurate reconstruction of the entire call chain from logs, supports asynchronous or broadcast calls, and simplifies debugging.
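Because each RpcId encodes its own position in the hierarchy, the call tree of a trace can be rebuilt from log records alone. The following is a minimal sketch, assuming each record is a dict with hypothetical `rpc_id` and `service` fields and that all records belong to one TraceId. The actual EagleEye log format is not described in this article.

```python
from typing import Dict, List


def sort_key(rpc_id: str) -> List[int]:
    """Turn "0.1.10" into [0, 1, 10] so that ordering is numeric, not textual."""
    return [int(part) for part in rpc_id.split(".")]


def render_call_tree(records: List[Dict[str, str]]) -> List[str]:
    """Order one trace's records by RpcId and indent each line by call depth."""
    ordered = sorted(records, key=lambda r: sort_key(r["rpc_id"]))
    lines = []
    for record in ordered:
        depth = record["rpc_id"].count(".")  # "0" is depth 0, "0.1" is depth 1
        lines.append("  " * depth + f'{record["rpc_id"]} {record["service"]}')
    return lines
```

Sorting the numeric components lexicographically places every call directly after its parent and before its next sibling, so records can arrive from different hosts in any order.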
How JD.com implements tracing (Hydra)
JD adopts the open‑source Dubbo RPC framework; Hydra builds on Dubbo to achieve near‑zero intrusion. The domain model and storage architecture are illustrated below.
Hydra stores trace data in HBase.
How the internal "Wowo" system implements tracing
In 2012 the team recognized the need for a unified tracing system and decided to instrument RPC frameworks rather than individual services. Development began in 2013, using a Java agent to weave tracing logic into annotated methods through bytecode instrumentation and keeping the trace context in a ThreadLocal.
Instrumentation via a Java agent: the agent's premain method registers a bytecode transformer that injects tracing code into methods marked with a @Trace annotation.
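The effect of weaving tracing logic into annotated methods can be illustrated with a decorator. This is an analogy rather than the Wowo implementation: Python has no Java agent, so the decorator stands in for the injected bytecode, and `trace`, `SPANS`, and `load_user` are hypothetical names.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory stand-in for the trace log that would be shipped to a collector.
SPANS: List[Dict[str, Any]] = []


def trace(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a function so every call records its name, duration, and outcome."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "fail"
            raise  # tracing must never swallow the business exception
        finally:
            SPANS.append(
                {
                    "name": func.__name__,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                    "status": status,
                }
            )

    return wrapper


@trace
def load_user(user_id: int) -> Dict[str, int]:
    """Example business method; its body contains no tracing code."""
    return {"id": user_id}
```

The business method stays free of tracing code, which is the low-invasiveness property that the agent-based approach aims for.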
Data aggregation: trace logs are sent in real time to Flume agents, then to a Flume collector.
Storage: Flume sinks write data to HDFS and HBase; temporary files in HDFS are rolled into "done" files every five minutes.
Analysis: a load job moves completed files into Hive partitions; an analysis job runs every five minutes to generate statistics such as request counts per layer and latency distributions.
Visualization: a Python/Django front‑end displays performance curves, anomaly charts, and per‑service metrics.
These visualizations allow engineers to pinpoint slow components, categorize exceptions (memcached, redis, mongodb, mysql, runtime, fail), and compare current performance against historical baselines.
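The per-layer request counts and latency distributions produced by the analysis job can be approximated as follows. This sketch uses plain Python instead of Hive, and the record fields `ts` (epoch seconds), `layer`, and `latency_ms` are assumptions, since the real schema is not given here.

```python
import math
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

WINDOW_SECONDS = 300  # five-minute buckets, matching the job interval


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(
    records: Iterable[Dict[str, Any]],
) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Group records by (window start, layer) and compute count, p50, and p99."""
    groups: Dict[Tuple[int, str], List[float]] = defaultdict(list)
    for record in records:
        window_start = int(record["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        key = (window_start, str(record["layer"]))
        groups[key].append(float(record["latency_ms"]))

    summary: Dict[Tuple[int, str], Dict[str, float]] = {}
    for key, latencies in groups.items():
        latencies.sort()
        summary[key] = {
            "count": len(latencies),
            "p50_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99),
        }
    return summary
```

Comparing the summary of the current window with the same window on an earlier day gives the kind of historical baseline described above.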
The Wowo tracing system is now integrated into the Operations Automation Platform (OAP).
