Design and Comparison of Distributed Tracing Systems

The article explains the concept, functions, design goals, data models, log collection, and deployment considerations of distributed tracing systems, and compares several open‑source and proprietary solutions such as Dapper, Zipkin, Pinpoint, Alibaba Eagle Eye, and JD Hydra to guide the selection of an appropriate tracing platform.

Architecture Digest

Why Distributed Tracing Is Needed

As internet architectures expand, distributed systems become increasingly complex, with many components—micro‑services, messaging, distributed databases, caches, object storage, cross‑domain calls—forming a tangled network. When a request fails somewhere in this chain, developers often have to inspect logs of each service manually, which is highly inefficient.

Distributed tracing reconstructs a request’s path across services, showing node‑level latency, the exact machine handling each call, and the status of every service node.

Functions of a Tracing System

1. Fast Fault Localization

By attaching a trace ID to business logs, the complete logical trajectory of a request can be displayed, allowing developers to quickly locate errors using the trace ID together with business logs.
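As a minimal sketch of attaching a trace ID to business logs, the snippet below uses Python's standard `logging` and `contextvars` modules. The logger name `orders`, the variable `current_trace_id`, and the log format are illustrative assumptions, not something the article prescribes.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request being handled; "-" means "no trace".
# (The variable name is an assumption made for this sketch.)
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(trace_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    # Bind the trace ID for the duration of the request, then restore it.
    token = current_trace_id.set(trace_id)
    try:
        logger.info("order created")
    finally:
        current_trace_id.reset(token)


buffer = io.StringIO()
log = build_logger(buffer)
handle_request(log, "trace-abc123")
output = buffer.getvalue()
```

With every line prefixed this way, searching the business logs for one trace ID returns the request's full trajectory on that machine.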

2. Performance Analysis of Each Call Segment

Recording latency for each segment of the trace makes performance bottlenecks visible; per-segment metrics such as average latency and QPS help pinpoint weak spots and guide targeted optimizations (for example, adding data redundancy).
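The per-segment analysis can be sketched as a simple aggregation. The span dictionaries and their field names (`service`, `latency_ms`) are assumptions made for illustration; a real system would read them from collected trace logs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spans, one per call segment.
spans = [
    {"service": "order", "latency_ms": 120},
    {"service": "order", "latency_ms": 80},
    {"service": "inventory", "latency_ms": 30},
    {"service": "payment", "latency_ms": 400},
    {"service": "payment", "latency_ms": 200},
]


def latency_by_service(spans):
    """Group latencies by service and compute count, average, and maximum."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span["latency_ms"])
    return {
        service: {
            "count": len(values),
            "avg_ms": mean(values),
            "max_ms": max(values),
        }
        for service, values in grouped.items()
    }


def slowest_service(stats):
    """Return the service with the highest average latency."""
    return max(stats, key=lambda service: stats[service]["avg_ms"])


stats = latency_by_service(spans)
bottleneck = slowest_service(stats)
```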

3. Business Data Correlation

Binding business data to a trace reveals user behavior paths across services, supporting aggregated analysis for many scenarios.

4. Service Topology Visualization

Visualizing modules and their relationships produces a service topology map; clicking a node shows its details, current status, and request volume.
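One way to derive such a topology is to count caller-to-callee edges across the collected spans. The sketch below assumes each call record carries hypothetical `caller` and `callee` fields; the edge counts double as a rough request volume per node.

```python
from collections import Counter

# Hypothetical call records extracted from trace logs.
calls = [
    {"caller": "gateway", "callee": "order"},
    {"caller": "gateway", "callee": "order"},
    {"caller": "order", "callee": "inventory"},
    {"caller": "order", "callee": "payment"},
    {"caller": "gateway", "callee": "user"},
]


def build_topology(calls):
    """Count how often each (caller, callee) edge appears."""
    edges = Counter()
    for call in calls:
        edges[(call["caller"], call["callee"])] += 1
    return edges


def request_volume(edges, service):
    """Total number of calls received by a service."""
    return sum(
        count for (_, callee), count in edges.items() if callee == service
    )


edges = build_topology(calls)
```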

Design Goals of a Distributed Tracing System

Low intrusion and transparency: the tracing component should be non‑intrusive or minimally intrusive to business code.

Low overhead: tracing should impose minimal performance cost, often achieved by sampling a subset of requests.

Large‑scale deployment and scalability: the system must support distributed deployment and scale with the overall architecture.
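The low-overhead goal above mentions sampling a subset of requests. A minimal sketch of one common approach follows: the decision is derived from a hash of the trace ID, so every service reaches the same verdict for the same request without coordination. The function name and the bucket count of 10,000 are assumptions for illustration.

```python
import zlib

BUCKETS = 10_000  # resolution of the sampling rate (assumed value)


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace is sampled.

    The same trace ID always maps to the same bucket, so all services
    that see this trace agree on whether to record it.
    """
    bucket = zlib.crc32(trace_id.encode("utf-8")) % BUCKETS
    return bucket < round(rate * BUCKETS)


trace_ids = [f"trace-{i}" for i in range(1000)]
sampled = [t for t in trace_ids if should_sample(t, 0.1)]
```

Because the threshold only grows with the rate, any trace sampled at 10% is also sampled at a higher rate, which keeps traces complete when services are configured with different rates.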

Instrumentation and Log Generation

Instrumentation (tracing points embedded at call sites) captures the context at each node and can be placed on the client side, the server side, or both. Typical log fields include TraceId, RpcId, start time, call type, protocol, caller IP/port, service name, latency, result, exception info, message payload, and extensible custom fields.
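The field list can be pictured as a small record type written out as one JSON line per call. The class name `SpanLog`, the snake_case field names, and the JSON encoding are assumptions for this sketch; the article only names the fields.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SpanLog:
    """One trace log entry, mirroring the typical fields listed above."""

    trace_id: str
    rpc_id: str
    start_time_ms: int
    call_type: str
    protocol: str
    caller_ip: str
    caller_port: int
    service_name: str
    latency_ms: int
    result: str
    exception: str = ""
    payload: str = ""
    extras: dict = field(default_factory=dict)  # extensible custom fields

    def to_log_line(self) -> str:
        """Serialize the entry as a single JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


record = SpanLog(
    trace_id="t-1",
    rpc_id="0.1",
    start_time_ms=1_700_000_000_000,
    call_type="rpc",
    protocol="http",
    caller_ip="10.0.0.5",
    caller_port=8080,
    service_name="order",
    latency_ms=42,
    result="OK",
    extras={"tenant": "demo"},
)
line = record.to_log_line()
```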

Log Collection and Storage

Logs are usually collected via open‑source tools (e.g., Flume + Kafka) using both offline and real‑time pipelines.

Trace Data Analysis and Statistics

Logs from all servers are aggregated by TraceId, then ordered by RpcId to reconstruct the call chain, tolerating occasional missing logs.
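The aggregation step can be sketched as follows, assuming a dotted hierarchical RpcId such as `0`, `0.1`, `0.1.1` (this format is an assumption; the article does not specify it). Sorting by the numeric parts restores call order, and indentation is derived from the ID depth alone, so a missing intermediate log does not break the reconstruction.

```python
from collections import defaultdict


def rpc_key(rpc_id):
    """Turn '0.1.2' into (0, 1, 2) so IDs sort numerically, not as text."""
    return tuple(int(part) for part in rpc_id.split("."))


def group_by_trace(logs):
    """Bucket log entries from all servers by their trace ID."""
    traces = defaultdict(list)
    for entry in logs:
        traces[entry["trace_id"]].append(entry)
    return traces


def render_chain(entries):
    """Order one trace's entries by RpcId and indent by call depth."""
    ordered = sorted(entries, key=lambda e: rpc_key(e["rpc_id"]))
    return [
        "  " * (len(rpc_key(e["rpc_id"])) - 1) + f'{e["rpc_id"]} {e["service"]}'
        for e in ordered
    ]


# Hypothetical logs arriving out of order from several servers.
logs = [
    {"trace_id": "t1", "rpc_id": "0.2", "service": "payment"},
    {"trace_id": "t2", "rpc_id": "0.1.1", "service": "db"},
    {"trace_id": "t1", "rpc_id": "0", "service": "gateway"},
    {"trace_id": "t1", "rpc_id": "0.1.1", "service": "db"},
    {"trace_id": "t2", "rpc_id": "0", "service": "gateway"},
    {"trace_id": "t1", "rpc_id": "0.1", "service": "order"},
]

traces = group_by_trace(logs)
chain_t1 = render_chain(traces["t1"])
chain_t2 = render_chain(traces["t2"])  # the "0.1" log is missing here
```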

Computation and Presentation

Aggregated trace logs are stored in HBase or relational databases for visual querying and analysis.

Trace Model Terminology

Trace: a complete distributed call chain.

Span: a single service call; multiple spans form a tree representing a Trace.

Annotation: a timestamped event within a span.

BinaryAnnotation: a user‑defined key‑value annotation.

Standard annotation types: CLIENT_SEND (Cs), CLIENT_RECV (Cr), SERVER_RECV (Sr), SERVER_SEND (Ss).

User‑defined types: Event (general events), Exception (error events), Client & Server (roles in cross‑service calls).
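The four standard annotations are enough to split a span's latency into server time and network time. The sketch below assumes annotations arrive as dictionaries with hypothetical `value` and `timestamp_ms` keys and uses the short names cs, sr, ss, and cr.

```python
def span_timings(annotations):
    """Derive durations from the cs/sr/ss/cr annotations of one span."""
    ts = {a["value"]: a["timestamp_ms"] for a in annotations}
    total = ts["cr"] - ts["cs"]    # measured on the client's clock
    server = ts["ss"] - ts["sr"]   # measured on the server's clock
    return {
        "total_ms": total,
        "server_ms": server,
        "network_ms": total - server,  # request + response transit time
    }


annotations = [
    {"value": "cs", "timestamp_ms": 1000},
    {"value": "sr", "timestamp_ms": 1010},
    {"value": "ss", "timestamp_ms": 1050},
    {"value": "cr", "timestamp_ms": 1065},
]
timings = span_timings(annotations)
```

Each subtraction uses two timestamps taken on the same host, so the result does not depend on the client and server clocks being synchronized.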

Tracing System Options

Major internet companies have built their own tracing systems: Google Dapper, Twitter Zipkin, Alibaba Eagle Eye, JD Hydra, etc.

Google Dapper

Design goals: low overhead, application‑level transparency, scalability. Data flow: services write spans to local logs → Dapper daemon pulls logs → collector writes to Bigtable.

Alibaba Eagle Eye

Known mainly through Alibaba's internal sharing sessions rather than published code. It writes trace data to local log files and relies on background agents for collection, which keeps the performance impact low but requires an agent on every server.

Alibaba EDAS + ARMS

EDAS handles application control, while ARMS focuses on business-level monitoring; together they form a multi-layered ("three-dimensional") monitoring system.

Dianping CAT

Simple architecture that implements all trace functions; uses Transaction as the primary event type, supports nested transactions, and integrates with internal RPC frameworks.

JD Hydra

Integrates with Dubbo and supports adaptive sampling (e.g., sampling 10% of requests when QPS exceeds 100). Uses HBase for storage; maintenance of the project stopped in 2013.
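The adaptive sampling idea can be illustrated with a per-second counter. This is a sketch of the general technique, not Hydra's actual algorithm: the class name, the one-second window, and the "keep every tenth request above the threshold" rule are all assumptions chosen to match the 10% and QPS 100 figures above.

```python
class AdaptiveSampler:
    """Samples every request up to a per-second threshold, then 1 in N."""

    def __init__(self, qps_threshold=100, keep_one_in=10):
        self.qps_threshold = qps_threshold
        self.keep_one_in = keep_one_in
        self._second = None  # the one-second window currently being counted
        self._seen = 0       # requests seen in that window

    def should_sample(self, now_s):
        second = int(now_s)
        if second != self._second:
            # A new window starts: reset the request counter.
            self._second = second
            self._seen = 0
        self._seen += 1
        if self._seen <= self.qps_threshold:
            return True
        # Above the threshold, keep only every Nth additional request.
        excess = self._seen - self.qps_threshold
        return excess % self.keep_one_in == 0


sampler = AdaptiveSampler()
# 300 requests within one second, then 50 requests within another.
busy_second = sum(sampler.should_sample(1000.0 + i / 300) for i in range(300))
quiet_second = sum(sampler.should_sample(2000.0 + i / 50) for i in range(50))
```

In the busy second the first 100 requests are kept and 20 of the remaining 200 are sampled; in the quiet second nothing is dropped.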

Twitter OpenZipkin

Provides similar functionality to Hydra; open‑source community offers Scala, Java, Node, Go, Python, Ruby, C# clients. Uses Brave for Java instrumentation and stores data in Cassandra.

Current State of Trace Systems

Most solutions meet basic tracing needs, but many are closed‑source (Google, Alibaba) or unmaintained (JD, Dianping). OpenZipkin requires Scala and Finagle, raising integration costs. Organizations may either adopt an existing open‑source project (e.g., Zipkin, Pinpoint) or develop a custom solution, considering cross‑platform support (Java, .NET) and minimal intrusion.

Comparison: Zipkin vs. Pinpoint

Both are based on Google's Dapper paper. Pinpoint offers a full-stack monitoring suite (agents, collectors, storage, UI) and uses Java-agent bytecode injection, so applications need almost no code changes; it stores data in HBase. Zipkin focuses on the collector and storage, provides a Query API, and has client libraries for many languages, with Brave serving as its Java instrumentation library; it stores data in Cassandra.

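Integration difficulty: Brave (Zipkin) requires modest configuration in each application, while writing a Pinpoint agent plugin needs deeper knowledge of the target libraries. As rough relative scores, the article puts Brave at a development cost of 20 and an integration cost of 10, versus a development cost of 100 and an integration cost of 0 for Pinpoint. On development cost alone that is a 5:1 ratio in Brave's favor, but Pinpoint's zero integration cost spreads its up-front investment across every service that adopts it, so the outcome depends on how many services need to be instrumented.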
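To make the trade-off concrete, the sketch below plugs the four scores into a simple linear model. The model itself is an assumption: it treats the development score as a one-time cost and the integration score as a cost paid once per service, which the article does not state explicitly.

```python
def total_cost(development, integration_per_service, services):
    """One-time development cost plus a per-service integration cost."""
    return development + integration_per_service * services


# Relative scores from the text: (development, integration per service).
BRAVE = (20, 10)
PINPOINT = (100, 0)


def break_even_services():
    """Smallest service count at which Brave is no longer cheaper."""
    n = 0
    while total_cost(*BRAVE, n) < total_cost(*PINPOINT, n):
        n += 1
    return n


break_even = break_even_services()
cost_at_5 = (total_cost(*BRAVE, 5), total_cost(*PINPOINT, 5))
cost_at_20 = (total_cost(*BRAVE, 20), total_cost(*PINPOINT, 20))
```

Under this model Brave is cheaper for a handful of services, the two are level at eight, and Pinpoint is cheaper beyond that.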

Other considerations include log collection methods (direct send, queue, ElasticSearch), data cleaning pipelines (Logstash, Storm, Spark Streaming), storage back‑ends (MySQL, HBase, ElasticSearch), and UI choices (custom vs. third‑party).

Source: http://www.cnblogs.com/zhangs1986/p/8879744.html