Which APM Tool Wins? A Deep Comparison of Zipkin, SkyWalking, and Pinpoint
This article analyzes full‑link monitoring in micro‑service architectures, outlines the goals and functional modules of tracing systems, explains core concepts such as Span, Trace, and Annotation, and then compares Zipkin, SkyWalking, and Pinpoint across performance impact, scalability, data analysis depth, developer transparency, and topology visualization.
With micro‑service architectures becoming mainstream, a single request often traverses many services, possibly written in different languages and deployed across thousands of servers. To quickly locate and resolve failures, full‑link monitoring tools—originally inspired by Google’s Dapper paper—are required.
1. Goals and Requirements
The tracing component should have minimal performance overhead, be non‑intrusive to application code, scale horizontally, provide fast data analysis, and support rich dependency metrics.
2. Functional Modules
Typical full‑link monitoring systems consist of four modules:
Instrumentation and log generation (client/server or bidirectional).
Log collection and storage (often using a message queue as a buffer).
Analysis and aggregation of trace data (real‑time and offline).
Visualization and decision‑support dashboards.
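The hand-off between the first three modules can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the buffered channel stands in for the message queue, and `SpanLog`, `Buffer`, and `AverageByService` are hypothetical names, not part of any of the tools discussed here. The sketch is single-goroutine; a real buffer would need synchronization around the drop counter.

```go
package main

import "fmt"

// SpanLog is a hypothetical record emitted by the instrumentation module.
type SpanLog struct {
	TraceID    int64
	Service    string
	DurationMs int64
}

// Buffer stands in for the message queue between instrumentation and storage.
type Buffer struct {
	ch      chan SpanLog
	Dropped int // logs discarded because the buffer was full
}

func NewBuffer(size int) *Buffer { return &Buffer{ch: make(chan SpanLog, size)} }

// Offer enqueues without blocking. When the buffer is full the log is dropped,
// so tracing never stalls the application code that produced it.
func (b *Buffer) Offer(l SpanLog) bool {
	select {
	case b.ch <- l:
		return true
	default:
		b.Dropped++
		return false
	}
}

// Drain plays the role of the collection module: it empties the buffer.
func (b *Buffer) Drain() []SpanLog {
	var out []SpanLog
	for {
		select {
		case l := <-b.ch:
			out = append(out, l)
		default:
			return out
		}
	}
}

// AverageByService is a toy aggregation step: mean duration per service.
func AverageByService(logs []SpanLog) map[string]float64 {
	sum := map[string]int64{}
	count := map[string]int64{}
	for _, l := range logs {
		sum[l.Service] += l.DurationMs
		count[l.Service]++
	}
	avg := map[string]float64{}
	for service, total := range sum {
		avg[service] = float64(total) / float64(count[service])
	}
	return avg
}

func main() {
	buf := NewBuffer(2)
	buf.Offer(SpanLog{1, "A", 10})
	buf.Offer(SpanLog{1, "A", 30})
	buf.Offer(SpanLog{2, "B", 5}) // buffer is full, so this log is dropped
	fmt.Println(AverageByService(buf.Drain()), "dropped:", buf.Dropped)
}
```

Dropping on overflow is one possible policy; it trades completeness of trace data for a bounded impact on the monitored service.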
3. Core Concepts from Google Dapper
3.1 Span
A Span represents a single unit of work in a trace and is identified by a 64‑bit ID. It contains fields such as TraceID, SpanID, ParentID, name, timestamps, annotations, and optional debug flags.
type Span struct {
    TraceID    int64        // identifies the whole request
    Name       string
    ID         int64        // current span ID
    ParentID   int64        // parent span ID; the root span has no parent
    Annotation []Annotation // timestamps and events
    Debug      bool
}
3.2 Trace
A Trace is a tree of Spans that together represent the complete request flow from client request to final response, identified by a unique TraceID.
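To make the tree structure concrete, here is a minimal sketch of how a collector might rebuild one trace from an unordered batch of spans. It assumes that a `ParentID` of 0 marks the root and that span IDs are non-zero; both are conventions of this example, and `RenderTrace` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Span keeps only the fields needed to rebuild the tree.
// ParentID == 0 marks the root here, which is an assumption of this sketch.
type Span struct {
	TraceID  int64
	ID       int64
	ParentID int64
	Name     string
}

// childrenByParent indexes the spans of one trace by their ParentID.
func childrenByParent(spans []Span, traceID int64) map[int64][]Span {
	idx := map[int64][]Span{}
	for _, s := range spans {
		if s.TraceID == traceID {
			idx[s.ParentID] = append(idx[s.ParentID], s)
		}
	}
	for _, kids := range idx {
		// Sort siblings by ID so the output is deterministic.
		sort.Slice(kids, func(i, j int) bool { return kids[i].ID < kids[j].ID })
	}
	return idx
}

// RenderTrace returns an indented outline of the trace, one span per line.
// Span IDs are assumed to be non-zero, otherwise the walk would not terminate.
func RenderTrace(spans []Span, traceID int64) string {
	idx := childrenByParent(spans, traceID)
	var b strings.Builder
	var walk func(parent int64, depth int)
	walk = func(parent int64, depth int) {
		for _, s := range idx[parent] {
			fmt.Fprintf(&b, "%s%s (span %d)\n", strings.Repeat("  ", depth), s.Name, s.ID)
			walk(s.ID, depth+1)
		}
	}
	walk(0, 0)
	return b.String()
}

func main() {
	spans := []Span{
		{TraceID: 7, ID: 3, ParentID: 1, Name: "query-db"},
		{TraceID: 7, ID: 1, ParentID: 0, Name: "GET /order"},
		{TraceID: 7, ID: 2, ParentID: 1, Name: "call-inventory"},
		{TraceID: 8, ID: 9, ParentID: 0, Name: "other-request"},
	}
	fmt.Print(RenderTrace(spans, 7))
}
```

Spans from trace 8 are ignored, which shows how the TraceID separates concurrent requests that arrive interleaved at the collector.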
3.3 Annotation
Annotations record specific events within a Span. The four core types are cs (Client Send), sr (Server Receive), ss (Server Send), and cr (Client Receive).
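type Annotation struct {
    Timestamp int64    // when the event occurred
    Value     string   // event type, e.g. cs, sr, ss, cr
    Host      Endpoint // where the event was recorded (Endpoint is not defined in this excerpt)
    Duration  int32
}
3.4 Call Example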
When a user request reaches front‑end service A, it may invoke services B and C via RPC. Service B returns immediately, while C further calls D and E before responding. The entire flow is captured by a global TraceID and a hierarchy of SpanIDs.
4. Deployment Architecture
Agents can be deployed without code changes. Two main agent types exist:
In‑process Java agents that instrument methods via the JVM’s javaagent mechanism.
Cross‑service agents that provide plugins for popular RPC frameworks (Dubbo, REST, custom RPC).
Supported plugins include:
Dubbo
REST
Custom RPC
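Go has no equivalent of the JVM's javaagent mechanism, but the plugin idea (intercepting calls at the framework boundary so that business code stays untouched) can be illustrated with a standard `net/http` middleware. This is an analogy rather than how any of the three tools work internally; `Recorder`, `Record`, and `Simulate` are hypothetical names, and the recorder is not safe for concurrent requests.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Record is what the wrapper captures for each request.
type Record struct {
	Path     string
	Duration time.Duration
}

// Recorder collects records in memory (not safe for concurrent use).
type Recorder struct{ Records []Record }

// Wrap returns a handler that times every request passing through next.
func (r *Recorder) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, req)
		r.Records = append(r.Records, Record{Path: req.URL.Path, Duration: time.Since(start)})
	})
}

// businessHandler is application code that knows nothing about tracing.
func businessHandler(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok")
}

// Simulate sends one in-memory request through the wrapped handler.
func Simulate(path string) (*Recorder, string) {
	rec := &Recorder{}
	h := rec.Wrap(http.HandlerFunc(businessHandler))
	w := httptest.NewRecorder()
	h.ServeHTTP(w, httptest.NewRequest(http.MethodGet, path, nil))
	return rec, w.Body.String()
}

func main() {
	rec, body := Simulate("/orders")
	fmt.Println(body, rec.Records[0].Path)
}
```

The only change to the application is the single call to `Wrap` where the handler is registered, which is the kind of framework-level hook that RPC plugins rely on.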
5. Benefits of Full‑Link Monitoring
Accurate visibility of production deployments.
Identification and optimization of critical call paths.
Quantifiable performance data for IT operations.
Rapid pinpointing of code‑level performance issues.
Support for white‑box testing and reduced time‑to‑stability.
6. Solution Comparison
The three open‑source APM solutions examined are Zipkin (Twitter), Pinpoint (Naver), and SkyWalking (Apache). The comparison focuses on five dimensions:
Probe performance impact.
Collector scalability.
Depth of call‑chain data analysis.
Developer transparency and ease of enable/disable.
Automatic topology discovery.
6.1 Probe Performance
A Spring‑based benchmark application (Spring Boot, MVC, Redis, MySQL) was load‑tested with JMeter at 500, 750, and 1000 concurrent users to measure the throughput impact of each probe. SkyWalking showed the smallest throughput loss and Zipkin a moderate one, while Pinpoint reduced throughput significantly (for example, from 1385 TPS to 774 TPS at 500 users). CPU and memory overhead stayed around 10% for all three.
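For scale, the Pinpoint figures quoted above correspond to a relative throughput loss of roughly 44%. The small helper below shows the calculation; `ThroughputLossPercent` is a hypothetical name used only for this worked example.

```go
package main

import "fmt"

// ThroughputLossPercent returns the relative drop from baseline to measured TPS.
func ThroughputLossPercent(baseline, measured float64) float64 {
	return (baseline - measured) / baseline * 100
}

func main() {
	// (1385 - 774) / 1385 = 611 / 1385
	fmt.Printf("%.1f%%\n", ThroughputLossPercent(1385, 774)) // prints "44.1%"
}
```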
6.2 Collector Scalability
All three support horizontal scaling. Zipkin can run multiple server instances consuming messages from a queue. SkyWalking’s collector works in single‑node or cluster mode via gRPC. Pinpoint uses Thrift and also supports clustered deployment.
6.3 Data Analysis Depth
Zipkin provides service‑level latency but lacks fine‑grained method details. SkyWalking instruments more than 20 middleware components and frameworks (Dubbo, OkHttp, DB, MQ) and shows richer call graphs. Pinpoint records the most detailed data, including SQL statements and method‑level spans, offering the deepest visibility.
6.4 Developer Transparency
Zipkin requires code changes or library integration (Brave). SkyWalking and Pinpoint rely on byte‑code instrumentation, so no source modifications are needed. Pinpoint’s Java agent is completely non‑intrusive, while Zipkin’s approach can be more invasive.
6.5 Topology Visualization
All three generate service‑level topology maps. Pinpoint’s UI shows detailed DB and method nodes, Zipkin’s view is limited to service‑to‑service links, and SkyWalking offers a middle ground with extensive middleware support.
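A service-level topology map can be derived from span data alone: every parent-child pair whose spans belong to different services is a dependency edge. The sketch below illustrates that idea under simplifying assumptions (all spans are in memory and each carries a service name); `Topology` and `Edge` are hypothetical names and do not reflect how any of the three tools compute their maps.

```go
package main

import (
	"fmt"
	"sort"
)

// Span keeps only what is needed to derive dependencies.
type Span struct {
	ID, ParentID int64
	Service      string
}

// Edge is a directed caller -> callee dependency between two services.
type Edge struct{ From, To string }

// Topology counts calls along each edge between distinct services.
func Topology(spans []Span) map[Edge]int {
	byID := map[int64]Span{}
	for _, s := range spans {
		byID[s.ID] = s
	}
	edges := map[Edge]int{}
	for _, s := range spans {
		parent, ok := byID[s.ParentID]
		if !ok || parent.Service == s.Service {
			continue // root span, missing parent, or an in-process child
		}
		edges[Edge{parent.Service, s.Service}]++
	}
	return edges
}

func main() {
	edges := Topology([]Span{
		{1, 0, "A"}, {2, 1, "B"}, {3, 1, "C"}, {4, 3, "D"}, {5, 3, "E"},
	})
	keys := make([]Edge, 0, len(edges))
	for e := range edges {
		keys = append(keys, e)
	}
	// Sort edges so the listing is stable across runs.
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].From != keys[j].From {
			return keys[i].From < keys[j].From
		}
		return keys[i].To < keys[j].To
	})
	for _, e := range keys {
		fmt.Printf("%s -> %s (%d calls)\n", e.From, e.To, edges[e])
	}
}
```

How detailed the resulting map is depends on what the probes record: if database or method calls are captured as spans with their own node names, they appear as additional nodes.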
6.6 Pinpoint vs. Zipkin Detailed Comparison
Pinpoint provides a full APM stack (probe, collector, storage, UI) whereas Zipkin focuses on collection and storage with a lighter UI. Pinpoint uses Java agents for byte‑code injection, offering deeper data (additional SpanEvent layer) but requires more expertise to develop custom plugins. Zipkin’s Brave library offers a simpler API and broader language support but needs explicit code integration.
7. Tracing vs. Monitoring
Monitoring collects system‑level metrics (CPU, memory, network) and application‑level metrics (QPS, latency, error rates) to detect anomalies. Tracing builds on monitoring by capturing the full call chain, enabling root‑cause analysis before incidents become visible.
8. Conclusion
In the short term, Pinpoint excels with zero‑code deployment, method‑level granularity, and a powerful UI. However, its ecosystem is smaller, its storage relies on HBase, and extending it to new frameworks can be costly. Zipkin benefits from a large community, simple REST/JSON interfaces, and easier integration, though it provides coarser data. SkyWalking offers a balanced solution with moderate overhead, broad middleware support, and good scalability. Teams should choose based on required granularity, existing technology stack, and long‑term maintenance considerations.
IT Architects Alliance
