Which APM Tool Wins? Deep Dive into Zipkin, Pinpoint, and SkyWalking
With micro‑service architectures generating complex call chains across thousands of servers, this article analyzes full‑link monitoring concepts, outlines essential requirements, details core components like spans and traces, and compares three major APM solutions—Zipkin, Pinpoint, and SkyWalking—evaluating probe impact, scalability, and data analysis capabilities.
Background
Micro‑service architectures split applications into many independent services that may be written in different languages, run on thousands of machines, and span multiple data centers. A single user request often traverses dozens of services, creating a complex distributed call chain.
Why Full‑Link Monitoring Is Needed
When a failure occurs, engineers must quickly locate the problematic service and understand performance bottlenecks. Full‑link (or end‑to‑end) monitoring provides a unified view of all calls, metrics such as TPS, latency, and error counts, and helps answer questions like:
How can problems be discovered quickly?
How far does the impact of a fault reach?
Which services depend on each other, and how strongly?
How does each link perform, and how much capacity does it need?
Key Requirements for a Full‑Link Monitoring Component
Low probe overhead: The tracing agent should consume minimal CPU and memory and have little effect on application throughput.
Low intrusiveness: Instrumentation should be transparent to developers and not require code changes.
Scalability: The collector must scale horizontally to handle large clusters.
Fast data analysis: Real‑time metrics and multi‑dimensional analysis are essential.
Functional Modules
Instrumentation and log generation: Client‑side, server‑side, or bi‑directional tracing points generate logs containing TraceId, SpanId, timestamps, service name, latency, result, and error information.
Log collection and storage: Agents send logs to daemons, which forward them to multi‑level collectors (pub/sub style) and store them for real‑time and offline analysis.
Analysis and aggregation: Span data are grouped by TraceId, ordered by time to form a timeline, and assembled into a call stack. Dependency metrics, such as strong, high, and frequent dependencies, are derived from the assembled traces.
Visualization and decision support: Dashboards display per‑stage latency, dependency graphs, and alerts.
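The analysis and aggregation step can be sketched in a few lines of Go. This is a minimal illustration, not the implementation of any of the tools discussed here: the SpanRecord type, its field names, and the microsecond units are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// SpanRecord is a hypothetical flattened log entry emitted by an agent.
type SpanRecord struct {
	TraceID   int64
	SpanID    int64
	Service   string
	StartUs   int64 // start time in microseconds (assumed unit)
	LatencyUs int64 // elapsed time in microseconds (assumed unit)
}

// GroupByTrace buckets records by TraceID and sorts each bucket by start
// time, which yields one timeline per request.
func GroupByTrace(records []SpanRecord) map[int64][]SpanRecord {
	timelines := make(map[int64][]SpanRecord)
	for _, r := range records {
		timelines[r.TraceID] = append(timelines[r.TraceID], r)
	}
	for _, spans := range timelines {
		// Sorting in place reorders the backing array shared with the map entry.
		sort.Slice(spans, func(i, j int) bool {
			return spans[i].StartUs < spans[j].StartUs
		})
	}
	return timelines
}

func main() {
	records := []SpanRecord{
		{TraceID: 1, SpanID: 12, Service: "C", StartUs: 300, LatencyUs: 40},
		{TraceID: 2, SpanID: 20, Service: "A", StartUs: 150, LatencyUs: 10},
		{TraceID: 1, SpanID: 10, Service: "A", StartUs: 100, LatencyUs: 400},
		{TraceID: 1, SpanID: 11, Service: "B", StartUs: 200, LatencyUs: 20},
	}
	for _, s := range GroupByTrace(records)[1] {
		fmt.Printf("%s start=%d latency=%dus\n", s.Service, s.StartUs, s.LatencyUs)
	}
}
```

In a production pipeline this grouping would typically happen in a stream processor or at query time rather than in memory, but the logic is the same: key by TraceId, then order by timestamp.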
Core Concepts from Google Dapper
Span
A Span represents a single work unit (e.g., an RPC or DB call) and is identified by a 64‑bit ID. It contains fields such as TraceID, Name, ID, ParentID, Annotations, and a debug flag.
type Span struct {
    TraceID    int64        // identifies the whole request
    Name       string       // name of the operation, e.g. the RPC method
    ID         int64        // current span ID
    ParentID   int64        // parent span ID (unset for the root span)
    Annotation []Annotation // timestamped events
    Debug      bool         // debug flag
}
Trace
A Trace is a tree of Spans that represents the complete execution of a request from entry to exit, identified by a unique TraceID.
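To make the tree structure concrete, the following sketch assembles the spans of one trace into a tree by following ParentID links. It is only an illustration: the Node type and the BuildTree and Render helpers are hypothetical, and using 0 as the "no parent" marker is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a trimmed-down version of the span structure shown earlier.
type Span struct {
	TraceID  int64
	Name     string
	ID       int64
	ParentID int64 // 0 marks the root span in this sketch (an assumption)
}

// Node is a hypothetical tree node wrapping a span.
type Node struct {
	Span     Span
	Children []*Node
}

// BuildTree links the spans of one trace into a tree using ParentID.
// It returns nil if no root span is present.
func BuildTree(spans []Span) *Node {
	nodes := make(map[int64]*Node, len(spans))
	for _, s := range spans {
		nodes[s.ID] = &Node{Span: s}
	}
	var root *Node
	for _, s := range spans {
		n := nodes[s.ID]
		if s.ParentID == 0 {
			root = n
			continue
		}
		// Spans whose parent was never collected are skipped.
		if parent, ok := nodes[s.ParentID]; ok {
			parent.Children = append(parent.Children, n)
		}
	}
	return root
}

// Render returns an indented view of the tree, one span per line.
func Render(n *Node, depth int) string {
	if n == nil {
		return ""
	}
	out := strings.Repeat("  ", depth) + n.Span.Name + "\n"
	for _, c := range n.Children {
		out += Render(c, depth+1)
	}
	return out
}

func main() {
	spans := []Span{
		{TraceID: 1, Name: "A", ID: 1},
		{TraceID: 1, Name: "B", ID: 2, ParentID: 1},
		{TraceID: 1, Name: "C", ID: 3, ParentID: 1},
		{TraceID: 1, Name: "D", ID: 4, ParentID: 3},
		{TraceID: 1, Name: "E", ID: 5, ParentID: 3},
	}
	fmt.Print(Render(BuildTree(spans), 0))
}
```

Children appear in the order their spans were supplied, so sorting the input by timestamp first would give a time-ordered call stack.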
Annotation
Annotations record specific events within a Span, such as client start (cs), server receive (sr), server send (ss), and client receive (cr).
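As a rough illustration of how these four events are used, the sketch below splits one RPC's latency into server processing time and everything else. The Timings type, its field names, and the microsecond units are assumptions for this example. Each difference is taken between two timestamps recorded on the same host, so clock skew between client and server does not affect the result.

```go
package main

import "fmt"

// Timings holds the four annotation timestamps of one RPC in microseconds.
// The type and its field names are hypothetical.
type Timings struct {
	CS int64 // client send
	SR int64 // server receive
	SS int64 // server send
	CR int64 // client receive
}

// Breakdown splits the client-observed latency into server processing time
// and the remainder (network transfer, queuing, serialization).
func Breakdown(t Timings) (total, server, network int64) {
	total = t.CR - t.CS   // both timestamps come from the client clock
	server = t.SS - t.SR  // both timestamps come from the server clock
	network = total - server
	return total, server, network
}

func main() {
	total, server, network := Breakdown(Timings{CS: 1000, SR: 1200, SS: 1900, CR: 2100})
	fmt.Printf("total=%dus server=%dus network=%dus\n", total, server, network)
}
```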
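type Annotation struct {
    Timestamp int64    // time at which the event occurred
    Value     string   // event name, e.g. "cs", "sr", "ss", "cr"
    Host      Endpoint // host that recorded the event
    Duration  int32    // duration associated with the event
}
Example Call Flow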
A user request reaches front‑end service A, which calls services B and C. Service B returns immediately, while C calls downstream services D and E before responding to A, which finally replies to the user. The tracing system generates a global TraceID, propagates SpanIDs, and records parent‑child relationships for the entire chain.
Typical Deployment Architecture
Agents generate trace logs, which are collected by Logstash and sent to Kafka. Kafka feeds data to downstream processors (e.g., Storm) that aggregate metrics and store them in Elasticsearch. Raw logs are also persisted in HBase for fast TraceID lookup.
Comparison of Three Open‑Source APM Solutions
Zipkin (Twitter): Uses HTTP or MQ for agent‑server communication, supports many languages through instrumentation libraries such as Brave (the Java library), stores data in Cassandra (Elasticsearch and MySQL are also supported).
Pinpoint (Naver): Java‑only agent with bytecode instrumentation, stores data in HBase, communicates via Thrift over UDP.
SkyWalking (Apache): Supports multiple languages, uses gRPC between agents and collectors, stores data in Elasticsearch.
Probe Performance
Benchmarks with a Spring‑Boot application (including Redis and MySQL) showed that SkyWalking’s probe has the smallest impact on throughput, Zipkin is moderate, and Pinpoint reduces throughput noticeably at 500 concurrent users.
Collector Scalability
All three solutions support clustered collectors. Zipkin can scale by adding more server instances that consume MQ topics. SkyWalking’s collector runs in single‑node or cluster mode using gRPC. Pinpoint’s collector uses Thrift and can also be deployed in a cluster.
Data Analysis Capability
Zipkin provides basic service‑level call graphs.
SkyWalking offers 20+ plugin integrations (Dubbo, OkHttp, DBs, MQ) and a richer UI.
Pinpoint records the most detailed data, including SQL statements and method‑level spans, and supports custom alerts.
Transparency and Ease of Enablement
Zipkin often requires code changes or library configuration. SkyWalking and Pinpoint rely on bytecode enhancement, allowing agents to be attached without modifying application code.
Topology Visualization
All three tools can automatically discover service topology. Pinpoint’s UI shows detailed DB‑level information, while Zipkin’s topology is limited to service‑to‑service links.
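Service‑to‑service topology of the kind described here can be derived directly from parent‑child span links. The sketch below is an illustration under stated assumptions, not how any of the three tools implements discovery: the Span and Edge types are hypothetical, span IDs are assumed to be unique across traces, and 0 is assumed to mean "no parent".

```go
package main

import (
	"fmt"
	"sort"
)

// Span carries only what topology discovery needs in this sketch.
type Span struct {
	ID       int64
	ParentID int64 // 0 for a root span (an assumption of this sketch)
	Service  string
}

// Edge is a directed caller -> callee link between two services.
type Edge struct{ From, To string }

// Topology counts how often each service calls each other service.
func Topology(spans []Span) map[Edge]int {
	serviceOf := make(map[int64]string, len(spans))
	for _, s := range spans {
		serviceOf[s.ID] = s.Service
	}
	edges := make(map[Edge]int)
	for _, s := range spans {
		caller, ok := serviceOf[s.ParentID]
		if !ok || caller == s.Service {
			continue // root span, uncollected parent, or in-process child span
		}
		edges[Edge{From: caller, To: s.Service}]++
	}
	return edges
}

func main() {
	spans := []Span{
		{ID: 1, Service: "A"},
		{ID: 2, ParentID: 1, Service: "B"},
		{ID: 3, ParentID: 1, Service: "C"},
		{ID: 4, ParentID: 3, Service: "D"},
		{ID: 5, Service: "A"},
		{ID: 6, ParentID: 5, Service: "B"},
	}
	edges := Topology(spans)
	keys := make([]Edge, 0, len(edges))
	for e := range edges {
		keys = append(keys, e)
	}
	// Sort edges so the printed graph is stable between runs.
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].From != keys[j].From {
			return keys[i].From < keys[j].From
		}
		return keys[i].To < keys[j].To
	})
	for _, e := range keys {
		fmt.Printf("%s -> %s (%d calls)\n", e.From, e.To, edges[e])
	}
}
```

Adding database or cache calls as spans with their own service names would extend the same graph to the DB‑level detail mentioned above.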
Detailed Zipkin vs. Pinpoint Comparison
Pinpoint provides a full APM stack (probe, collector, storage, UI); Zipkin focuses on collection and storage, ships a simpler query UI, and leaves instrumentation to separate client libraries.
Zipkin’s Brave library provides a Java API for instrumentation, and Zipkin has client libraries for other languages as well; Pinpoint’s agent is Java‑only.
Pinpoint stores data in HBase; Zipkin uses Cassandra.
Tracing vs. Monitoring
Monitoring collects system‑level (CPU, memory, network) and application‑level metrics (QPS, latency, error counts) to detect anomalies and trigger alerts. Tracing builds on call‑chain data to analyze performance, locate bottlenecks, and understand system behavior before failures occur.
References
http://bigbully.github.io/Dapper-translation/
https://github.com/naver/pinpoint/issues/1759
https://github.com/naver/pinpoint/issues/1760
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
