
Which APM Tool Wins? Deep Dive into Zipkin, Pinpoint, and SkyWalking

With micro‑service architectures generating complex call chains across thousands of servers, this article analyzes full‑link monitoring concepts, outlines essential requirements, details core components like spans and traces, and compares three major APM solutions—Zipkin, Pinpoint, and SkyWalking—evaluating probe impact, scalability, and data analysis capabilities.

Code Ape Tech Column

Background

Micro‑service architectures split applications into many independent services that may be written in different languages, run on thousands of machines, and span multiple data centers. A single user request often traverses dozens of services, creating a complex distributed call chain.

Why Full‑Link Monitoring Is Needed

When a failure occurs, engineers must quickly locate the problematic service and understand performance bottlenecks. Full‑link (or end‑to‑end) monitoring provides a unified view of all calls, metrics such as TPS, latency, and error counts, and helps answer questions like:

How can problems be discovered quickly?

How can the impact range of a fault be assessed?

How can service dependencies be evaluated?

How can link performance be analyzed and capacity planned?

Key Requirements for a Full‑Link Monitoring Component

Low probe overhead: The tracing agent must add minimal CPU, memory, and throughput impact.

Low intrusiveness: Instrumentation should be transparent to developers and not require code changes.

Scalability: The collector must scale horizontally to handle large clusters.

Fast data analysis: Real‑time metrics and multi‑dimensional analysis are essential.

Functional Modules

Instrumentation and log generation: Client‑side, server‑side, or bi‑directional tracing points generate logs containing TraceID, SpanID, timestamps, service name, latency, result, and error information.

Log collection and storage: Agents send logs to daemons, which forward them to multi‑level collectors (pub/sub style) and store them for real‑time and offline analysis.

Analysis and aggregation: Span data is grouped by TraceID and ordered by time to form a timeline, then assembled into a call stack. Dependency metrics (strong, high, frequent) are derived.

Visualization and decision support: Dashboards display per‑stage latency, dependency graphs, and alerts.
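The analysis step above (group by TraceID, then order by time) can be sketched as follows. This is a minimal illustration, not the implementation of any of the tools discussed here: the SpanLog record, its field names, and the GroupByTrace helper are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// SpanLog is a hypothetical flattened log record; the field names are
// illustrative and not taken from any specific tracing system.
type SpanLog struct {
	TraceID     int64
	SpanID      int64
	Service     string
	StartMicros int64 // start time in microseconds (assumed unit)
}

// GroupByTrace buckets span logs by TraceID and orders each bucket by
// start time, producing one timeline per request.
func GroupByTrace(logs []SpanLog) map[int64][]SpanLog {
	traces := make(map[int64][]SpanLog)
	for _, l := range logs {
		traces[l.TraceID] = append(traces[l.TraceID], l)
	}
	for _, spans := range traces {
		// Sorting through the slice header reorders the shared backing array,
		// so the slices stored in the map end up sorted as well.
		sort.SliceStable(spans, func(i, j int) bool {
			return spans[i].StartMicros < spans[j].StartMicros
		})
	}
	return traces
}

func main() {
	logs := []SpanLog{
		{TraceID: 1, SpanID: 3, Service: "C", StartMicros: 30},
		{TraceID: 2, SpanID: 1, Service: "A", StartMicros: 5},
		{TraceID: 1, SpanID: 1, Service: "A", StartMicros: 10},
		{TraceID: 1, SpanID: 2, Service: "B", StartMicros: 20},
	}
	for _, s := range GroupByTrace(logs)[1] {
		fmt.Println(s.Service, s.StartMicros)
	}
}
```

For trace 1 the loop prints the spans in timeline order: A, then B, then C.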

Core Concepts from Google Dapper

Span

A Span represents a single work unit (e.g., an RPC or DB call) and is identified by a 64‑bit ID. It contains fields such as TraceID, Name, ID, ParentID, Annotations, and a debug flag.

type Span struct {
    TraceID     int64        // identifies the whole request
    Name        string       // name of the operation
    ID          int64        // current span ID
    ParentID    int64        // parent span ID (0 stands for "no parent", i.e. the root)
    Annotations []Annotation // timestamped events
    Debug       bool         // debug flag
}

Trace

A Trace is a tree of Spans that represents the complete execution of a request from entry to exit, identified by a unique TraceID.
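To make the tree structure concrete, here is a small sketch that links spans by ParentID and renders the result with indentation. The Node type, the BuildTree and Render helpers, and the span names are hypothetical, and treating ParentID 0 as the root marker is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Span carries only the fields needed to rebuild the tree.
type Span struct {
	ID       int64
	ParentID int64 // 0 marks the root span in this sketch
	Name     string
}

// Node is a span together with its child spans.
type Node struct {
	Span     Span
	Children []*Node
}

// BuildTree links spans by ParentID and returns the root node,
// or nil when no root span is present.
func BuildTree(spans []Span) *Node {
	nodes := make(map[int64]*Node, len(spans))
	for _, s := range spans {
		nodes[s.ID] = &Node{Span: s}
	}
	var root *Node
	for _, s := range spans {
		n := nodes[s.ID]
		if parent, ok := nodes[s.ParentID]; ok {
			parent.Children = append(parent.Children, n)
		} else if s.ParentID == 0 {
			root = n
		}
	}
	return root
}

// Render returns an indented text view of the tree, one span per line.
func Render(n *Node, depth int) string {
	if n == nil {
		return ""
	}
	out := strings.Repeat("  ", depth) + n.Span.Name + "\n"
	for _, c := range n.Children {
		out += Render(c, depth+1)
	}
	return out
}

func main() {
	spans := []Span{
		{ID: 1, ParentID: 0, Name: "GET /checkout"},
		{ID: 2, ParentID: 1, Name: "cart.Get"},
		{ID: 3, ParentID: 1, Name: "payment.Charge"},
		{ID: 4, ParentID: 3, Name: "db.Query"},
	}
	fmt.Print(Render(BuildTree(spans), 0))
}
```

Children appear in the order the spans were supplied, so a real collector would usually sort them by start time first.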

Annotation

Annotations record specific events within a Span, such as client start (cs), server receive (sr), server send (ss), and client receive (cr).

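// Endpoint is not defined in the original; this minimal shape is an assumption.
type Endpoint struct {
    ServiceName string
    Address     string // host:port
}

type Annotation struct {
    Timestamp int64    // when the event occurred
    Value     string   // event name, e.g. "cs", "sr", "ss", "cr"
    Host      Endpoint // endpoint that recorded the event
    Duration  int32    // duration associated with the event, if any
}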

Example Call Flow

A user request reaches front‑end service A, which calls services B and C. Service B returns immediately, while C calls downstream services D and E before responding to A, which finally replies to the user. The tracing system generates a global TraceID, propagates SpanIDs, and records parent‑child relationships for the entire chain.
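The call flow above can be simulated in-process to see which IDs end up where. This is a toy sketch: the Tracer type, its Call method, and SimulateRequest are hypothetical, and span IDs are handed out sequentially for readability, whereas real tracers typically generate random 64-bit IDs.

```go
package main

import "fmt"

// Record is what each service would log for its span.
type Record struct {
	TraceID  int64
	SpanID   int64
	ParentID int64 // 0 means the span has no parent
	Service  string
}

// Tracer hands out span IDs and collects the records of one process.
type Tracer struct {
	nextID  int64
	Records []Record
}

// Call simulates invoking a service: it opens a span under parentID,
// records it, and then runs the downstream calls with the new span ID.
func (t *Tracer) Call(traceID, parentID int64, service string, downstream func(spanID int64)) {
	t.nextID++
	spanID := t.nextID
	t.Records = append(t.Records, Record{
		TraceID: traceID, SpanID: spanID, ParentID: parentID, Service: service,
	})
	if downstream != nil {
		downstream(spanID)
	}
}

// SimulateRequest replays the flow: A calls B and C, and C calls D and E.
func SimulateRequest(traceID int64) []Record {
	t := &Tracer{}
	t.Call(traceID, 0, "A", func(a int64) {
		t.Call(traceID, a, "B", nil)
		t.Call(traceID, a, "C", func(c int64) {
			t.Call(traceID, c, "D", nil)
			t.Call(traceID, c, "E", nil)
		})
	})
	return t.Records
}

func main() {
	for _, r := range SimulateRequest(42) {
		fmt.Printf("trace=%d span=%d parent=%d service=%s\n",
			r.TraceID, r.SpanID, r.ParentID, r.Service)
	}
}
```

Every record shares the same TraceID, B and C point to A's span as their parent, and D and E point to C's span.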

Typical Deployment Architecture

Agents generate trace logs, which are collected by Logstash and sent to Kafka. Kafka feeds data to downstream processors (e.g., Storm) that aggregate metrics and store them in Elasticsearch. Raw logs are also persisted in HBase for fast TraceID lookup.
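The aggregation stage of such a pipeline boils down to folding span records into per-service counters. The sketch below shows that step on an in-memory slice. The CallRecord and Stats types are hypothetical, and a real stream processor would apply the same logic per time window rather than over a fixed batch.

```go
package main

import "fmt"

// CallRecord is a hypothetical, already-parsed span log entry.
type CallRecord struct {
	Service   string
	LatencyMs float64
	Failed    bool
}

// Stats accumulates the per-service metrics mentioned in the article:
// call count, error count, and latency.
type Stats struct {
	Calls   int
	Errors  int
	TotalMs float64
}

// AvgMs returns the mean latency, or 0 when no calls were recorded.
func (s Stats) AvgMs() float64 {
	if s.Calls == 0 {
		return 0
	}
	return s.TotalMs / float64(s.Calls)
}

// Aggregate folds a batch of records into per-service statistics.
func Aggregate(records []CallRecord) map[string]Stats {
	out := make(map[string]Stats)
	for _, r := range records {
		s := out[r.Service]
		s.Calls++
		s.TotalMs += r.LatencyMs
		if r.Failed {
			s.Errors++
		}
		out[r.Service] = s
	}
	return out
}

func main() {
	stats := Aggregate([]CallRecord{
		{Service: "orders", LatencyMs: 10},
		{Service: "orders", LatencyMs: 30, Failed: true},
		{Service: "users", LatencyMs: 5},
	})
	s := stats["orders"]
	fmt.Printf("calls=%d errors=%d avg=%.1fms\n", s.Calls, s.Errors, s.AvgMs())
	// prints: calls=2 errors=1 avg=20.0ms
}
```

Dividing Calls by the window length in seconds would give the TPS figure for that window.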

Comparison of Three Open‑Source APM Solutions

Zipkin (Twitter): Uses HTTP or MQ for agent‑server communication, supports many languages through instrumentation libraries (Brave is the Java one), and commonly stores data in Cassandra.

Pinpoint (Naver): Java‑only agent with bytecode instrumentation, stores data in HBase, communicates via Thrift over UDP.

SkyWalking (Apache): Supports multiple languages, uses gRPC between agents and collectors, stores data in Elasticsearch.

Probe Performance

Benchmarks with a Spring‑Boot application (including Redis and MySQL) showed that SkyWalking’s probe has the smallest impact on throughput, Zipkin is moderate, and Pinpoint reduces throughput noticeably at 500 concurrent users.

Collector Scalability

All three solutions support clustered collectors. Zipkin can scale by adding more server instances that consume MQ topics. SkyWalking’s collector runs in single‑node or cluster mode using gRPC. Pinpoint’s collector uses Thrift and can also be deployed in a cluster.

Data Analysis Capability

Zipkin provides basic service‑level call graphs.

SkyWalking offers 20+ plugin integrations (Dubbo, OkHttp, DBs, MQ) and a richer UI.

Pinpoint records the most detailed data, including SQL statements and method‑level spans, and supports custom alerts.

Transparency and Ease of Enablement

Zipkin often requires code changes or library configuration. SkyWalking and Pinpoint rely on bytecode enhancement, allowing agents to be attached without modifying application code.

Topology Visualization

All three tools can automatically discover service topology. Pinpoint’s UI shows detailed DB‑level information, while Zipkin’s topology is limited to service‑to‑service links.

Detailed Zipkin vs. Pinpoint Comparison

Pinpoint provides a full APM stack (probe, collector, storage, UI); Zipkin focuses on collection and storage.

Zipkin has instrumentation libraries for many languages, with Brave as its Java library; Pinpoint’s agent is Java‑only.

Pinpoint stores data in HBase; Zipkin uses Cassandra.

Tracing vs. Monitoring

Monitoring collects system‑level (CPU, memory, network) and application‑level metrics (QPS, latency, error counts) to detect anomalies and trigger alerts. Tracing builds on call‑chain data to analyze performance, locate bottlenecks, and understand system behavior before failures occur.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
