
Mastering Distributed Tracing: From Dapper to Zipkin and OpenTracing

This article explains the fundamentals of distributed tracing, covering the original Dapper concepts, the architecture and data model of Zipkin, sampling strategies, storage mechanisms, and how OpenTracing provides a vendor‑neutral API for integrating tracing into microservice systems.


Why Distributed Tracing?

Modern microservice architectures consist of many distributed components, making it difficult to locate failures when a request traverses multiple services. Distributed tracing reconstructs a request’s call chain, showing per‑service latency, target machines, and request status.

Dapper – The Origin

Dapper, introduced by Google, defines the concepts, data representation, instrumentation, propagation, collection, storage, and visualization for tracing in microservice systems. Major tracing systems such as Zipkin, Jaeger, Alibaba Eagle Eye, and Meituan Mtrace are inspired by Dapper.

Trace, Span, and Annotations

A Trace represents the entire request path across services and is identified by a globally unique trace_id. A Span denotes a single operation within the trace; spans form a parent-child tree in which each span has a unique span_id and records the span_id of its parent (the root span has no parent). Annotations are timestamped events attached to spans; in addition to the standard timing events, developers can add custom markers such as "foo".

Spans are linked by parent‑child relationships; the edge between spans represents an RPC call. Each span records client‑send (cs), server‑receive (sr), server‑send (ss), and client‑receive (cr) events, forming a complete call record.
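
To make the timing model concrete, here is a minimal sketch (not Dapper's actual implementation; the Span class and the numbers are purely illustrative) of a span carrying the four standard annotations and the latencies that can be derived from them:

from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical, simplified span record for illustration only.
@dataclass
class Span:
    trace_id: str               # shared by every span in the trace (in-band)
    span_id: str                # unique within the trace
    parent_id: Optional[str]    # None for the root span
    name: str                   # operation name, e.g. "GET /orders"
    annotations: Dict[str, int] = field(default_factory=dict)  # event -> epoch micros

# One RPC seen from both sides, recorded as cs/sr/ss/cr annotations.
span = Span("trace-1", "span-2", "span-1", "GET /orders",
            annotations={"cs": 1_000, "sr": 1_030, "ss": 1_430, "cr": 1_470})

total_time   = span.annotations["cr"] - span.annotations["cs"]   # 470 us seen by the caller
server_time  = span.annotations["ss"] - span.annotations["sr"]   # 400 us spent in the callee
network_time = total_time - server_time                          # 70 us on the wire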

In‑band vs. Out‑of‑band Data

In-band data (trace_id, span_id, parent_id) travels with the request across service boundaries, allowing each hop to continue the same trace and the full call chain to be reconstructed. Out-of-band data (the timing annotations such as cs and ss, plus any tags) is generated locally at each node and reported asynchronously to a central store.
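
A rough sketch of that split, assuming an HTTP-based RPC (the header names follow the B3 convention described later, and the local buffer stands in for whatever reporting mechanism a real tracer would use):

import time
import urllib.request
from typing import List

out_of_band_buffer: List[dict] = []   # spans collected locally, reported asynchronously later

def traced_call(url: str, trace_id: str, span_id: str, parent_id: str) -> bytes:
    # In-band data: travels with the request so the callee can continue the same trace.
    request = urllib.request.Request(url, headers={
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-ParentSpanId": parent_id,
    })
    cs = time.time_ns() // 1000                       # client-send timestamp (epoch micros)
    with urllib.request.urlopen(request) as response:
        body = response.read()
    cr = time.time_ns() // 1000                       # client-receive timestamp

    # Out-of-band data: stays on this node and is buffered for the tracing backend;
    # it never rides along with the business request.
    out_of_band_buffer.append({
        "traceId": trace_id, "id": span_id, "parentId": parent_id,
        "annotations": [{"value": "cs", "timestamp": cs},
                        {"value": "cr", "timestamp": cr}],
    })
    return body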

Sampling

To reduce overhead, Dapper samples traces rather than reporting every one, and the sampling rate can be adjusted adaptively. The sampling decision is made once at the trace root and propagated in-band so that every service in the chain makes the same choice: a trace marked Accept is reported, Deny is discarded, and Debug forces reporting regardless of the sampling rate, which is useful during development.
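
A probabilistic sampler along these lines, shown only as an illustration and not as Dapper's adaptive algorithm, makes the Accept/Deny decision once and lets Debug override it:

import random
from enum import Enum

class SamplingState(Enum):
    ACCEPT = "accept"   # report spans for this trace
    DENY = "deny"       # drop spans for this trace
    DEBUG = "debug"     # always report, regardless of the sampling rate

def decide(sample_rate: float, debug: bool = False) -> SamplingState:
    """Decide once at the trace root; the result is propagated in-band
    so every service in the chain makes the same choice."""
    if debug:
        return SamplingState.DEBUG
    return SamplingState.ACCEPT if random.random() < sample_rate else SamplingState.DENY

def should_report(state: SamplingState) -> bool:
    return state in (SamplingState.ACCEPT, SamplingState.DEBUG)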

Storage

Dapper stores span data in Google Bigtable: each trace occupies one row keyed by trace_id, and each span within it occupies a sparse column, so a single row read retrieves an entire call chain. Collectors can write spans independently as they arrive, which keeps collection stateless and queries fast.
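
The layout can be pictured as a map from trace_id to that trace's spans; the toy in-memory stand-in below is only an analogy for the Bigtable row-per-trace design:

from collections import defaultdict
from typing import Dict, List

# Toy in-memory stand-in: row key = trace_id, row value = that trace's spans.
trace_store: Dict[str, List[dict]] = defaultdict(list)

def write_span(span: dict) -> None:
    trace_store[span["traceId"]].append(span)   # collectors can write rows independently

def read_trace(trace_id: str) -> List[dict]:
    return trace_store[trace_id]                # one "row" read returns the whole call chain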

Zipkin – An Open‑Source Dapper Implementation

Zipkin implements Dapper’s concepts and provides a plug‑and‑play tracing system. Its architecture consists of Reporter, Transport, Collector, Storage, API, and UI.

Reporter runs in each service, generating spans, propagating in‑band data, reporting out‑of‑band data, and handling sampling. Transport supports HTTP and Kafka. Collector receives out‑of‑band data and writes it to Storage, which can be in‑memory, MySQL, Cassandra, or Elasticsearch. API offers query and ingestion endpoints, while UI visualizes traces.
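
For the HTTP transport, a reporter can simply batch finished spans and POST them to the collector's /api/v2/spans endpoint. A minimal sketch, assuming the Python requests library and Zipkin's default address; the span payloads themselves follow the v2 JSON model described in the next section:

import requests

ZIPKIN_ENDPOINT = "http://localhost:9411/api/v2/spans"   # default collector address, adjust as needed

def report(spans: list) -> None:
    """Ship a batch of finished spans (Zipkin v2 JSON) to the collector over HTTP."""
    response = requests.post(ZIPKIN_ENDPOINT, json=spans)
    response.raise_for_status()     # the collector answers 202 Accepted on success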

Data Model (Zipkin v2)

The Span model includes fields such as trace_id, id, parent_id, kind, name, timestamp, duration, local_endpoint, remote_endpoint, annotations, tags, debug, and shared.

message Span {
  bytes trace_id = 1;
  bytes parent_id = 2;
  bytes id = 3;
  enum Kind {
    SPAN_KIND_UNSPECIFIED = 0;
    CLIENT = 1;
    SERVER = 2;
    PRODUCER = 3;
    CONSUMER = 4;
  }
  Kind kind = 4;
  string name = 5;
  fixed64 timestamp = 6;
  uint64 duration = 7;
  Endpoint local_endpoint = 8;
  Endpoint remote_endpoint = 9;
  repeated Annotation annotations = 10;
  map<string, string> tags = 11;
  bool debug = 12;
  bool shared = 13;
}

message Endpoint {
  string service_name = 1;
  bytes ipv4 = 2;
  bytes ipv6 = 3;
  int32 port = 4;
}

message Annotation {
  fixed64 timestamp = 1;
  string value = 2;
}
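
Encoded as the JSON accepted by the v2 API, a single span carrying these fields might look like the following (written here as a Python dict; the ids, timestamps, and endpoints are illustrative, and timestamps are epoch microseconds):

# An illustrative Zipkin v2 span; field names follow the JSON encoding of the model above.
span = {
    "traceId": "5af7183fb1d4cf5f",
    "parentId": "6b221d5bc9e6496c",
    "id": "352bff9a74ca9ad2",
    "kind": "CLIENT",
    "name": "get /orders",
    "timestamp": 1556604172355737,     # start time, epoch microseconds
    "duration": 1431,                  # microseconds
    "localEndpoint": {"serviceName": "frontend", "ipv4": "10.0.0.1", "port": 8080},
    "remoteEndpoint": {"serviceName": "orders", "ipv4": "10.0.0.2", "port": 9000},
    "annotations": [{"timestamp": 1556604172355737, "value": "cs"}],
    "tags": {"http.path": "/orders"},
    "debug": False,
    "shared": False,
}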

B3 Propagation and Sampling Flags

Zipkin uses b3‑propagation to carry TraceId, SpanId, ParentSpanId, and Sampled across services, typically via custom HTTP headers (e.g., X‑B3‑TraceId) or gRPC metadata.

Sampling flags can be Defer (undecided), Deny (drop), Accept (report), or Debug (force report).
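
In the multi-header form, a sampled outgoing request might carry headers like these (the id values are made up; Sampled and Flags encode the states above):

# Example B3 headers on an outgoing request (values are illustrative).
b3_headers = {
    "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",  # 64- or 128-bit id, hex encoded
    "X-B3-SpanId": "e457b5a2e4d86bd1",
    "X-B3-ParentSpanId": "05e3ac9a4f6e3b90",
    "X-B3-Sampled": "1",      # 1 = Accept, 0 = Deny; omitting the header means Defer
    # "X-B3-Flags": "1",      # set to 1 to force Debug (always report)
}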

Instrumentation and Reporting Flow

The following steps illustrate a simple three-service call chain; a sketch of the resulting spans follows the list:

1. Server-1 creates a root span (CLIENT) and propagates the trace data to Server-2.
2. Server-2 receives the request and creates a matching SERVER span.
3. Server-2 calls Server-3, generating a child CLIENT span.
4. Server-3 creates a SERVER span for the incoming request.
5. Server-3 records its duration and reports its span.
6. Server-2 records its span's duration after receiving the response.
7. Server-2 reports its span.
8. Server-1 records and reports its span after the final response.
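
Under Zipkin's usual model, the chain above produces two client/server span pairs, with the server side reusing the client's span id and setting shared. A sketch of the reported spans, with made-up ids and service names:

# Spans reported for the three-service chain (ids and names are illustrative).
reported_spans = [
    # Server-1 -> Server-2 RPC: one span id, seen from both sides.
    {"traceId": "t1", "id": "a", "parentId": None, "kind": "CLIENT",
     "localEndpoint": {"serviceName": "server-1"}},
    {"traceId": "t1", "id": "a", "parentId": None, "kind": "SERVER",
     "localEndpoint": {"serviceName": "server-2"}, "shared": True},
    # Server-2 -> Server-3 RPC: a child span of "a".
    {"traceId": "t1", "id": "b", "parentId": "a", "kind": "CLIENT",
     "localEndpoint": {"serviceName": "server-2"}},
    {"traceId": "t1", "id": "b", "parentId": "a", "kind": "SERVER",
     "localEndpoint": {"serviceName": "server-3"}, "shared": True},
]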

OpenTracing – Vendor‑Neutral API

OpenTracing defines a language‑agnostic, vendor‑agnostic API that lets developers instrument code once and switch tracing backends (e.g., Zipkin, Jaeger) without code changes. Implementations adapt the API to specific backends.
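
As an example, with the OpenTracing Python API the instrumentation below talks only to the generic Tracer interface; which backend receives the spans depends on the concrete tracer registered at startup (this is a minimal sketch, and load_order is a placeholder):

import opentracing

# Instrumentation depends only on the vendor-neutral API; the concrete tracer
# (e.g. a Zipkin- or Jaeger-compatible implementation) is registered at startup
# and resolved here through the global tracer (a no-op tracer by default).
tracer = opentracing.global_tracer()

def load_order(order_id: str) -> None:
    ...   # placeholder for real business logic

def handle_request(order_id: str) -> None:
    with tracer.start_active_span("load_order") as scope:
        scope.span.set_tag("order.id", order_id)
        try:
            load_order(order_id)
        except Exception as exc:
            scope.span.set_tag("error", True)
            scope.span.log_kv({"event": "error", "message": str(exc)})
            raise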


Written by Architecture Talk

Rooted in the "Dao" of architecture, we provide pragmatic, implementation-focused architecture content.
