How Does Distributed Link Tracing Work? Inside SkyWalking’s Architecture
This article explains the concept of distributed link tracing, its principles, metrics, and implementation details—including monolithic and microservice approaches, OpenTracing standards, and how SkyWalking solves challenges like automatic span collection, context propagation, unique trace IDs, and sampling performance.
In distributed systems, especially microservice architectures, a single external request often triggers calls across multiple internal modules, middleware components, and machines. Determining which applications, modules, and nodes are involved, in what order they are called, and how each one performs is the challenge that link tracing addresses.
What Is Link Tracing?
Link tracing reconstructs a distributed request into a call chain, displaying each service node’s latency, target machine, and request status.
Principles of Link Tracing
Key metrics for an interface include:
Response time (RT)
Exception responses
Location of slow requests
Monolithic Architecture
In early stages, systems are monolithic. Using AOP (Aspect‑Oriented Programming), we can record start and end times around business logic to calculate total latency and capture exceptions with minimal code intrusion.
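As a minimal sketch of this idea, the Python decorator below plays the role of an AOP "around" advice: it wraps a business function, measures its latency, and records any exception without touching the function body. The names `traced`, `RECORDS`, and `place_order` are hypothetical and used only for illustration.

```python
import functools
import time

# Hypothetical in-memory sink; a real system would ship these records elsewhere.
RECORDS = []


def traced(func):
    """Wrap func so that its latency and any raised exception are recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise  # re-raise so callers still see the failure
        finally:
            RECORDS.append({
                "name": func.__name__,
                "elapsed_ms": (time.perf_counter() - start) * 1000.0,
                "error": error,
            })
    return wrapper


@traced
def place_order(quantity):
    """Stand-in business logic."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity * 2
```

Calling `place_order(3)` appends one record with the elapsed time and `error` set to `None`; a failing call appends a record whose `error` field holds the exception text.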
Microservice Architecture
As services grow, they split into microservices. When a page is slow, the request may traverse A → C → B → D across many machines, making it hard to pinpoint the problematic service or node.
Link tracing solves three main pain points:
Difficult and lengthy issue diagnosis
Hard-to-reproduce scenarios
Complex performance bottleneck analysis
It automatically collects data, builds a complete call chain, and visualizes component performance.
OpenTracing Standard
OpenTracing provides a lightweight, vendor‑agnostic API layer between applications and tracing systems, similar to JDBC’s standard interface.
Its data model consists of:
Trace : a complete request chain
Span : a single call with start and end timestamps
SpanContext : global context (e.g., traceId) passed between spans
These concepts enable distributed tracing systems to capture and correlate calls across services.
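The three concepts can be sketched as plain data classes. This is an illustrative model only, not the OpenTracing API itself: the class layout, the `start_span` helper, and the use of `uuid` for identifiers are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """The part of a span that travels between calls."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    """A single call with start and end timestamps."""
    operation: str
    context: SpanContext
    parent_span_id: Optional[str] = None
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self):
        self.end = time.time()


def start_span(operation, parent=None):
    """Start a root span, or a child span when a parent context is given."""
    # A child reuses the parent's trace_id; a root span creates a new one.
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    return Span(
        operation=operation,
        context=SpanContext(trace_id=trace_id, span_id=uuid.uuid4().hex[:16]),
        parent_span_id=parent.span_id if parent else None,
    )
```

A trace is then simply the set of spans that share one `trace_id`, for example a root span from `start_span("GET /order")` plus a child created with `start_span("query-db", parent=root.context)`.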
Collector Role
The collector gathers:
A global trace_id shared by every call in the request
span_id to identify each call
parent_span_id to link child calls to their parents
Collected data is stored in Elasticsearch, MySQL, etc., for visualization.
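To show why these three identifiers are enough to rebuild a call chain, here is a small sketch that nests flat span records into a tree. The record layout (dictionaries with `trace_id`, `span_id`, `parent_span_id`, and `name`) and the function name `build_call_tree` are assumptions for the example, not a format defined by any particular collector.

```python
from collections import defaultdict


def build_call_tree(spans):
    """Group span records by trace_id and nest them by parent_span_id."""
    children = defaultdict(list)
    for span in spans:
        children[(span["trace_id"], span["parent_span_id"])].append(span)

    def attach(trace_id, parent_id):
        # Every span whose parent is parent_id becomes a child node.
        return [
            {"name": s["name"], "children": attach(trace_id, s["span_id"])}
            for s in children.get((trace_id, parent_id), [])
        ]

    trace_ids = {s["trace_id"] for s in spans}
    # Root spans are the ones with no parent.
    return {tid: attach(tid, None) for tid in trace_ids}


# Hypothetical records for a request that goes A -> C -> B.
SPANS = [
    {"trace_id": "t1", "span_id": "1", "parent_span_id": None, "name": "A"},
    {"trace_id": "t1", "span_id": "2", "parent_span_id": "1", "name": "C"},
    {"trace_id": "t1", "span_id": "3", "parent_span_id": "2", "name": "B"},
]
```

`build_call_tree(SPANS)["t1"]` yields a single root node for A, with C nested under it and B nested under C.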
SkyWalking Architecture
SkyWalking uses a plugin-based Java agent to collect spans automatically, without code changes. Trace context is propagated across process boundaries in request headers or equivalent metadata (for example, Dubbo attachments). Unique trace IDs are generated locally with a Snowflake-like algorithm; if the system clock rolls back, the generator falls back to a random ID.
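The ID scheme can be illustrated with a small sketch. This is not SkyWalking's actual implementation or ID format: the class name `TraceIdGenerator`, the `instance.timestamp.sequence` layout, and the `.r.` marker for fallback IDs are assumptions chosen to show the idea of a locally generated, timestamp-based ID with a random fallback on clock rollback.

```python
import random
import threading
import time


class TraceIdGenerator:
    """Snowflake-like IDs: instance id + millisecond timestamp + sequence."""

    def __init__(self, instance_id, clock=time.time):
        self.instance_id = instance_id
        self._clock = clock  # injectable so the rollback path can be exercised
        self._last_ms = 0
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now_ms = int(self._clock() * 1000)
            if now_ms < self._last_ms:
                # Clock moved backwards: the timestamp can no longer guarantee
                # uniqueness, so fall back to a random ID for this call.
                return f"{self.instance_id}.r.{random.getrandbits(63)}"
            if now_ms == self._last_ms:
                self._seq += 1  # same millisecond: bump the sequence
            else:
                self._last_ms = now_ms
                self._seq = 0
            return f"{self.instance_id}.{now_ms}.{self._seq}"
```

Because each ID is built from local state only, no central ID service is involved in the request path; the trade-off in this sketch is that uniqueness across machines depends on every `instance_id` being distinct.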
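Sampling reduces overhead: in the configuration described here, the agent collects 3 samples every 3 seconds. If an upstream service has already sampled a request, the propagated context forces downstream services to collect it as well, so every sampled trace stays complete.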
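A rough sketch of this sampling rule follows. The `Sampler` class, the `should_collect` function, and the `x-trace-id` header name are hypothetical; the sketch only illustrates a per-window budget combined with forced collection when an upstream context is present.

```python
import time


class Sampler:
    """Allow at most `limit` new traces per 3-second window."""

    WINDOW_SECONDS = 3.0

    def __init__(self, limit=3, clock=time.monotonic):
        self.limit = limit
        self._clock = clock  # injectable so windows can be simulated
        self._window_start = clock()
        self._count = 0

    def try_sample(self):
        now = self._clock()
        if now - self._window_start >= self.WINDOW_SECONDS:
            # A new window begins: reset the budget.
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False


def should_collect(sampler, incoming_headers):
    """Decide whether this service records spans for the current request."""
    # "x-trace-id" is a hypothetical header name. If it is present, the
    # upstream service already sampled this request, so collection is forced
    # here regardless of the local budget.
    if incoming_headers.get("x-trace-id"):
        return True
    return sampler.try_sample()
```

Forcing collection for upstream-sampled requests means a downstream service may exceed its own budget, which is the price of never producing a trace with missing segments.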
Performance Evaluation
The benchmarks cited show that SkyWalking adds negligible CPU, memory, and latency overhead at 5000 TPS. In the same comparison, SkyWalking's response time was significantly lower than that of Zipkin and Pinpoint (22 ms versus 117 ms and 201 ms, respectively), and its instrumentation is non-intrusive.
Additional advantages include multi‑language support (Java, .NET Core, PHP, Node.js, Go, Lua) and extensible plugins.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.