How Distributed Tracing Solves Microservice Performance Mysteries with SkyWalking
This article explains the principles and benefits of distributed tracing systems, introduces OpenTracing standards, details SkyWalking’s architecture and mechanisms for automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance impact, and shares practical implementation experiences and custom plugin development within a real‑world microservice environment.
Distributed Tracing System Principles and Role
In microservice architectures, a single request often passes through multiple modules, middleware, and machines. This makes it hard to tell which applications, modules, or nodes are involved and in what order, and harder still to locate performance problems.
Three indicators matter most for an interface: its response time (RT), whether it returns abnormal responses, and where the main slowdown occurs.
Monolithic Architecture
Initially, many projects use a monolithic architecture. The simplest way to measure the three indicators is to use AOP to log timestamps before and after business logic execution and to catch exceptions, thereby calculating overall call time and pinpointing error sources.
Another AOP diagram shows how timestamps are printed around the business logic.
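The timing idea above can be sketched without any AOP framework by using a JDK dynamic proxy. This is only an illustration: OrderService and TimingProxy are hypothetical names, and a real project would more likely use Spring AOP or AspectJ advice.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical business interface, used only for this illustration.
interface OrderService {
    String placeOrder(String item);
}

final class TimingProxy {
    // Wraps an interface implementation so that every call is timed and
    // any exception is logged before being rethrown to the caller.
    static <T> T wrap(T target, Class<T> iface) {
        Object proxy = Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] {iface},
            (self, method, args) -> {
                long start = System.nanoTime();
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    // Unwrap so callers see the original business exception.
                    System.out.println(method.getName() + " failed: " + e.getCause());
                    throw e.getCause();
                } finally {
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    System.out.println(method.getName() + " took " + elapsedMs + " ms");
                }
            });
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        OrderService real = item -> "ordered:" + item;
        OrderService timed = wrap(real, OrderService.class);
        System.out.println(timed.placeOrder("book"));
    }
}
```

The business code stays untouched. The wrapper records elapsed time and the error source, which covers the three indicators for a single process but says nothing about calls that leave the machine.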
Microservice Architecture
When services are split across multiple machines, monitoring metrics become harder to implement. A request may travel through services A → C → B → D, each with several instances, making it difficult to know which specific machine handled each call.
Without tracing, troubleshooting microservices suffers from hard-to-locate issues, long debugging cycles, hard-to-reproduce scenarios, and difficult bottleneck analysis. A distributed tracing system addresses these problems by providing:
Automatic data collection
Complete call chain analysis (full trace): enables issue reproduction
Data visualization of each component’s performance
Distributed tracing can locate each request’s exact path, allowing easy tracing and performance analysis.
Distributed Call Chain Standard – OpenTracing
OpenTracing provides a lightweight, vendor‑agnostic API layer between applications/libraries and tracing or log analysis tools, ensuring compatibility across different tracing systems, similar to how JDBC abstracts database drivers.
OpenTracing’s data model consists of:
Trace: a complete request chain
Span: a single call with start and end timestamps
SpanContext: global context information such as traceId
When a request is made, a global traceId is generated and passed via SpanContext, linking all spans together.
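The relationship between these three concepts can be sketched with a few plain classes. These are simplified stand-ins, not the real io.opentracing interfaces, and the field names (parentSpanId, startMillis, endMillis) are assumptions made for the example.

```java
import java.util.UUID;

// Carries the identifiers that link spans together.
final class SpanContext {
    final String traceId; // shared by every span in the same trace
    final String spanId;  // unique per span

    SpanContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }
}

// One timed call within a trace.
final class Span {
    final String operation;
    final SpanContext context;
    final String parentSpanId; // null for the root span
    final long startMillis;
    long endMillis;

    private Span(String operation, SpanContext context, String parentSpanId) {
        this.operation = operation;
        this.context = context;
        this.parentSpanId = parentSpanId;
        this.startMillis = System.currentTimeMillis();
    }

    // Entry point of a request: generates a new traceId.
    static Span root(String operation) {
        return new Span(operation, new SpanContext(newId(), newId()), null);
    }

    // Downstream call: reuses the traceId and records its parent.
    Span child(String operation) {
        return new Span(operation, new SpanContext(context.traceId, newId()), context.spanId);
    }

    void finish() {
        endMillis = System.currentTimeMillis();
    }

    private static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

Grouping spans by traceId and following parentSpanId links is enough to rebuild the call tree of one request.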
SkyWalking Principles and Architecture
Automatic Span Collection
SkyWalking uses a plugin‑based approach combined with a Java agent to automatically collect span data without code intrusion. Each plugin can be enabled or removed independently, and new ones can be added.
Cross‑Process Context Propagation
Context is transmitted via message headers (e.g., HTTP headers or Dubbo attachments) rather than the body. In Dubbo, the attachment acts as a header, carrying the context transparently.
Tip: The context propagation is handled entirely by the Dubbo plugin, invisible to business code.
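As a rough sketch of header-based propagation, the caller serializes the trace context into a key-value carrier and the callee reads it back. The header name x-trace-context and the traceId:spanId encoding are assumptions chosen for the example. SkyWalking defines its own header and format.

```java
import java.util.HashMap;
import java.util.Map;

final class TraceContextCarrier {
    // Hypothetical header / attachment key.
    static final String HEADER = "x-trace-context";

    // Caller side: write the context into headers, never into the body.
    static void inject(String traceId, String spanId, Map<String, String> headers) {
        headers.put(HEADER, traceId + ":" + spanId);
    }

    // Callee side: returns {traceId, parentSpanId}, or null when the
    // header is missing or malformed (the callee then starts a new trace).
    static String[] extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return null;
        }
        int sep = value.lastIndexOf(':');
        if (sep <= 0 || sep == value.length() - 1) {
            return null;
        }
        return new String[] {value.substring(0, sep), value.substring(sep + 1)};
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new HashMap<>();
        inject("trace-1", "span-7", attachments);
        String[] ctx = extract(attachments);
        System.out.println(ctx[0] + " / " + ctx[1]);
    }
}
```

The same Map can stand in for HTTP headers or Dubbo attachments, which is why a plugin can do this without the business code noticing.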
Ensuring Global Unique traceId
SkyWalking generates IDs locally using the Snowflake algorithm for high performance. To handle clock rollback, it records the last timestamp; if the current time is earlier, a random number is used as the traceId.
Although random IDs could theoretically collide, the probability is negligible, and adding extra uniqueness checks would incur unnecessary performance overhead.
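The clock-rollback handling described above might look roughly like the following. This is a simplified sketch, not SkyWalking's actual generator: the ID layout (process ID, time part, sequence) and the injectable clock are assumptions made so the behavior is easy to follow and test.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

final class TraceIdGenerator {
    // Distinguishes this process from others; generated once at startup.
    private static final String PROCESS_ID =
        UUID.randomUUID().toString().replace("-", "");

    private final LongSupplier clock; // injectable so rollback can be simulated
    private long lastTimestamp;
    private int sequence;

    TraceIdGenerator(LongSupplier clock) {
        this.clock = clock;
    }

    synchronized String next() {
        long now = clock.getAsLong();
        long timePart;
        if (now < lastTimestamp) {
            // Clock moved backwards: use a random value instead of
            // blocking or checking uniqueness against a central store.
            timePart = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        } else {
            lastTimestamp = now;
            timePart = now;
        }
        sequence = (sequence + 1) % 10_000;
        return PROCESS_ID + "." + timePart + "." + sequence;
    }

    public static void main(String[] args) {
        TraceIdGenerator generator = new TraceIdGenerator(System::currentTimeMillis);
        System.out.println(generator.next());
    }
}
```

Everything happens locally and in memory, so generating an ID costs no network round trip.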
Performance Impact
SkyWalking samples at most three traces per three‑second window, which keeps overhead low. Benchmarks at 5000 TPS show that the impact on CPU, memory, and response time is almost negligible.
Compared with Zipkin and Pinpoint (response times 117 ms and 201 ms respectively), SkyWalking achieves ~22 ms under the same test conditions.
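The "three per three-second window" sampling rule can be illustrated with a small counter-based sampler. This is a minimal sketch under the assumption that the window resets lazily on the next request. WindowSampler is a hypothetical class, not SkyWalking's own sampling service.

```java
import java.util.function.LongSupplier;

final class WindowSampler {
    private final int maxPerWindow;
    private final long windowMillis;
    private final LongSupplier clock; // injectable for testing
    private long windowStart;
    private int sampledInWindow;

    WindowSampler(int maxPerWindow, long windowMillis, LongSupplier clock) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true for the first maxPerWindow requests in each window.
    synchronized boolean trySample() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            sampledInWindow = 0;
        }
        if (sampledInWindow < maxPerWindow) {
            sampledInWindow++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        WindowSampler sampler = new WindowSampler(3, 3000, System::currentTimeMillis);
        for (int i = 0; i < 5; i++) {
            System.out.println("request " + i + " sampled: " + sampler.trySample());
        }
    }
}
```

Because the number of collected traces is capped per window, the collection cost stays roughly constant no matter how high the request rate climbs.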
SkyWalking also offers non‑intrusive instrumentation, multi‑language support (Java, .NET Core, PHP, NodeJS, Go, Lua), and a rich plugin ecosystem covering components such as Dubbo and MySQL.
Extensible: custom plugins can be written following SkyWalking’s guidelines without code intrusion.
Our Company’s Practice on Distributed Call Chains
SkyWalking in Our Architecture
We only use SkyWalking’s agent for sampling, discarding its data storage, reporting, and visualization components because our existing monitoring system (Marvin) already satisfies most needs and replacing it would be costly.
Our Customizations and Practices
Force sampling in pre‑release environments to reproduce issues.
Fine‑grained group sampling (separate sampling for Redis, Dubbo, MySQL, etc.) within each three‑second window.
Embedding traceId into logs using a custom Log4j plugin.
Developed custom plugins for Memcached and Druid, which are not provided by SkyWalking out of the box.
Force Sampling
We set force_flag=true in the request cookie. The gateway propagates it via a Dubbo attachment, and the SkyWalking Dubbo plugin forces sampling whenever the flag is present.
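A rough sketch of that flow follows. The cookie parsing and the method names (propagate, shouldSample) are hypothetical. Only the force_flag name comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;

final class ForceSampling {
    static final String FORCE_FLAG = "force_flag";

    // Gateway side: copy the flag from the Cookie header into RPC attachments.
    static void propagate(String cookieHeader, Map<String, String> attachments) {
        if (cookieHeader == null) {
            return;
        }
        for (String pair : cookieHeader.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2 && FORCE_FLAG.equals(kv[0]) && "true".equals(kv[1])) {
                attachments.put(FORCE_FLAG, "true");
            }
        }
    }

    // Agent side: a forced request bypasses the normal rate-based decision.
    static boolean shouldSample(Map<String, String> attachments, boolean sampledByRate) {
        return "true".equals(attachments.get(FORCE_FLAG)) || sampledByRate;
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new HashMap<>();
        propagate("session=abc; force_flag=true", attachments);
        System.out.println(shouldSample(attachments, false));
    }
}
```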
Group Sampling
By default, SkyWalking samples the first three requests in a three‑second window, which can miss calls to other components. We modified it to take three samples per component type (Redis, Dubbo, MySQL, etc.) within the same window.
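One way to picture the per-component budget is a sampler that keeps a separate counter for each group and clears all counters when the window rolls over. GroupSampler is a hypothetical illustration of the idea, not the code of our modified agent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class GroupSampler {
    private final int maxPerGroup;
    private final long windowMillis;
    private final LongSupplier clock; // injectable for testing
    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStart;

    GroupSampler(int maxPerGroup, long windowMillis, LongSupplier clock) {
        this.maxPerGroup = maxPerGroup;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Each component type (e.g. "Redis", "Dubbo", "MySQL") gets its own budget.
    synchronized boolean trySample(String group) {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            counts.clear();
        }
        int used = counts.getOrDefault(group, 0);
        if (used >= maxPerGroup) {
            return false;
        }
        counts.put(group, used + 1);
        return true;
    }

    public static void main(String[] args) {
        GroupSampler sampler = new GroupSampler(3, 3000, System::currentTimeMillis);
        System.out.println(sampler.trySample("Redis"));
        System.out.println(sampler.trySample("MySQL"));
    }
}
```

With this scheme, a burst of Redis calls at the start of a window can no longer use up the samples that MySQL or Dubbo calls would need later in the same window.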
Embedding traceId in Logs
Using Log4j’s plugin mechanism, we define a custom pattern converter that replaces a %traceId placeholder with the current traceId, then configure Log4j to use this converter.
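The substitution itself is simple and can be shown without Log4j. The sketch below is a stand-in that uses only the JDK: TraceIdPattern, its ThreadLocal holder, and the N/A fallback are assumptions for illustration. In a real setup the same logic would sit inside a Log4j pattern converter.

```java
final class TraceIdPattern {
    // Holds the traceId of the request handled by the current thread.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void setTraceId(String traceId) {
        CURRENT.set(traceId);
    }

    static void clear() {
        CURRENT.remove();
    }

    // Replaces the %traceId and %msg placeholders in a log pattern.
    static String format(String pattern, String message) {
        String traceId = CURRENT.get();
        return pattern
            .replace("%traceId", traceId == null ? "N/A" : traceId)
            .replace("%msg", message);
    }

    public static void main(String[] args) {
        setTraceId("trace-42");
        System.out.println(format("[%traceId] %msg", "order created"));
        clear();
    }
}
```

Once every log line carries the traceId, a single search in the log platform returns all lines that belong to one request across services.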
Custom Plugins Development
We created plugins for Memcached and Druid following SkyWalking’s plugin specification, which consists of a definition class, instrumentation (pointcuts), and an interceptor (logic before/after method execution).
For the Dubbo plugin, we enhance the MonitorFilter’s invoke method to inject the global traceId into the invocation’s attachment before the business logic runs.
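The before/after interceptor idea can be sketched in plain Java as follows. The interface and class names here are hypothetical and much simpler than SkyWalking's real plugin interfaces. The point is only to show where the traceId is injected relative to the business call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interceptor shape: logic before and after the enhanced method.
interface MethodInterceptor {
    void before(Map<String, String> attachments);

    void after(Map<String, String> attachments, Object result);
}

final class TraceIdInjector implements MethodInterceptor {
    static final String KEY = "traceId"; // hypothetical attachment key
    private final String traceId;

    TraceIdInjector(String traceId) {
        this.traceId = traceId;
    }

    @Override
    public void before(Map<String, String> attachments) {
        // Inject the traceId unless an upstream caller already set one.
        attachments.putIfAbsent(KEY, traceId);
    }

    @Override
    public void after(Map<String, String> attachments, Object result) {
        // A real plugin would finish the span here.
    }
}

final class InterceptedInvoker {
    // Plays the role of the enhanced invoke method.
    static <R> R invoke(MethodInterceptor interceptor,
                        Map<String, String> attachments,
                        Function<Map<String, String>, R> businessCall) {
        interceptor.before(attachments);
        R result = businessCall.apply(attachments);
        interceptor.after(attachments, result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new HashMap<>();
        String seen = invoke(new TraceIdInjector("trace-1"), attachments,
            a -> "handled with " + a.get(TraceIdInjector.KEY));
        System.out.println(seen);
    }
}
```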
The plugin definition is declared in skywalking-plugin.def as follows:
<code># skywalking-plugin.def file
dubbo=org.apache.skywalking.apm.plugin.asf.dubbo.DubboInstrumentation</code>
These enhancements are completely transparent to the application code.
Conclusion
The article provides a comprehensive overview of distributed tracing systems, their role in microservice observability, and the inner workings of SkyWalking, including automatic span collection, context propagation, unique trace ID generation, sampling strategies, performance considerations, and practical customizations. Understanding these concepts helps engineers choose the most suitable tracing solution for their architecture.
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!