How Distributed Tracing Solves Microservice Performance Mysteries with SkyWalking

This article explains the principles and benefits of distributed tracing systems, introduces OpenTracing standards, details SkyWalking’s architecture and mechanisms for automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance impact, and shares practical implementation experiences and custom plugin development within a real‑world microservice environment.

Sanyou's Java Diary

Jan 8, 2024

Distributed Tracing System Principles and Role

In microservice architectures, a single request often involves multiple modules, middleware, and machines, making it hard to know which applications, modules, or nodes are involved and in what order, as well as to locate performance problems.

Three indicators matter most when judging an interface: its response time (RT), whether it returns abnormal responses, and where the main slowdown occurs.

Monolithic Architecture

Initially, many projects use a monolithic architecture. The simplest way to measure the three indicators is to use AOP to log timestamps before and after business logic execution and to catch exceptions, thereby calculating overall call time and pinpointing error sources.
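
The timing approach described above can be sketched without any framework by using a JDK dynamic proxy in place of an AOP library. This is only an illustration under stated assumptions: OrderService, SimpleOrderService, and TimingProxy are hypothetical names that do not come from the original project, and a real setup would more likely use Spring AOP or AspectJ.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical business interface, used only for this illustration.
interface OrderService {
    String placeOrder(String item);
}

class SimpleOrderService implements OrderService {
    @Override
    public String placeOrder(String item) {
        if (item == null || item.isEmpty()) {
            throw new IllegalArgumentException("item must not be empty");
        }
        return "order-for-" + item;
    }
}

final class TimingProxy {
    private TimingProxy() {}

    // Wraps target so that every call through iface is timed and failures are logged.
    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                // Log the error source, then rethrow the original business exception.
                System.err.println(method.getName() + " failed: " + e.getCause());
                throw e.getCause();
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(method.getName() + " took " + elapsedMs + " ms");
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}
```

Calling TimingProxy.wrap(new SimpleOrderService(), OrderService.class) returns an object that behaves like the original service but prints the elapsed time of each call. This works inside one process; it says nothing about time spent in other services, which is the gap the rest of the article addresses.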

Microservice Architecture

When services are split across multiple machines, monitoring metrics become harder to implement. A request may travel through services A → C → B → D, each with several instances, making it difficult to know which specific machine handled each call.

Troubleshooting in a microservice architecture is hard for several reasons: problems are difficult to locate, debugging cycles are long, failure scenarios are hard to reproduce, and performance bottlenecks are hard to analyze.

A distributed tracing system addresses these problems with three capabilities:

Automatic data collection

Complete call chain analysis (full trace): enables issue reproduction

Data visualization of each component’s performance

Distributed tracing can locate each request’s exact path, allowing easy tracing and performance analysis.

Distributed Call Chain Standard – OpenTracing

OpenTracing provides a lightweight, vendor‑agnostic API layer between applications/libraries and tracing or log analysis tools, ensuring compatibility across different tracing systems, similar to how JDBC abstracts database drivers.

OpenTracing’s data model consists of:

Trace: a complete request chain

Span: a single call with start and end timestamps

SpanContext: global context information such as traceId

When a request is made, a global traceId is generated and passed via SpanContext, linking all spans together.
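
To make the relationship between the three concepts concrete, here is a minimal sketch of the data model. The classes below are illustrative stand-ins written for this article, not the OpenTracing API; the spanId and parentSpanId fields are assumptions added to show how spans link into a tree.

```java
import java.util.UUID;

// Illustrative types only; these are NOT the OpenTracing API classes.
final class SpanContext {
    final String traceId;      // shared by every span in the same trace
    final String spanId;       // identifies this span (assumed field)
    final String parentSpanId; // null for the root span (assumed field)

    SpanContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }
}

final class Span {
    final SpanContext context;
    final String operation;
    final long startMillis;
    long endMillis;

    Span(SpanContext context, String operation) {
        this.context = context;
        this.operation = operation;
        this.startMillis = System.currentTimeMillis();
    }

    void finish() {
        endMillis = System.currentTimeMillis();
    }
}

final class SimpleTracer {
    private int nextSpanId = 0;

    // A new trace starts here: the traceId is generated exactly once.
    Span startRoot(String operation) {
        String traceId = UUID.randomUUID().toString();
        return new Span(new SpanContext(traceId, newSpanId(), null), operation);
    }

    // A child span reuses the parent's traceId and records the parent's spanId.
    Span startChild(SpanContext parent, String operation) {
        return new Span(
                new SpanContext(parent.traceId, newSpanId(), parent.spanId), operation);
    }

    private String newSpanId() {
        return "span-" + (nextSpanId++);
    }
}
```

Because every child copies the traceId from its parent's SpanContext, a backend can group all spans of one request by traceId and rebuild the call tree from the parent links.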

SkyWalking Principles and Architecture

Automatic Span Collection

SkyWalking combines a Java agent with a plugin mechanism to collect span data automatically, with no changes to application code. Plugins can be added or removed independently, and new ones can be written for components that are not yet covered.

Cross‑Process Context Propagation

Context is transmitted via message headers (e.g., HTTP headers or Dubbo attachments) rather than the body. In Dubbo, the attachment acts as a header, carrying the context transparently.

Tip: The context propagation is handled entirely by the Dubbo plugin, invisible to business code.
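
The header-based propagation described above can be sketched with a plain map standing in for HTTP headers or Dubbo attachments. The header name and the "traceId|parentSpanId" encoding below are assumptions made for this example; they are not SkyWalking's actual header name or wire format.

```java
import java.util.Map;

final class TraceContextCarrier {
    // Hypothetical header name for this sketch; not SkyWalking's real header.
    static final String HEADER = "x-demo-trace-context";

    private TraceContextCarrier() {}

    // Caller side: write the context into the outgoing headers/attachments.
    static void inject(String traceId, String parentSpanId, Map<String, String> headers) {
        headers.put(HEADER, traceId + "|" + parentSpanId);
    }

    // Callee side: read the context back; returns {traceId, parentSpanId},
    // or null when the header is missing or malformed.
    static String[] extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return null;
        }
        String[] parts = value.split("\\|", 2);
        return parts.length == 2 ? parts : null;
    }
}
```

The caller's plugin would run inject just before the request leaves the process, and the callee's plugin would run extract as soon as the request arrives. The request body is never touched, which is why business code does not notice.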

Ensuring Global Unique traceId

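SkyWalking generates IDs locally with a Snowflake-style algorithm, so no network round trip or central coordinator is needed. Because such IDs embed a timestamp, a clock rollback could produce duplicates. To guard against this, the generator records the last timestamp it used; if the current time is earlier than that, it falls back to a random number for the traceId.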

Although random IDs could theoretically collide, the probability is negligible, and adding extra uniqueness checks would incur unnecessary performance overhead.
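
The rollback handling can be illustrated with a small generator. This is a sketch of the idea under stated assumptions, not SkyWalking's actual implementation: the ID layout (process ID, time part, sequence), the injected clock, and the class name TraceIdGenerator are all choices made for this example.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

final class TraceIdGenerator {
    // Distinguishes this process from others; assumed layout for the sketch.
    private static final String PROCESS_ID =
            UUID.randomUUID().toString().replace("-", "");

    private final LongSupplier clock; // injectable so rollback can be simulated
    private long lastTimestamp;
    private int sequence;

    TraceIdGenerator(LongSupplier clock) {
        this.clock = clock;
    }

    synchronized String next() {
        long now = clock.getAsLong();
        long timePart;
        if (now < lastTimestamp) {
            // Clock moved backwards: a timestamp could repeat an earlier one,
            // so use a random value for the time part instead.
            timePart = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        } else {
            lastTimestamp = now;
            timePart = now;
        }
        // The sequence separates IDs generated within the same millisecond.
        sequence = (sequence + 1) % 10_000;
        return PROCESS_ID + "." + timePart + "." + sequence;
    }
}
```

In production the generator would be created as new TraceIdGenerator(System::currentTimeMillis). No lock across machines or database lookup is involved, which is the performance argument made above.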

Performance Impact

SkyWalking samples three requests per three-second window rather than tracing every request, which keeps overhead low. In benchmarks at 5000 TPS, the impact on CPU, memory, and response time is almost negligible.

Under the same test conditions, SkyWalking achieves a response time of about 22 ms, compared with 117 ms for Zipkin and 201 ms for Pinpoint.

SkyWalking has several other advantages:

Non-intrusive instrumentation: tracing is added by the agent, not by changes to business code.

Broad support: agents for multiple languages (Java, .NET Core, PHP, NodeJS, Go, Lua) and a rich ecosystem of plugins for components such as Dubbo and MySQL.

Extensible: custom plugins can be written following SkyWalking’s guidelines without code intrusion.

Our Company’s Practice on Distributed Call Chains

SkyWalking in Our Architecture

We only use SkyWalking’s agent for sampling, discarding its data storage, reporting, and visualization components because our existing monitoring system (Marvin) already satisfies most needs and replacing it would be costly.

Our Customizations and Practices

Force sampling in pre‑release environments to reproduce issues.

Fine‑grained group sampling (separate sampling for Redis, Dubbo, MySQL, etc.) within each three‑second window.

Embedding traceId into logs using a custom Log4j plugin.

Developed custom plugins for Memcached and Druid, which are not provided by SkyWalking out of the box.

Force Sampling

We add force_flag=true to the request cookie. The gateway propagates the flag in the Dubbo attachment, and the SkyWalking Dubbo plugin forces sampling whenever the flag is present.
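
A rough sketch of this flow follows, with a plain map standing in for the Dubbo attachment. Only the force_flag name comes from the text; the ForceSampling class, its method names, and the cookie parsing are assumptions for illustration and do not show the actual gateway or plugin code.

```java
import java.util.Map;

final class ForceSampling {
    static final String FORCE_FLAG = "force_flag";

    private ForceSampling() {}

    // Gateway side: copy the flag from the Cookie header into the RPC attachments.
    static void propagate(String cookieHeader, Map<String, String> attachments) {
        if (cookieHeader == null) {
            return;
        }
        for (String pair : cookieHeader.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2
                    && FORCE_FLAG.equals(kv[0])
                    && "true".equalsIgnoreCase(kv[1])) {
                attachments.put(FORCE_FLAG, "true");
            }
        }
    }

    // Plugin side: a forced flag overrides whatever the normal sampler decided.
    static boolean shouldSample(Map<String, String> attachments, boolean normalDecision) {
        return "true".equals(attachments.get(FORCE_FLAG)) || normalDecision;
    }
}
```

Because the flag travels in the attachment, every downstream service in the chain can see it, so the whole trace is captured and not only the first hop.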

Group Sampling

With a setting of three samples per three-second window, SkyWalking samples only the first three requests in each window, so calls to other components later in the window can be missed entirely. We modified the sampler to take three samples per component type (Redis, Dubbo, MySQL, etc.) within the same window.
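
The per-group idea can be sketched as a counter per component type that resets every three seconds. This is a simplified illustration, not the modified SkyWalking code: GroupSampler, the injected clock, and the fixed-window reset are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class GroupSampler {
    private static final long WINDOW_MILLIS = 3_000;

    private final int maxPerWindow;
    private final LongSupplier clock; // injectable so windows can be simulated
    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStart;

    GroupSampler(int maxPerWindow, LongSupplier clock) {
        this.maxPerWindow = maxPerWindow;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true if a call belonging to this group (e.g. "Redis") should be sampled.
    synchronized boolean trySample(String group) {
        long now = clock.getAsLong();
        if (now - windowStart >= WINDOW_MILLIS) {
            // A new window begins: every group gets a fresh budget.
            counts.clear();
            windowStart = now;
        }
        int used = counts.getOrDefault(group, 0);
        if (used >= maxPerWindow) {
            return false;
        }
        counts.put(group, used + 1);
        return true;
    }
}
```

With new GroupSampler(3, System::currentTimeMillis), a burst of Redis calls can use up only the Redis budget; a MySQL call arriving later in the same window is still sampled.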

Embedding traceId in Logs

Using Log4j’s plugin mechanism, we define a custom pattern converter that replaces a %traceId placeholder with the current traceId, then configure Log4j to use this converter.
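
The mechanism behind such a converter is simple: look up the traceId bound to the current thread and write it wherever the placeholder appears. The sketch below shows that idea in plain Java only. It deliberately does not use the Log4j API; TraceContext and TraceIdPattern are hypothetical names, and in a real setup the lookup would sit inside a Log4j pattern converter registered through Log4j's plugin mechanism.

```java
// Holds the traceId for the current thread (hypothetical helper for this sketch).
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    private TraceContext() {}

    static void set(String traceId) {
        TRACE_ID.set(traceId);
    }

    static String get() {
        return TRACE_ID.get();
    }

    static void clear() {
        TRACE_ID.remove();
    }
}

final class TraceIdPattern {
    private TraceIdPattern() {}

    // Expands the %traceId and %msg placeholders; this stands in for what a
    // logging framework's pattern converter would do for each log event.
    static String format(String pattern, String message) {
        String traceId = TraceContext.get();
        return pattern
                .replace("%traceId", traceId == null ? "N/A" : traceId)
                .replace("%msg", message);
    }
}
```

With the traceId in every log line, a single search by traceId pulls together the log output of all services that handled one request.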

Custom Plugins Development

We created plugins for Memcached and Druid following SkyWalking’s plugin specification, which consists of a definition class, instrumentation (pointcuts), and an interceptor (logic before/after method execution).

For the Dubbo plugin, we enhance the MonitorFilter’s invoke method to inject the global traceId into the invocation’s attachment before the business logic runs.

The plugin definition is declared in skywalking-plugin.def as follows:

// skywalking-plugin.def file
dubbo=org.apache.skywalking.apm.plugin.asf.dubbo.DubboInstrumentation

These enhancements are completely transparent to the application code.
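
The interceptor part of a plugin (logic before and after the target method) can be sketched as follows. All types here are simplified stand-ins written for this article: they are not Dubbo or SkyWalking classes, and the real agent weaves the interceptor in through bytecode enhancement, not an explicit wrapper call like EnhancedInvoker.invoke.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in for an RPC invocation; not the Dubbo class.
final class Invocation {
    final String methodName;
    final Map<String, String> attachments = new HashMap<>();

    Invocation(String methodName) {
        this.methodName = methodName;
    }
}

// Hypothetical interceptor contract: hooks around the enhanced method.
interface Interceptor {
    void beforeMethod(Invocation invocation);

    void afterMethod(Invocation invocation, Object result);
}

final class TraceIdInjectingInterceptor implements Interceptor {
    private final String traceId;

    TraceIdInjectingInterceptor(String traceId) {
        this.traceId = traceId;
    }

    @Override
    public void beforeMethod(Invocation invocation) {
        // Runs before the original method: put the traceId into the attachment.
        invocation.attachments.put("traceId", traceId);
    }

    @Override
    public void afterMethod(Invocation invocation, Object result) {
        // A real plugin would close the span here; nothing to do in this sketch.
    }
}

final class EnhancedInvoker {
    private EnhancedInvoker() {}

    // Mimics the effect of enhancement: before hook, original method, after hook.
    static Object invoke(Invocation invocation, Interceptor interceptor,
                         Function<Invocation, Object> original) {
        interceptor.beforeMethod(invocation);
        Object result = original.apply(invocation);
        interceptor.afterMethod(invocation, result);
        return result;
    }
}
```

The original method receives an invocation that already carries the traceId, without containing any tracing code itself.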

Conclusion

This article has given an overview of distributed tracing systems and their role in microservice observability, and has walked through the inner workings of SkyWalking: automatic span collection, context propagation, unique trace ID generation, sampling strategies, performance considerations, and practical customizations. Understanding these concepts helps engineers choose the most suitable tracing solution for their architecture.
