
Design and Implementation of a Low‑Impact Distributed Tracing System for Service Calls

This article describes the background, design goals, architecture, implementation details, and lessons learned from building a low‑overhead, low‑intrusion distributed tracing system using Kafka, Elasticsearch, and OpenTracing to monitor microservice interactions and support performance analysis and DevOps decision‑making.

Hujiang Technology

Background

As the company's services grew rapidly, the call relationships between services became increasingly complex, making it critical to trace and monitor request flows across multiple microservices, databases, and caches for troubleshooting and process optimization.

Design Goals

Low overhead: tracing should have minimal impact on highly optimized services.

Low intrusion: the tracing component should be transparent and require little developer effort.

Timeliness: data collection, processing, and visualization must be fast.

Decision support: provide useful metrics for DevOps decisions.

Data visualization: enable visual filtering without reading raw logs.

Implemented Functions

Fault location: the complete call trace of a request is displayed, so the failing hop can be pinpointed.

Performance analysis: latency is recorded per segment, making bottlenecks visible at a glance.

Data analysis: the complete business log supports aggregation of user behavior paths.

Design Approach

The solution follows the distributed tracing model popularized by Google Dapper and implemented in open‑source projects such as Twitter Zipkin and Alibaba EagleEye. By linking all spans of a request, the system provides end‑to‑end visibility.

Typical Distributed Call Process

A request originates from a client, passes through a front‑end service (A), then to intermediate services (B, C), and finally to back‑ends (D, E). Each RPC is instrumented to emit trace events.

cs - CLIENT_SEND: the client sends the request
sr - SERVER_RECEIVE: the server receives the request
ss - SERVER_SEND: the server finishes processing and sends the result to the client
cr - CLIENT_RECEIVE: the client receives the response
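These four annotations let the system decompose latency by simple subtraction: total client-side time is cr − cs, server processing time is ss − sr, and the difference between the two approximates network and queuing overhead. A minimal sketch (the timestamps in `main` are illustrative):

```java
// Deriving latencies from the four span annotations cs, sr, ss, cr.
// All timestamps are in milliseconds and purely illustrative.
public class SpanTiming {
    // Total time the client waited for the call.
    public static long clientDuration(long cs, long cr) { return cr - cs; }

    // Time the server actually spent processing.
    public static long serverDuration(long sr, long ss) { return ss - sr; }

    // Round-trip minus server time approximates network + queuing cost.
    public static long networkOverhead(long cs, long sr, long ss, long cr) {
        return clientDuration(cs, cr) - serverDuration(sr, ss);
    }

    public static void main(String[] args) {
        long cs = 0, sr = 15, ss = 95, cr = 120;
        System.out.println(clientDuration(cs, cr));          // 120
        System.out.println(serverDuration(sr, ss));          // 80
        System.out.println(networkOverhead(cs, sr, ss, cr)); // 40
    }
}
```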

Technical Selection

Considering the company's HTTP‑centric scenario, the design adopts the Zipkin implementation philosophy and follows the OpenTracing standard for multi‑language compatibility.

System Design

Overall Architecture

The tracing system consists of four main components: data instrumentation, data transmission, data storage, and a query UI.

Data Instrumentation

Integrate an SDK into the unified development framework for low‑intrusion data collection.

Use AOP to store trace data in a ThreadLocal variable, keeping the instrumentation transparent to application code.

Record TraceId, service name, endpoint, start time, and duration.

Send data asynchronously to a Kafka queue to minimize impact on business logic.

Supported middleware includes HTTP, MySQL, and RabbitMQ.
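The collection path described above can be sketched as follows. This is an illustrative outline, not the actual SDK: the Kafka producer is stood in for by an in-memory queue, and every class and method name here is an assumption:

```java
// Sketch of the instrumentation idea: trace context lives in a
// ThreadLocal, and finished spans are handed off without blocking the
// business thread. A BlockingQueue stands in for the Kafka producer.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TraceCollector {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    public void startTrace(String traceId) { TRACE_ID.set(traceId); }

    // Non-blocking hand-off: if the buffer is full, drop the span
    // rather than stall business logic.
    public boolean report(String endpoint, long durationMs) {
        String span = TRACE_ID.get() + "|" + endpoint + "|" + durationMs;
        return buffer.offer(span);
    }

    // Clearing the ThreadLocal prevents span accumulation on pooled threads.
    public void endTrace() { TRACE_ID.remove(); }

    // In the real system a background sender drains toward Kafka.
    public String drainOne() { return buffer.poll(); }

    public static void main(String[] args) {
        TraceCollector c = new TraceCollector();
        c.startTrace("trace-42");
        c.report("GET /users", 35);
        c.endTrace();
        System.out.println(c.drainOne()); // trace-42|GET /users|35
    }
}
```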

Data Transmission

A Kafka layer between the SDK and backend services decouples components and buffers data, preventing loss during traffic spikes at the cost of some latency.
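The article does not list the producer settings; the sketch below shows one plausible configuration for this "buffering at the cost of some latency" trade-off, using standard Kafka producer keys with illustrative values:

```java
// Hedged sketch of producer settings that favor batching over latency.
// The keys are standard Kafka producer configs; the values and broker
// address are illustrative, not the actual production configuration.
import java.util.Properties;

public class TracingProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder address
        props.put("acks", "1");                // leader ack only: lower latency
        props.put("linger.ms", "50");          // wait up to 50 ms to fill a batch
        props.put("batch.size", "65536");      // 64 KiB batches amortize requests
        props.put("compression.type", "snappy"); // shrink span payloads on the wire
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("linger.ms")); // 50
    }
}
```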

Data Storage

Spans and annotations are stored in Elasticsearch, retaining the most recent month of data to balance storage cost and query performance.
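One common way to implement a rolling one-month window in Elasticsearch is an index per day, deleted once it falls out of the window; the article does not specify the scheme, so the index naming below is an assumption:

```java
// Hypothetical sketch: one index per day ("spans-yyyy.MM.dd"), with a
// month of retention implemented by deleting indices older than 30 days.
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class IndexRetention {
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("'spans-'yyyy.MM.dd");

    // Index name a span written on `day` would land in.
    public static String indexFor(LocalDate day) { return day.format(FMT); }

    // True once the whole daily index is past the retention window.
    public static boolean expired(LocalDate day, LocalDate today) {
        return day.isBefore(today.minusDays(30));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2017, 6, 30);
        System.out.println(indexFor(today));                        // spans-2017.06.30
        System.out.println(expired(LocalDate.of(2017, 5, 1), today)); // true
    }
}
```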

Query Interface

A web UI visualizes the distributed call graph, offering trace trees, dependency analysis, and project‑level aggregation.

Challenges Encountered

Web Page Load Timeout

Loading all spans at once caused timeouts for projects with millions of spans; the UI was rewritten to lazy‑load the latest ten spans and support dynamic search.
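A hypothetical sketch of that fix: each UI page asks Elasticsearch for only ten spans at a time via standard from/size pagination, sorted newest first (the `timestamp` field name is an assumption):

```java
// Illustrative paged query builder: fetch ten spans per page, newest
// first, instead of loading millions of spans in one request.
public class SpanPage {
    static final int PAGE_SIZE = 10;

    // Standard Elasticsearch from/size pagination body.
    public static String queryBody(int page) {
        return String.format(
            "{\"from\":%d,\"size\":%d,\"sort\":[{\"timestamp\":\"desc\"}]}",
            page * PAGE_SIZE, PAGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(queryBody(0));
        System.out.println(queryBody(2)); // skips the first 20 spans
    }
}
```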

Span Accumulation

When HTTP client timeouts were not intercepted, spans remained in ThreadLocal, leading to thousands of entries; the SDK was updated to catch timeout exceptions and clean up the thread‑local storage.
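The fix boils down to a try/finally guard: the ThreadLocal is cleared whether the instrumented call returns normally or times out. A minimal sketch with illustrative names:

```java
// Sketch of the cleanup fix: the finally block clears the ThreadLocal
// on success *and* on timeout, so spans can no longer accumulate on
// pooled threads. All names are illustrative, not the actual SDK.
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class SafeInstrumentation {
    private static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    public static String traced(String spanId, Callable<String> call) throws Exception {
        CURRENT_SPAN.set(spanId);
        try {
            return call.call();
        } finally {
            CURRENT_SPAN.remove(); // runs even when call() throws
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            traced("span-1", () -> { throw new TimeoutException("client timeout"); });
        } catch (Exception e) {
            System.out.println(CURRENT_SPAN.get()); // null: no accumulation
        }
    }
}
```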

Conclusion

By generating a globally unique TraceID for each request and linking all participating services, the tracing system enables call‑path analysis, performance bottleneck identification, and rapid fault isolation, providing valuable support for DevOps and operational decision‑making.
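The article does not specify how the globally unique TraceID is generated; a random UUID is one common choice, shown here purely as an illustration:

```java
// Hedged sketch: one common way to produce a globally unique TraceID.
// The actual system's ID scheme is not described in the article.
import java.util.UUID;

public class TraceIds {
    // 32 lowercase hex characters, unique with overwhelming probability.
    public static String next() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(next().length()); // 32
    }
}
```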

References

Google Dapper – http://bigbully.github.io/Dapper-translation/

Twitter Zipkin – http://zipkin.io/

Tracing article – http://www.cnblogs.com/zhengyun_ustc/p/55solution2.html
