
Design and Implementation of a Distributed Tracing System at Qunar: Architecture, Technical Selection, and Performance Optimizations

This article describes the background, technology selection, architecture design, data flow, and the monitoring, logging, and trace collection mechanisms of Qunar's self‑built distributed tracing system. It analyzes the major performance problems encountered, including Flume interruptions, Kafka bottlenecks, and Flink back‑pressure, and presents concrete solutions such as sliding‑window throttling, CGroup limits, and JavaAgent instrumentation, which ultimately improved trace connectivity and system observability.

Qunar Tech Salon

Background – As distributed systems grow in scale, Qunar needed a unified observability solution covering monitoring, logging, and tracing. Existing Watcher, Radar, and ELK components lacked a comprehensive distributed tracing capability, prompting the development of a custom APM system based on JavaAgent.

Technical Selection – The observability stack follows the three pillars of cloud‑native monitoring: Prometheus + Grafana for metrics, ELK/Loki for logs, and SkyWalking/Jaeger for tracing. Data ingestion uses Apache Flume and Kafka, processing runs on Flink, and storage lands in HBase (with auxiliary MySQL). The UI is built with React.

Architecture Design – Trace collection is achieved via custom middleware instrumentation for critical services and JavaAgent‑based automatic instrumentation for open‑source components. The data pipeline flows from agents → Flume → Kafka → Flink → HBase/MySQL, where aggregated results feed the web UI.

Data Flow Diagram – Shows the end‑to‑end path of trace logs, metrics, and log events through the aforementioned components.

Trace Logging and Reporting – Agents handle trace log generation and upload; Flume is customized to avoid log loss and support per‑line collection. Kafka transports logs to Flink, which aggregates failures, timeouts, and topology information. Metrics are sampled by Watcher agents and linked to trace IDs for joint queries.

UI Presentation – The web UI visualizes call topologies, error rates, slow spans, and related logs by querying HBase and MySQL.

Issues and Solutions

Trace interruption caused by Flume performance limits – solved by expanding memory buffers, converting sinks to asynchronous mode, and applying sliding‑window rate limiting.
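The article does not show Qunar's throttle code, but the sliding‑window rate limiting it mentions can be sketched as a ring of time buckets: admit a log line only while the total count across the buckets that fall inside the current window stays under a limit. All class and parameter names below are illustrative assumptions, not the actual implementation.

```java
// Sketch of a sliding-window throttle: the window is split into fixed-size
// buckets; tryAcquire admits an event only if the sum over the buckets still
// inside the window is below the limit. Names and defaults are assumptions.
public class SlidingWindowThrottle {
    private final long[] bucketStart;  // bucket id each slot currently holds
    private final int[] bucketCount;   // events counted in that slot
    private final int buckets;
    private final long bucketMillis;
    private final int maxPerWindow;

    public SlidingWindowThrottle(int maxPerWindow, long windowMillis, int buckets) {
        this.maxPerWindow = maxPerWindow;
        this.buckets = buckets;
        this.bucketMillis = windowMillis / buckets;
        this.bucketStart = new long[buckets];
        this.bucketCount = new int[buckets];
    }

    public synchronized boolean tryAcquire(long nowMillis) {
        long bucketId = nowMillis / bucketMillis;
        int idx = (int) (bucketId % buckets);
        if (bucketStart[idx] != bucketId) {  // slot holds an expired bucket: reuse it
            bucketStart[idx] = bucketId;
            bucketCount[idx] = 0;
        }
        int total = 0;
        for (int i = 0; i < buckets; i++) {
            // only count slots whose bucket id is still inside the window
            if (bucketId - bucketStart[i] < buckets) total += bucketCount[i];
        }
        if (total >= maxPerWindow) return false;  // over the limit: drop or queue
        bucketCount[idx]++;
        return true;
    }
}
```

Compared with a single fixed window, the bucketed layout avoids the burst that occurs when a full window resets all at once.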

Kafka throughput bottleneck – mitigated by increasing partition count, upgrading disks to SSD, and tuning producer/consumer settings.
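The producer/consumer tuning mentioned above typically means larger batches, a small linger, and compression. A minimal sketch of such producer settings follows; the concrete values are assumptions for illustration, not Qunar's production config (the property keys themselves are standard Kafka producer options).

```java
import java.util.Properties;

// Illustrative Kafka producer settings for high-volume trace logs.
// Values are assumptions; only the keys are standard producer options.
public class TracingProducerConfig {
    public static Properties build(String bootstrapServers) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrapServers);
        p.put("batch.size", "262144");        // 256 KB batches: fewer, larger requests
        p.put("linger.ms", "20");             // wait up to 20 ms to fill a batch
        p.put("compression.type", "lz4");     // cheap CPU, big wire savings on text logs
        p.put("acks", "1");                   // leader-only ack: trace data tolerates rare loss
        p.put("buffer.memory", "134217728");  // 128 MB in-flight buffer absorbs bursts
        return p;
    }
}
```

Raising the partition count, as the article notes, then lets more consumers (and Flink sub‑tasks) drain these batches in parallel.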

Flink back‑pressure due to high QPS (≈3 M) – addressed by balancing sub‑tasks, enlarging JVM heap, using in‑memory maps instead of window aggregations, and sharing JVMs across operators.
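The "in‑memory maps instead of window aggregations" idea can be sketched outside Flink as a keyed counter map that is drained on a timer, rather than buffering events in windowed state. The class below is a simplified illustration under that assumption, not the actual Flink job code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: accumulate per-service failure counts in an in-memory map and emit
// periodic snapshots, instead of holding every event in windowed state.
public class InMemoryAggregator {
    private final Map<String, LongAdder> errorCounts = new ConcurrentHashMap<>();

    public void record(String service) {
        errorCounts.computeIfAbsent(service, k -> new LongAdder()).increment();
    }

    // Called on a flush interval: snapshot the counters and reset them,
    // so memory stays bounded by the number of keys, not the event rate.
    public Map<String, Long> drain() {
        Map<String, Long> snapshot = new HashMap<>();
        for (Map.Entry<String, LongAdder> e : errorCounts.entrySet()) {
            snapshot.put(e.getKey(), e.getValue().sumThenReset());
        }
        return snapshot;
    }
}
```

The trade‑off is weaker fault‑tolerance guarantees than checkpointed window state, which is acceptable for approximate failure and timeout statistics at ~3M QPS.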

Trace connectivity loss across threads/processes – resolved with JavaAgent automatic instrumentation (QTracer.wrap) that propagates context through Runnable, Callable, ExecutorService, RxJava, Reactor, etc.
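The core trick behind wrap‑style propagation is capture‑and‑restore of a thread‑local context: snapshot the trace context on the submitting thread, install it inside the task on the worker thread, and restore the previous value afterwards so pooled threads don't leak context. The sketch below shows the mechanism with a plain `ThreadLocal<String>`; the real QTracer internals are not shown in the article, so this is an illustration, not its actual code.

```java
// Minimal sketch of QTracer.wrap-style context propagation.
// TRACE_ID stands in for the real trace context; names are assumptions.
public class TracePropagation {
    // per-thread trace context, normally set by the instrumented entry point
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static Runnable wrap(Runnable task) {
        String captured = TRACE_ID.get();      // snapshot on the caller thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);            // restore inside the worker thread
            try {
                task.run();
            } finally {
                TRACE_ID.set(previous);        // don't leak across pooled threads
            }
        };
    }
}
```

The JavaAgent automates exactly this wrapping by instrumenting `Runnable`, `Callable`, `ExecutorService`, and the reactive libraries at class-load time, so application code no longer has to call the wrapper explicitly.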

JavaAgent Performance – Benchmarks show that for HTTP requests longer than 50 ms, the agent adds ≤4 % latency and ≤4 % throughput reduction; for cross‑thread scenarios the impact is similar (≈3 %).

Conclusion – The self‑built APM system, after iterative optimization, raised trace connectivity from ~20 % to >80 %, providing a solid foundation for full‑stack observability, chaos engineering, and performance testing.

Code Example

// supplyAsync: QTraceSupplier carries the trace context into the async task
CompletableFuture<Integer> future = CompletableFuture.supplyAsync(new QTraceSupplier<>(() -> {
    LOG.info("supplyAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
    return 1;
}));
Integer i = future.get();
LOG.info(String.valueOf(i));

// runAsync: QTracer.wrap propagates the current trace context to the Runnable
CompletableFuture<Void> future1 = CompletableFuture.runAsync(QTracer.wrap(() -> {
    LOG.info("runAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));
future1.get();

// explicit wrapping for tasks submitted to an ExecutorService
executor.submit(QTracer.wrap(() -> {
    LOG.info("in lambda------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));

// anonymous Runnable: with JavaAgent auto-instrumentation the trace id is
// preserved even without explicit wrapping
executor.submit(new Runnable() {
    @Override
    public void run() {
        LOG.info("in runnable------" + QTraceClientGetter.getClient().getCurrentTraceId());
    }
});
Tags: Performance Optimization, Flink, APM, Kafka, Distributed Tracing, JavaAgent
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
