
Building and Optimizing a Distributed Tracing System for Qunar Travel: APM Architecture, Performance Bottlenecks, and Solutions

This article details Qunar Travel's end‑to‑end design and optimization of a distributed tracing system within its APM platform, covering architecture choices, log‑collection and Kafka transmission bottlenecks, Flink task tuning, and the business value derived from trace and metric analysis.

Qunar Tech Salon

Distributed tracing systems play a crucial role in an enterprise APM ecosystem. This article shares Qunar Travel's practical experience in building such a system, starting from overall APM architecture design and describing performance‑optimizing practices and pitfalls in log collection, Kafka transmission, and Flink processing.

Background: Since the early 2010s, open‑source APM components such as SkyWalking and Jaeger have emerged, making distributed tracing indispensable for many companies. Qunar Travel, an OTA platform built around search and transaction services, adopts a full‑sampling strategy for transaction traces, which its moderate traffic volume makes affordable.

APM Architecture: The system combines agent‑based instrumentation with middleware‑level tracing. Log collection is handled by a customized Apache Flume, data transport by Kafka, stream processing by Flink, and metrics are stored in a time‑series database and visualized with Prometheus + Grafana. The architecture diagram (see image) highlights, in red, the areas that required major refactoring.
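To make the data path concrete, the stages above can be modeled as a simple function composition. This is a toy sketch, not Qunar's code: every class and method name here is illustrative, and each stage is reduced to a string transformation so the composition mirrors the collect → transport → process → store flow of the diagram.

```java
import java.util.function.Function;

// Toy model of the trace-data path: agent/middleware span logs pass
// through a Flume-style collector, a Kafka topic, a Flink job, and
// finally a time-series store. All names are illustrative.
public class TracePipeline {
    public static String collect(String rawLog)      { return "flume:" + rawLog; }
    public static String transport(String collected) { return "kafka:" + collected; }
    public static String process(String message)     { return "flink:" + message; }
    public static String store(String metric)        { return "tsdb:" + metric; }

    // Compose the four stages in architecture order.
    public static String run(String rawLog) {
        Function<String, String> pipeline =
            ((Function<String, String>) TracePipeline::collect)
                .andThen(TracePipeline::transport)
                .andThen(TracePipeline::process)
                .andThen(TracePipeline::store);
        return pipeline.apply(rawLog);
    }
}
```

The point of the model is that each stage is independently replaceable, which is what allowed the red‑marked components to be refactored one at a time.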

Log‑Collection Bottlenecks: Issues included trace data interruption, limited asynchronous queue length, single‑threaded log reading, and synchronous Kafka sending. Optimizations added a longer async queue, batch reading, and asynchronous Kafka sends, boosting throughput to 80‑100 billion records per minute.
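The three collector-side optimizations (longer async queue, batch reads, async sends) can be sketched with only the JDK's concurrency primitives. This is a minimal illustration, not Qunar's Flume patch; the class and method names are invented, and the real system hands batches to a Kafka producer rather than a `CompletableFuture`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Sketch of the collector pattern: a long bounded queue absorbs bursts,
// the reader drains records in batches instead of one at a time, and
// each batch is handed off asynchronously so the reader never blocks
// on a per-record synchronous send.
public class BatchedAsyncSender {
    private final BlockingQueue<String> queue;
    private final int batchSize;

    public BatchedAsyncSender(int queueCapacity, int batchSize) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.batchSize = batchSize;
    }

    // Producer side: non-blocking offer; a full queue signals back-pressure.
    public boolean offer(String record) {
        return queue.offer(record);
    }

    // Consumer side: drainTo moves up to batchSize records under one lock
    // acquisition, removing the single-record read bottleneck.
    public List<String> drainBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }

    // Asynchronous hand-off: the caller continues immediately (here the
    // "send" is simulated by returning the batch size).
    public CompletableFuture<Integer> sendAsync(List<String> batch) {
        return CompletableFuture.supplyAsync(batch::size);
    }
}
```

The key property is that the read loop and the send path are decoupled: a slow broker delays the futures, not the log reader.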

Kafka Cluster Issues: Under high concurrency, connection failures and thread starvation occurred due to exhausted network connections and overloaded request processors. Removing problematic nodes and increasing memory restored stability, achieving over 170 million messages per minute.
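The article does not list the concrete broker settings that were changed, so as a hedged illustration, these are the standard Kafka broker knobs that govern the two resources named above, network-connection handling and request-processor capacity. The values are placeholders, not Qunar's actual configuration:

```properties
# server.properties -- illustrative values only
# Threads that accept and multiplex client connections; exhaustion
# here surfaces as the connection failures described above.
num.network.threads=8
# Request-handler (I/O) threads; starvation here overloads the
# request processors.
num.io.threads=16
# Requests allowed to queue before network threads stop reading,
# i.e. the back-pressure point between the two pools.
queued.max.requests=1000
```

Broker heap is raised separately, via the `KAFKA_HEAP_OPTS` environment variable (e.g. `-Xmx8g`) rather than `server.properties`.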

Flink Task Optimization: The massive data flow, on the order of millions of QPS, required careful task splitting to avoid back‑pressure. Strategies included balancing consumption across sub‑tasks, ensuring sufficient memory for each operator, replacing windows with in‑memory maps, and using slot‑sharing groups to reduce network overhead. After tuning, average write throughput reached 4 million QPS with ~600 ms latency.

Value of APM and Trace Data: APM helps untangle complex service topologies and pinpoint problems more precisely than pure monitoring. By correlating trace and metric data—using time, count, and rate‑based sampling strategies—developers can quickly locate anomalies, reduce debugging time, and improve system reliability.
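The three sampling strategies mentioned above (time-, count-, and rate-based) can each be expressed in a few lines. The class below is a generic sketch of those well-known techniques, not an actual Qunar API:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Minimal implementations of the three trace-sampling strategies.
public class Samplers {
    // Rate-based: keep a fixed fraction of traces.
    public static boolean sampleByRate(double rate) {
        return ThreadLocalRandom.current().nextDouble() < rate;
    }

    // Count-based: keep every Nth trace.
    private final AtomicLong counter = new AtomicLong();
    public boolean sampleEveryNth(long n) {
        return counter.incrementAndGet() % n == 0;
    }

    // Time-based: keep at most one trace per interval.
    private long lastSampleMillis = 0L;
    public synchronized boolean sampleByInterval(long intervalMillis, long nowMillis) {
        if (nowMillis - lastSampleMillis >= intervalMillis) {
            lastSampleMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

In practice the strategies are combined: rate-based sampling bounds storage cost, while count- and time-based rules guarantee that even low-traffic endpoints still produce representative traces to correlate with metrics.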

Conclusion and Outlook: The three core components—log collection, transmission link management, and Flink task optimization—address high‑volume, high‑concurrency challenges. Continuous exploration of trace data value will further enhance performance, user experience, and risk mitigation for large‑scale distributed systems.

Tags: performance optimization, Big Data, Flink, APM, Kafka, Distributed Tracing
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
