
How Qunar Built a Scalable Distributed Tracing System for Cloud‑Native Observability

This article details Qunar's end‑to‑end design and implementation of a distributed tracing platform, covering background, technology selection, architecture, data flow, performance bottlenecks, and concrete solutions such as Flume tuning, Kafka scaling, Flink back‑pressure handling, and JavaAgent instrumentation to achieve high trace connectivity and low failure rates.

dbaplus Community

Background

Large‑scale distributed systems require unified observability to detect faults, analyse performance and verify health. Qunar already operated Watcher (metrics & alerts), Radar (predictive alerts) and an ELK‑based logging platform, but lacked a distributed tracing layer.

Technical selection

Observability was split into three pillars:

Monitoring : Prometheus + Grafana.

Logging : ELK for critical logs, Loki for high‑volume streams, with Elasticsearch/ClickHouse for hot data and HDFS for cold data.

Tracing : evaluated Apache SkyWalking and Jaeger; both can be used with Java-agent instrumentation, which suits Qunar’s Java-centric stack (Qunar also runs Python and Go services).

Architecture design

Trace injection is performed by a custom middleware layer; open‑source Java agents collect trace data. The data pipeline is:

Agents write trace logs.

Apache Flume (customised for line‑level collection, log rotation and per‑application config) ships logs to Kafka.

Kafka buffers the stream.

Flink consumes the Kafka topic, aggregates spans, builds call topologies and computes failure metrics.

Aggregated results are persisted in HBase (trace details) and MySQL (metadata).

A React‑based UI queries HBase/MySQL to visualise topologies, trace chains, error rates and slow spans.
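The aggregation step that Flink performs in this pipeline can be pictured with a small, framework-free sketch. The Span class, its field names and the two helper methods are assumptions made for illustration; they are not Qunar's schema or job code, and a production job would run equivalent logic inside Flink operators.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopologySketch {
    /** One unit of work inside a trace; the field names are illustrative. */
    static final class Span {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span
        final String service;
        final boolean failed;

        Span(String traceId, String spanId, String parentSpanId, String service, boolean failed) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.service = service;
            this.failed = failed;
        }
    }

    /** Counts caller -> callee edges by joining each span to its parent within the same trace. */
    static Map<String, Integer> buildTopology(List<Span> spans) {
        Map<String, Span> byId = new HashMap<>();
        for (Span s : spans) {
            byId.put(s.traceId + "/" + s.spanId, s);
        }
        Map<String, Integer> edges = new TreeMap<>();
        for (Span s : spans) {
            if (s.parentSpanId == null) {
                continue; // root span has no caller
            }
            Span parent = byId.get(s.traceId + "/" + s.parentSpanId);
            if (parent == null) {
                continue; // parent span never arrived: an interrupted trace
            }
            edges.merge(parent.service + " -> " + s.service, 1, Integer::sum);
        }
        return edges;
    }

    /** Fraction of each service's spans that failed. */
    static Map<String, Double> failureRates(List<Span> spans) {
        Map<String, int[]> counts = new TreeMap<>(); // service -> [failed, total]
        for (Span s : spans) {
            int[] c = counts.computeIfAbsent(s.service, k -> new int[2]);
            if (s.failed) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> rates = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            rates.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return rates;
    }

    /** Two small traces: gateway calls order, and in the first trace order calls payment. */
    static List<Span> sample() {
        return Arrays.asList(
            new Span("t1", "1", null, "gateway", false),
            new Span("t1", "2", "1", "order", false),
            new Span("t1", "3", "2", "payment", true),
            new Span("t2", "1", null, "gateway", false),
            new Span("t2", "2", "1", "order", false));
    }

    public static void main(String[] args) {
        System.out.println(buildTopology(sample()));
        System.out.println(failureRates(sample()));
    }
}
```

A span whose parent is missing is skipped rather than guessed at, which is why lost spans upstream show up as gaps in the topology.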

Data flow

Trace logging & reporting – agents generate trace logs; Flume ships them to Kafka.

Log upload – Kafka transports logs to Flink, which aggregates topology and failure metrics.

Monitoring reporting – Watcher agents sample metrics at trace points, linking metrics to traces for alerting.

UI display – HBase and MySQL store trace and topology data; the UI queries them for error rates, slow spans, etc.

Related‑log display – ELK indexes raw logs; trace IDs in log headers enable fast correlation.
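As a rough illustration of the log correlation step, the sketch below puts a trace ID at the head of each log line and uses it to pull out the lines that belong to one trace. The "[traceId=...]" header format and all method names are assumptions for this example; the article does not specify the actual header layout, and in ELK the ID would typically be an indexed field rather than a regex match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogCorrelationSketch {
    // Hypothetical header format: "[traceId=<id>] <message>"
    private static final Pattern HEADER = Pattern.compile("^\\[traceId=([^\\]]+)\\] ");

    /** Prefixes a log message with the trace ID of the current request. */
    static String withTraceId(String traceId, String message) {
        return "[traceId=" + traceId + "] " + message;
    }

    /** Returns the trace ID from a log line, or null when the line has no header. */
    static String extractTraceId(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Selects the raw log lines that belong to one trace, preserving their order. */
    static List<String> linesForTrace(List<String> lines, String traceId) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (traceId.equals(extractTraceId(line))) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
            withTraceId("t1", "order created"),
            withTraceId("t2", "order created"),
            "line without a trace header",
            withTraceId("t1", "payment failed"));
        System.out.println(linesForTrace(logs, "t1"));
    }
}
```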

Issues and solutions

1. Trace interruption – Flume

Flume could not keep up with the write rate, causing memory‑queue overflow and OOM, with interruption rates up to 80 %.

Enlarge the Flume memory queue.

Make the sink asynchronous.

Increase TailRead concurrency.

Apply cgroup limits to cap CPU and memory usage.

These changes reduced the interruption rate to ~20 %.
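The effect of the first two fixes can be sketched with a bounded in-memory queue and a sink running on its own thread. This is not Flume's channel or sink implementation; the class, the method names and the sentinel value are assumptions chosen to show why a larger queue absorbs bursts and why an asynchronous sink keeps the reader from stalling.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedChannelSketch {
    private static final String POISON = "__stop__"; // sentinel that tells the sink to exit

    /** Pushes a burst into a bounded queue with no consumer and returns how many lines were dropped. */
    static int dropsForBurst(int capacity, int burst) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < burst; i++) {
            // offer() never blocks: a full queue rejects the line instead of growing the heap
            if (!queue.offer("line-" + i)) {
                dropped++;
            }
        }
        return dropped;
    }

    /** Runs the sink on its own thread, so the producer only pays for an enqueue. */
    static int deliverAsync(int capacity, int lines) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int[] delivered = {0};
        Thread sink = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();
                    if (line.equals(POISON)) {
                        return;
                    }
                    delivered[0]++; // stand-in for sending the line to Kafka
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sink.start();
        try {
            for (int i = 0; i < lines; i++) {
                queue.put("line-" + i); // waits when the queue is full instead of dropping
            }
            queue.put(POISON);
            sink.join(); // join() makes the sink's counter visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return delivered[0];
    }

    public static void main(String[] args) {
        System.out.println("dropped with capacity 100:  " + dropsForBurst(100, 1000));
        System.out.println("dropped with capacity 1000: " + dropsForBurst(1000, 1000));
        System.out.println("delivered asynchronously:   " + deliverAsync(100, 1000));
    }
}
```

The queue stays bounded in both cases, which is what keeps memory use predictable; the trade-off is that a queue that is too small either drops lines or slows the producer down.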

2. Trace interruption – Kafka

Kafka suffered back‑pressure: idle thread count dropped, network connections fell and write latency spiked.

Increase the number of partitions.

Replace slow HDDs with SSDs.

Scale the number of Processor threads.
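Why more partitions relieve a hot topic can be seen in a toy model of keyed partitioning. The sketch below uses String.hashCode for brevity, which is not the hash Kafka's default partitioner uses, and keying by trace ID is an assumption; the article does not say how Qunar keys its messages.

```java
class PartitionSpreadSketch {
    /** Maps a key to a partition; floorMod keeps the result non-negative for negative hashes. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Counts how many of the generated messages land on each partition. */
    static int[] load(int messages, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < messages; i++) {
            counts[partitionFor("trace-" + i, numPartitions)]++;
        }
        return counts;
    }

    /** Returns the message count of the busiest partition. */
    static int max(int[] values) {
        int best = 0;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("busiest of 4 partitions:  " + max(load(10_000, 4)));
        System.out.println("busiest of 16 partitions: " + max(load(10_000, 16)));
    }
}
```

The busiest partition carries fewer messages as the partition count grows, so each broker disk and consumer has less to absorb, provided the keys are spread reasonably evenly.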

3. Trace interruption – Flink back‑pressure

Flink processed ~3 M QPS; upstream producers outpaced downstream operators, causing buffer overflow.

Balance sub‑task load across the cluster.

Increase JVM heap for Flink operators.

Replace window aggregation with aggregation in an in-memory map.

Compress data between operators.

Place operators in a common slot sharing group (Flink’s slotSharingGroup) so that they share a JVM and exchange less data over the network.
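One way to read the map-based aggregation fix is that each event is folded into a per-key counter as soon as it arrives, so the operator holds one small entry per key rather than a buffer of raw spans. The sketch below is an interpretation under that assumption, written without Flink; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MapAggregatorSketch {
    // key (for example a caller -> callee edge) -> [calls, failures]
    private final Map<String, long[]> acc = new HashMap<>();

    /** Folds one event into the running counters; nothing about the raw event is retained. */
    void add(String key, boolean failed) {
        long[] c = acc.computeIfAbsent(key, k -> new long[2]);
        c[0]++;
        if (failed) {
            c[1]++;
        }
    }

    /** Number of keys currently held, which bounds the memory this aggregator needs. */
    int size() {
        return acc.size();
    }

    /** Emits the current counters and resets state; intended to be called on a timer. */
    Map<String, long[]> flush() {
        Map<String, long[]> out = new TreeMap<>(acc);
        acc.clear();
        return out;
    }

    public static void main(String[] args) {
        MapAggregatorSketch agg = new MapAggregatorSketch();
        for (int i = 0; i < 1_000; i++) {
            agg.add("gateway -> order", i % 10 == 0); // every tenth call fails
            agg.add("order -> payment", false);
        }
        System.out.println("keys held: " + agg.size());
        for (Map.Entry<String, long[]> e : agg.flush().entrySet()) {
            System.out.println(e.getKey() + " calls=" + e.getValue()[0] + " failures=" + e.getValue()[1]);
        }
    }
}
```

State size here depends on the number of distinct keys, not on the event rate, which is the property that matters at millions of events per second.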

4. Trace connectivity loss

Cross‑process or cross‑thread context loss broke trace graphs.

Add hand-written adapters for the RPC and messaging frameworks used internally (Dubbo, QMQ, etc.) and for common HTTP clients.

Deploy a Java‑Agent (QTraceAgent) that automatically propagates trace context across threads (Runnable, Callable, Future, RxJava, Reactor, etc.).

Sample instrumentation:

// Explicit wrapper: QTraceSupplier carries the caller's trace context into the async task.
CompletableFuture<Integer> future = CompletableFuture.supplyAsync(new QTraceSupplier<>(() -> {
    LOG.info("supplyAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
    return 1;
}));
Integer i = future.get();
LOG.info(String.valueOf(i));

// QTracer.wrap does the same for a Runnable passed to runAsync.
CompletableFuture<Void> future1 = CompletableFuture.runAsync(QTracer.wrap(() -> {
    LOG.info("runAsync------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));
future1.get();

// A wrapped lambda submitted to an executor.
executor.submit(QTracer.wrap(() -> {
    LOG.info("in lambda------" + QTraceClientGetter.getClient().getCurrentTraceId());
}));

// No wrapper here: with QTraceAgent attached, the context is expected to propagate automatically.
executor.submit(new Runnable() {
    @Override
    public void run() {
        LOG.info("in runnable------" + QTraceClientGetter.getClient().getCurrentTraceId());
    }
});
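The underlying problem is that a thread-local trace ID does not follow a task onto a pool thread. The sketch below shows one common way a wrapper can fix that: capture the ID when the task is created and reinstate it when the task runs. It is a minimal illustration with hypothetical names, not the implementation of QTracer.wrap or QTraceAgent; an agent would presumably apply an equivalent step through bytecode instrumentation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ContextPropagationSketch {
    // Hypothetical stand-in for the tracer's per-thread context
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static String currentTraceId() {
        return TRACE_ID.get();
    }

    /** Captures the submitter's trace ID now and reinstates it on whichever thread runs the task. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's old state so pooled threads do not leak context
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Returns the trace ID seen by an unwrapped task and by a wrapped task, in that order. */
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TRACE_ID.set("trace-42");
        try {
            Callable<String> read = ContextPropagationSketch::currentTraceId;
            String plain = pool.submit(read).get();
            String wrapped = pool.submit(wrap(read)).get();
            return new String[] {plain, wrapped};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("unwrapped task saw: " + seen[0]);
        System.out.println("wrapped task saw:   " + seen[1]);
    }
}
```

The unwrapped task sees no trace ID because the pool thread has its own thread-local state, which is the kind of break in the trace graph this section describes.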

Performance impact of the Java‑Agent

Benchmarks on production workloads showed:

For HTTP requests taking longer than 50 ms, throughput dropped by at most 4 % and latency rose by at most 4 %.

For cross-thread scenarios, throughput dropped by about 3 % and latency rose by about 3 %.

These overheads were acceptable, while trace connectivity improved from ~20 % to > 80 %.

Conclusion

Qunar's tracing work covered technology selection, architecture, the data pipeline and concrete operational fixes. It shows how metric-driven analysis and targeted optimisations can resolve high-volume trace loss, back-pressure and connectivity problems in a cloud-native environment.
