Design and Implementation of Trace2.0 Distributed Tracing Platform
Trace2.0 is an OpenTelemetry-based distributed tracing platform that collects billions of spans daily. Data flows from client SDKs, configured through a control plane, into the OTel Server and Kafka, and then into ClickHouse hot and cold clusters with tail sampling. The self-built storage cut costs by 66%, reaches a 12× compression ratio, and answers queries in under a second. The team plans to offload raw spans to object storage.
Distributed tracing is a key technique for observability in large‑scale applications. Trace2.0, built on the OpenTelemetry standard, provides a one‑stop platform for end‑to‑end trace collection, analysis, and diagnosis.
The system ingests hundreds of terabytes of trace data daily (billions of spans) and must process and query this data in real time at low cost. The overall architecture consists of client SDKs, a control plane, an OTel Server for data ingestion, and OTel Storage for computation and persistence.
Key components:
Client & Data Collection: Multi‑language OpenTelemetry SDKs generate unified observability data.
Control Plane: Central configuration service pushes dynamic settings to collectors.
OTel Server: Receives data via gRPC/HTTP (OTLP) and forwards it to Kafka.
OTel Storage: Real‑time indexing, span metrics aggregation, business‑order linking, Redis/MySQL hotspot statistics, and writes to ClickHouse hot and cold clusters.
Tail Sampling & Hot‑Cold Storage
To avoid storing every trace long term, Trace2.0 keeps full traces for 3 days (hot data) and retains only sampled traces for 30 days (cold data). The cold path relies on delayed Kafka consumption and a Bloom filter. The Bloom filter is sharded by ten-minute windows, serialized to ClickHouse, and merged across OTel Server nodes.
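The windowed Bloom filter can be sketched as follows. This is a minimal illustration, not the Trace2.0 implementation: the filter size, the SHA-256 based hashing, and the names `BloomFilter`, `WindowedSampler`, and `merge_windows` are all assumptions. It only shows the three properties the text mentions: one filter per ten-minute window, a byte serialization that could be stored in a ClickHouse column, and merging of per-node filters by bitwise OR.

```python
import hashlib

WINDOW_SECONDS = 600  # ten-minute shards, as described in the text


class BloomFilter:
    """Small Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4, bits=0):
        self.num_bits = num_bits      # assumed size, not from the source
        self.num_hashes = num_hashes  # assumed hash count, not from the source
        self.bits = bits

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests (illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def merge(self, other):
        # Filters built with identical parameters merge by bitwise OR.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)

    def to_bytes(self):
        # Fixed-width byte string, suitable for storing in a String column.
        return self.bits.to_bytes(self.num_bits // 8, "big")

    @classmethod
    def from_bytes(cls, data, num_hashes=4):
        return cls(len(data) * 8, num_hashes, int.from_bytes(data, "big"))


def window_start(epoch_seconds):
    """Start of the ten-minute window containing the given timestamp."""
    return epoch_seconds - epoch_seconds % WINDOW_SECONDS


class WindowedSampler:
    """Per-node record of sampled trace IDs, one Bloom filter per window."""

    def __init__(self):
        self.windows = {}

    def mark_sampled(self, trace_id, epoch_seconds):
        start = window_start(epoch_seconds)
        self.windows.setdefault(start, BloomFilter()).add(trace_id)


def merge_windows(samplers):
    """Combine the per-window filters of several nodes into one view."""
    merged = {}
    for sampler in samplers:
        for start, bf in sampler.windows.items():
            merged[start] = merged[start].merge(bf) if start in merged else bf
    return merged
```

In this sketch, a delayed consumer would look up the merged filter for a span's window and write the span to the cold cluster only if its trace ID is present. A Bloom filter can report false positives but never false negatives, so a few unsampled traces may be kept while no sampled trace is dropped.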
Trace IDs embed a hexadecimal timestamp, enabling fast routing of queries to the appropriate hot or cold cluster.
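A sketch of how a timestamp inside the trace ID could drive routing is shown below. The exact ID layout is not given in the text, so this assumes the first 8 hexadecimal characters hold the epoch time in seconds and the remaining 24 are random, which yields the 32-character IDs stored in `span_data`. The function names and the "hot"/"cold" labels are illustrative.

```python
import secrets
import time

HOT_RETENTION_SECONDS = 3 * 24 * 3600  # the hot cluster keeps full traces for 3 days


def new_trace_id(now=None):
    """Build a 32-char hex trace ID: 8 hex chars of epoch seconds + 24 random."""
    ts = int(time.time() if now is None else now)
    return f"{ts:08x}" + secrets.token_hex(12)


def trace_timestamp(trace_id):
    """Recover the creation time (epoch seconds) from the assumed prefix."""
    return int(trace_id[:8], 16)


def pick_cluster(trace_id, now=None):
    """Route a query by trace age without touching either cluster first."""
    now = int(time.time() if now is None else now)
    age = now - trace_timestamp(trace_id)
    return "hot" if age <= HOT_RETENTION_SECONDS else "cold"
```

Because the age is decoded from the ID itself, the query layer does not need to probe the hot cluster and fall back to the cold one on a miss.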
Self‑Built Storage & Cost Reduction
Initially using Alibaba Cloud SLS‑Trace, the team migrated to a ClickHouse‑based solution to cut storage costs by 66% and improve query latency (from >800 ms to ~490 ms). ClickHouse tables store both trace indexes and raw span data, with materialized views aggregating 30‑second metrics into 10‑minute intervals.
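The rollup from 30-second buckets to 10-minute buckets can be sketched outside ClickHouse to show why it is cheap: count, sum, max, and min can all be combined from partial results. The row shape mirrors the columns of the metrics tables, while the function names are illustrative.

```python
def ten_minute_bucket(epoch_seconds):
    """Start of the ten-minute interval, like toStartOfTenMinutes on a timestamp."""
    return epoch_seconds - epoch_seconds % 600


def roll_up(rows_30s):
    """Aggregate 30-second metric rows into 10-minute rows per dimension key."""
    out = {}
    for row in rows_30s:
        key = (
            row["serviceName"],
            row["spanName"],
            row["kind"],
            row["statusCode"],
            ten_minute_bucket(row["timeBucket"]),
        )
        agg = out.get(key)
        if agg is None:
            out[key] = {
                "count": row["count"],
                "timeSum": row["timeSum"],
                "timeMax": row["timeMax"],
                "timeMin": row["timeMin"],
            }
        else:
            agg["count"] += row["count"]
            agg["timeSum"] += row["timeSum"]
            agg["timeMax"] = max(agg["timeMax"], row["timeMax"])
            agg["timeMin"] = min(agg["timeMin"], row["timeMin"])
    return out
```

An average latency is not stored. It can be derived at query time as `timeSum / count`, which stays correct no matter how many partial rows were merged.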
-- span_metrics_10m_mv
CREATE MATERIALIZED VIEW IF NOT EXISTS '{database}'.span_metrics_10m_mv_local
on cluster '{cluster}'
TO '{database}'.span_metrics_10m_local
AS
SELECT a.serviceName as serviceName,
a.spanName as spanName,
a.kind as kind,
a.statusCode as statusCode,
toStartOfTenMinutes(a.timeBucket) as timeBucket,
sum(a.count) as count,
sum(a.timeSum) as timeSum,
max(a.timeMax) as timeMax,
min(a.timeMin) as timeMin
FROM '{database}'.span_metrics_30s_local as a
GROUP BY a.serviceName, a.spanName, a.kind, a.statusCode,
toStartOfTenMinutes(a.timeBucket);
The span_data table is defined as:
-- span_data
CREATE TABLE IF NOT EXISTS '{database}'.span_data_local ON CLUSTER '{cluster}'
(
traceID FixedString(32),
spanID FixedString(16),
startTime DateTime64(6) Codec (Delta, Default),
body String CODEC (ZSTD(3))
) ENGINE = MergeTree
ORDER BY (traceID,startTime,spanID)
PARTITION BY toStartOfTenMinutes(startTime)
TTL toDate(startTime) + INTERVAL '{TTL}' HOUR;
-- span_data_distributed
CREATE TABLE IF NOT EXISTS '{database}'.span_data_all ON CLUSTER '{cluster}'
as '{database}'.span_data_local
ENGINE = Distributed('{cluster}', '{database}', span_data_local,
xxHash64(concat(traceID,spanID,toString(toDateTime(startTime,6)))));
Write flow: discover ClickHouse nodes, hash spans to a node, and batch-write to the local tables. The system achieves a compression ratio of 12× with ZSTD, writes up to 250k rows/s per node, and reduces storage cost dramatically.
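The write flow can be sketched as a client-side sharding and batching step. This is an illustration under stated assumptions: the node list stands in for whatever discovery mechanism is used, `send` is a hypothetical callback that would perform the insert into a node's local table, and BLAKE2b replaces xxHash64 because the latter is not in the Python standard library.

```python
import hashlib
from collections import defaultdict


def shard_for(span, num_nodes):
    """Pick a node index from the same fields the sharding key uses."""
    key = f'{span["traceID"]}{span["spanID"]}{span["startTime"]}'.encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()  # stand-in for xxHash64
    return int.from_bytes(digest, "big") % num_nodes


class BatchWriter:
    """Buffers spans per node and hands full batches to a send callback."""

    def __init__(self, nodes, send, batch_size=1000):
        self.nodes = nodes            # assumed to come from node discovery
        self.send = send              # hypothetical: send(node, list_of_spans)
        self.batch_size = batch_size  # assumed value, tune per deployment
        self.pending = defaultdict(list)

    def write(self, span):
        node = self.nodes[shard_for(span, len(self.nodes))]
        batch = self.pending[node]
        batch.append(span)
        if len(batch) >= self.batch_size:
            self.flush(node)

    def flush(self, node=None):
        """Send one node's buffered spans, or every node's if none is given."""
        targets = [node] if node is not None else list(self.pending)
        for target in targets:
            batch = self.pending.pop(target, [])
            if batch:
                self.send(target, batch)
```

Writing large batches directly to the local tables avoids the extra hop through the distributed table. A real writer would also flush on a timer so that spans do not wait indefinitely on a quiet node.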
Future Work
To further lower storage pressure, raw span data will be offloaded to object storage (HDFS/OSS) while ClickHouse keeps only offsets.
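One way such an offset index could work is sketched below. Everything here is hypothetical, since the text only states the goal: span bodies are appended to a large object, and the row kept in ClickHouse records the object key, byte offset, and length instead of the body. An in-memory buffer stands in for the object store.

```python
import io


class SpanBlobWriter:
    """Appends span bodies to one object and returns index rows for them."""

    def __init__(self, object_key):
        self.object_key = object_key  # hypothetical object path
        self.buffer = io.BytesIO()    # stand-in for an HDFS/OSS upload stream

    def append(self, trace_id, span_id, body):
        offset = self.buffer.tell()
        self.buffer.write(body)
        # This small row is what ClickHouse would keep instead of the body.
        return {
            "traceID": trace_id,
            "spanID": span_id,
            "objectKey": self.object_key,
            "offset": offset,
            "length": len(body),
        }

    def getvalue(self):
        return self.buffer.getvalue()


def read_span(blob, index_row):
    """Fetch one span body from the object using its index row."""
    start = index_row["offset"]
    return blob[start:start + index_row["length"]]
```

With a real object store, `read_span` would become a ranged read of `length` bytes at `offset`, so a trace lookup only fetches the spans it needs.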
DeWu Technology
A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.
