Design and Implementation of Trace2.0 Distributed Tracing Platform

Trace2.0 is an OpenTelemetry‑based distributed tracing platform that collects billions of spans daily. Data flows from client SDKs through an OTel Server and Kafka into ClickHouse hot and cold clusters, with a control plane managing configuration and tail sampling deciding which traces are retained long term. The self‑built storage cut costs by 66%, reached 12× compression, and brought query latency below one second. The team plans to offload raw spans to object storage next.

DeWu Technology

Sep 2, 2022

Distributed tracing is a key technique for observability in large‑scale applications. Trace2.0, built on the OpenTelemetry standard, provides a one‑stop platform for end‑to‑end trace collection, analysis, and diagnosis.

The system ingests hundreds of terabytes of trace data daily (billions of spans) and must process and query this data in real time at low cost. The overall architecture consists of client SDKs, a control plane, an OTel Server for data ingestion, and OTel Storage for computation and persistence.

Key components:

Client & Data Collection: Multi‑language OpenTelemetry SDKs generate unified observability data.

Control Plane: Central configuration service pushes dynamic settings to collectors.

OTel Server: Receives data via gRPC/HTTP (OTLP) and forwards it to Kafka.

OTel Storage: Real‑time indexing, span metrics aggregation, business‑order linking, Redis/MySQL hotspot statistics, and writes to ClickHouse hot and cold clusters.

Tail Sampling & Hot‑Cold Storage

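To avoid storing every trace for the full retention period, Trace2.0 keeps all traces for 3 days (hot data) and only sampled traces for 30 days (cold data). The cold path combines delayed Kafka consumption with a Bloom filter: because spans are consumed late, the sampling decision for a trace can be made before its spans are written to cold storage, and the Bloom filter records which trace IDs were sampled. The Bloom filter is sharded into ten‑minute windows, serialized to ClickHouse, and merged across OTel Server nodes.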
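The windowed Bloom filter can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the Trace2.0 implementation: the filter size, the hashing scheme, the class names, and the choice to key each window by the trace's start timestamp.

```python
import hashlib

WINDOW_SECONDS = 600  # ten-minute shards, as described above


class BloomFilter:
    """A small Bloom filter backed by a bytearray (sizes are illustrative)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive the bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def merge(self, other: "BloomFilter") -> None:
        # The union of two filters with identical parameters is a bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share parameters")
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))

    def to_bytes(self) -> bytes:
        # Serialized form that could be stored as a blob (e.g. in ClickHouse).
        return bytes(self.bits)


def window_start(ts_seconds: int) -> int:
    """Start of the ten-minute window containing ts_seconds."""
    return ts_seconds - ts_seconds % WINDOW_SECONDS


class WindowedSampler:
    """Hypothetical helper: one Bloom filter per ten-minute window."""

    def __init__(self):
        self.filters = {}  # window start -> BloomFilter

    def mark_sampled(self, trace_id: str, ts_seconds: int) -> None:
        window = window_start(ts_seconds)
        self.filters.setdefault(window, BloomFilter()).add(trace_id)

    def keep_for_cold(self, trace_id: str, ts_seconds: int) -> bool:
        bloom = self.filters.get(window_start(ts_seconds))
        return bloom is not None and trace_id in bloom
```

A delayed consumer would call `keep_for_cold` for each span and write it to the cold cluster only on a hit. A Bloom filter can report false positives but never false negatives, so a few unsampled traces may be kept while no sampled trace is dropped.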

Trace IDs embed a hexadecimal timestamp, enabling fast routing of queries to the appropriate hot or cold cluster.
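As a rough sketch of how such routing might work, the example below assumes a hypothetical layout in which the first 8 hex characters of a 32‑character trace ID hold the Unix timestamp in seconds. The article does not specify the actual position or resolution of the timestamp, and the function names are illustrative.

```python
import secrets
import time
from typing import Optional

HOT_RETENTION_SECONDS = 3 * 24 * 3600  # the hot cluster keeps full traces for 3 days


def new_trace_id(now: Optional[float] = None) -> str:
    """Build a 32-hex-char trace ID: 8 chars of timestamp + 24 random chars.

    The layout is an assumption made for this example.
    """
    ts = int(time.time() if now is None else now)
    return f"{ts:08x}" + secrets.token_hex(12)


def trace_timestamp(trace_id: str) -> int:
    """Recover the embedded Unix timestamp (seconds) from the ID prefix."""
    return int(trace_id[:8], 16)


def pick_cluster(trace_id: str, now: Optional[float] = None) -> str:
    """Route a query to 'hot' if the trace is within hot retention, else 'cold'."""
    current = time.time() if now is None else now
    age = current - trace_timestamp(trace_id)
    return "hot" if age <= HOT_RETENTION_SECONDS else "cold"
```

Because the timestamp is read directly from the ID, the query layer can choose a cluster without first looking the trace up in an index.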

Self‑Built Storage & Cost Reduction

The team initially used Alibaba Cloud SLS‑Trace and then migrated to a self‑built ClickHouse‑based solution, which cut storage costs by 66% and improved query latency from more than 800 ms to about 490 ms. ClickHouse tables store both trace indexes and raw span data, and materialized views roll 30‑second metrics up into 10‑minute intervals:

-- span_metrics_10m_mv
CREATE MATERIALIZED VIEW IF NOT EXISTS '{database}'.span_metrics_10m_mv_local
    on cluster '{cluster}'
    TO '{database}'.span_metrics_10m_local
AS
SELECT a.serviceName                     as serviceName,
       a.spanName                        as spanName,
       a.kind                            as kind,
       a.statusCode                      as statusCode,
       toStartOfTenMinutes(a.timeBucket) as timeBucket,
       sum(a.count)                      as count,
       sum(a.timeSum)                    as timeSum,
       max(a.timeMax)                    as timeMax,
       min(a.timeMin)                    as timeMin
FROM '{database}'.span_metrics_30s_local as a
GROUP BY a.serviceName, a.spanName, a.kind, a.statusCode,
    toStartOfTenMinutes(a.timeBucket);
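To make the rollup semantics concrete, here is a small Python sketch of the same aggregation. The `MetricRow` type and function names are illustrative; only the column meanings (count, timeSum, timeMax, timeMin) come from the view above.

```python
from dataclasses import dataclass


@dataclass
class MetricRow:
    service_name: str
    span_name: str
    kind: str
    status_code: str
    time_bucket: int  # bucket start, epoch seconds
    count: int
    time_sum: int
    time_max: int
    time_min: int


def to_start_of_ten_minutes(ts: int) -> int:
    """Round an epoch timestamp down to its ten-minute boundary."""
    return ts - ts % 600


def rollup_10m(rows):
    """Merge 30-second rows into 10-minute rows, grouped like the view."""
    merged = {}
    for r in rows:
        key = (r.service_name, r.span_name, r.kind, r.status_code,
               to_start_of_ten_minutes(r.time_bucket))
        cur = merged.get(key)
        if cur is None:
            merged[key] = MetricRow(*key, r.count, r.time_sum,
                                    r.time_max, r.time_min)
        else:
            cur.count += r.count                           # sum(count)
            cur.time_sum += r.time_sum                     # sum(timeSum)
            cur.time_max = max(cur.time_max, r.time_max)   # max(timeMax)
            cur.time_min = min(cur.time_min, r.time_min)   # min(timeMin)
    return list(merged.values())
```

Storing `count` and `timeSum` instead of an average keeps the metric mergeable: the mean latency of any rolled‑up bucket is `timeSum / count`. Note also that a ClickHouse materialized view runs on each inserted block, so the target table can hold several partial rows per key, and queries generally need to aggregate again at read time.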

The span_data table is defined as:

-- span_data
CREATE TABLE IF NOT EXISTS '{database}'.span_data_local ON CLUSTER '{cluster}'
(
    traceID   FixedString(32),
    spanID    FixedString(16),
    startTime DateTime64(6) Codec (Delta, Default),
    body      String CODEC (ZSTD(3))
) ENGINE = MergeTree
ORDER BY (traceID,startTime,spanID)
PARTITION BY toStartOfTenMinutes(startTime)
TTL toDate(startTime) + INTERVAL '{TTL}' HOUR;

-- span_data_distributed
CREATE TABLE IF NOT EXISTS '{database}'.span_data_all ON CLUSTER '{cluster}'
as '{database}'.span_data_local
    ENGINE = Distributed('{cluster}', '{database}', span_data_local,
                         xxHash64(concat(traceID,spanID,toString(toDateTime(startTime,6)))));

Write flow: the writer discovers the ClickHouse nodes, hashes each span to a node, and batch‑writes to that node's local table. With ZSTD the system achieves a 12× compression ratio and sustains up to 250k rows/s per node, which contributes to the 66% storage cost reduction noted above.
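The write flow can be sketched as a client‑side sharded batch writer. This is an illustration under stated assumptions: the class, the `flush_fn` callback, and the node names are hypothetical, and BLAKE2b from the standard library stands in for the xxHash64 used in the Distributed table definition.

```python
import hashlib
from collections import defaultdict


class ShardedBatchWriter:
    """Buffer spans per node and flush each buffer once it reaches batch_size."""

    def __init__(self, nodes, batch_size, flush_fn):
        self.nodes = list(nodes)        # discovered ClickHouse nodes
        self.batch_size = batch_size
        self.flush_fn = flush_fn        # hypothetical callback: flush_fn(node, rows)
        self.buffers = defaultdict(list)

    def node_for(self, span) -> str:
        # Hash the same fields as the sharding expression above.
        # BLAKE2b is a stand-in here; the real system uses xxHash64.
        key = f"{span['traceID']}{span['spanID']}{span['startTime']}".encode()
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def write(self, span) -> None:
        node = self.node_for(span)
        self.buffers[node].append(span)
        if len(self.buffers[node]) >= self.batch_size:
            self.flush(node)

    def flush(self, node) -> None:
        rows = self.buffers.pop(node, [])
        if rows:
            self.flush_fn(node, rows)   # one batched insert into the local table

    def flush_all(self) -> None:
        for node in list(self.buffers):
            self.flush(node)
```

Writing large batches directly to local tables avoids routing every insert through the Distributed table, and it suits ClickHouse, which generally performs better with fewer, larger inserts.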

Future Work

To further lower storage pressure, raw span data will be offloaded to object storage (HDFS/OSS) while ClickHouse keeps only offsets.
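One way such an offset index could look is sketched below. Since this is planned work, the design is purely an assumption: an in‑memory `BlobStore` stands in for HDFS/OSS, and `SpanLocation` is a hypothetical record that ClickHouse would store in place of the span body.

```python
import io
from typing import NamedTuple


class SpanLocation(NamedTuple):
    """What the index would keep per span instead of the raw body."""
    blob: str
    offset: int
    length: int


class BlobStore:
    """In-memory stand-in for an object store such as HDFS or OSS."""

    def __init__(self):
        self._blobs = {}  # blob name -> io.BytesIO

    def append(self, blob: str, body: bytes) -> SpanLocation:
        buf = self._blobs.setdefault(blob, io.BytesIO())
        offset = buf.seek(0, io.SEEK_END)  # append at the current end
        buf.write(body)
        return SpanLocation(blob, offset, len(body))

    def read(self, loc: SpanLocation) -> bytes:
        # A real object store would serve this with a ranged read.
        buf = self._blobs[loc.blob]
        buf.seek(loc.offset)
        return buf.read(loc.length)
```

With this layout a trace lookup becomes two steps: query the index for the locations of the trace's spans, then fetch only those byte ranges from object storage.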
