How Trace2.0 Cuts Tracing Costs by 66% with Tail Sampling and ClickHouse
This article details the design of Trace2.0, a next‑generation distributed tracing platform built on OpenTelemetry, covering its end‑to‑end architecture, tail sampling with hot‑cold storage, Bloom‑filter implementation, and a self‑built ClickHouse storage layer that reduces storage costs by two‑thirds while improving query performance.
Overall Architecture
Trace2.0 is a one‑stop observability platform that collects 100 TB of trace data daily (hundreds of billions of spans) and provides real‑time processing and efficient querying. The system integrates custom OpenTelemetry SDKs on the client side, a unified control plane for dynamic configuration, an OTel Server that ingests data via OTLP (gRPC/HTTP), and an OTel Storage component for analysis and persistence.
Key capabilities of each component:
Client & Data Collection: Multi-language OpenTelemetry agents generate unified observability data.
Control Plane: Central configuration distributes dynamic flags (e.g., request collection, performance profiling, traffic coloring) and supports gray-release per instance.
OTel Server: Receives data via OTLP over both gRPC and HTTP.
OTel Storage: Provides real-time search and scenario-specific analytics, including trace indexing, span-metric aggregation, business-order linking, and Redis/MySQL hotspot statistics.
Tail Sampling & Hot‑Cold Storage
Early tracing at the company used a 1 % client‑side sampling rate, causing many traces to be missing during debugging. Trace2.0 therefore collects all traces but stores only valuable ones using a tail‑sampling strategy and a two‑tier storage model.
Valuable traces are defined by four high‑priority scenarios:
Errors (ERROR level) in the call chain.
Database calls exceeding 200 ms.
Total call‑chain latency over 1 s.
Business‑specific chains, e.g., those linked by order IDs.
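A minimal Go sketch of such a rule check as it might run on the OTel Server side; the Span struct, field names, and helper function are illustrative assumptions rather than the platform's actual types, but the thresholds mirror the four rules above.

package sampling

import "time"

// Span is a simplified view of the fields the tail-sampling rules need;
// the real platform evaluates OTLP spans, this struct is illustrative only.
type Span struct {
	StatusError bool          // span ended with ERROR status
	IsDBCall    bool          // span represents a database client call
	Duration    time.Duration // span latency
	OrderID     string        // business attribute; empty if absent
}

// KeepTrace reports whether an assembled trace matches any of the four
// "valuable trace" rules: an error anywhere in the chain, a database call
// slower than 200 ms, total chain latency above 1 s, or an order-ID link.
func KeepTrace(spans []Span, totalLatency time.Duration) bool {
	if totalLatency > time.Second {
		return true
	}
	for _, s := range spans {
		if s.StatusError || (s.IsDBCall && s.Duration > 200*time.Millisecond) || s.OrderID != "" {
			return true
		}
	}
	return false
}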
Storage policy:
Hot data: all traces retained for 3 days.
Cold data: traces that match sampling rules (errors, slow spans, custom rules) are kept for 30 days using Kafka delayed consumption and a Bloom filter.
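The cold tier only works if the Bloom filter already contains every TraceID selected by the sampling rules before the cold writer sees a span, hence the delayed re-read of the span topic. A hedged Go sketch of such a delayed consumer using the segmentio/kafka-go client; the topic name, consumer group, delay constant, and the keep/persistCold callbacks are assumptions for illustration.

package cold

import (
	"context"
	"time"

	"github.com/segmentio/kafka-go"
)

const consumeDelay = 30 * time.Minute // matches the 30-minute delay described above

// RunDelayedConsumer re-reads the span topic and only processes messages
// that are at least consumeDelay old, so the Bloom filter is complete
// before cold-tier filtering happens.
func RunDelayedConsumer(ctx context.Context, brokers []string,
	keep func(msg kafka.Message) bool, persistCold func(msg kafka.Message) error) error {

	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		Topic:   "trace-spans",      // assumed topic name
		GroupID: "cold-persistence", // assumed consumer group
	})
	defer r.Close()

	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}
		// Wait until the message is old enough; Kafka preserves order per
		// partition, so later messages are at least as recent.
		if wait := consumeDelay - time.Since(msg.Time); wait > 0 {
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		if keep(msg) { // e.g. TraceID found in the merged Bloom filter
			if err := persistCold(msg); err != nil {
				return err
			}
		}
		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}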
Processing flow:
OTel Server data collection & sampling rules: All traces are written to Kafka; spans that satisfy the defined rules have their TraceID recorded in a Bloom filter.
Hot data persistence: Kafka streams are consumed in real time and stored in a ClickHouse hot cluster.
Cold data persistence: The span topic is re-consumed after a delay, and spans whose TraceID hits the Bloom filter are stored in a ClickHouse cold cluster (the 30-minute delay ensures all spans of a sampled trace have arrived before filtering).
TraceID point query: The TraceID embeds a hex-encoded Unix timestamp (8 bytes); decoding it determines whether the hot or the cold cluster is queried, as sketched below.
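A small Go sketch of the point-query routing step. It assumes the timestamp sits at the front of the TraceID as hex-encoded Unix seconds and that hot data covers the last 3 days; the prefix width, offset, and unit belong to the custom SDK and are assumptions here.

package route

import (
	"fmt"
	"strconv"
	"time"
)

// hotRetention mirrors the 3-day hot tier described above.
const hotRetention = 72 * time.Hour

// traceTime decodes the Unix-second timestamp embedded at the front of the
// TraceID. hexWidth is the number of leading hex characters that carry the
// timestamp; width, offset, and unit are assumptions about the custom SDK's
// TraceID layout, not a documented format.
func traceTime(traceID string, hexWidth int) (time.Time, error) {
	if len(traceID) < hexWidth {
		return time.Time{}, fmt.Errorf("traceID too short: %q", traceID)
	}
	secs, err := strconv.ParseUint(traceID[:hexWidth], 16, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(int64(secs), 0), nil
}

// ClusterFor picks the ClickHouse cluster for a TraceID point lookup:
// hot if the trace started within the hot retention window, cold otherwise.
func ClusterFor(traceID string) (string, error) {
	ts, err := traceTime(traceID, 8) // 8 hex chars of Unix seconds (assumed)
	if err != nil {
		return "", err
	}
	if time.Since(ts) <= hotRetention {
		return "hot", nil
	}
	return "cold", nil
}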
Bloom‑filter design details:
OTel Server writes qualifying TraceIDs into time‑bucketed Bloom filters.
Filters are sharded by 10‑minute windows, serialized, and stored in ClickHouse.
OTel Storage merges filters from all servers to reduce memory usage and improve query efficiency.
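A self-contained Go sketch of the time-bucketed filter idea: each sampled TraceID goes into the Bloom filter of its 10-minute window, per-server filters for the same window can be merged by bitwise OR, and each window's bit set can then be serialized into ClickHouse. The filter below is a minimal double-hashing implementation for illustration; the production sizing, hash functions, and serialization format are not specified in the article.

package bloomindex

import (
	"hash/fnv"
	"time"
)

const (
	window  = 10 * time.Minute // one filter per 10-minute bucket, as above
	numBits = 1 << 24          // illustrative sizing, not the production value
	numHash = 7                // illustrative number of hash probes
)

// Filter is a minimal Bloom filter using double hashing over FNV-1a.
type Filter struct {
	bits []uint64
}

func NewFilter() *Filter { return &Filter{bits: make([]uint64, numBits/64)} }

func hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // derive a second, independent-enough hash
	return h1, h.Sum64()
}

// Add records a TraceID in the filter.
func (f *Filter) Add(traceID string) {
	h1, h2 := hashes(traceID)
	for i := uint64(0); i < numHash; i++ {
		pos := (h1 + i*h2) % numBits
		f.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MightContain reports whether a TraceID may have been added (false
// positives are possible, false negatives are not).
func (f *Filter) MightContain(traceID string) bool {
	h1, h2 := hashes(traceID)
	for i := uint64(0); i < numHash; i++ {
		pos := (h1 + i*h2) % numBits
		if f.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

// Merge ORs another filter of the same size into f; this kind of union is
// how per-server filters for one window can be combined on the query side.
func (f *Filter) Merge(other *Filter) {
	for i := range f.bits {
		f.bits[i] |= other.bits[i]
	}
}

// Buckets maps the start of each 10-minute window to its filter, so a point
// query only needs the filters covering the queried time range.
type Buckets map[time.Time]*Filter

// Record adds a sampled TraceID to the filter of the window that the
// trace's start time falls into.
func (b Buckets) Record(traceID string, startTime time.Time) {
	key := startTime.Truncate(window)
	f, ok := b[key]
	if !ok {
		f = NewFilter()
		b[key] = f
	}
	f.Add(traceID)
}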
Self‑Built Storage & Cost Reduction
Initially Trace2.0 used Alibaba Cloud SLS-Trace, but storage costs grew rapidly. A custom storage solution based on ClickHouse was therefore built, drawing on open-source tracing projects such as SkyWalking, Pinpoint, Uptrace, and Signoz, the latter two of which also use ClickHouse as their backend.
Key storage components:
Trace index & detail data: Span indexes go to the SpanIndex table; raw span data goes to the SpanData table (schema similar to Uptrace's).
SpanMetrics aggregation: Every 30 seconds, metrics such as call count, total time, max/min latency, and percentiles are computed per service, span name, host, status code, etc., and stored in the SpanMetrics table (a sketch of this aggregation follows the list).
Down-sampling: ClickHouse materialized views roll the second-level metrics up into minute- and hour-level aggregates.
Metadata (topology): Upstream/downstream service relationships are written to the Nebula graph database.
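As a concrete illustration of the 30-second pre-aggregation, a hedged Go sketch keyed by the dimensions listed above; the type and field names are assumptions, and percentiles are left out (they would typically need a histogram or digest per key).

package spanmetrics

import (
	"sync"
	"time"
)

// Key holds the aggregation dimensions listed above.
type Key struct {
	Service    string
	SpanName   string
	Host       string
	StatusCode string
}

// Agg accumulates the per-key metrics for one 30-second window.
type Agg struct {
	Count       uint64
	TotalMicros uint64
	MaxMicros   uint64
	MinMicros   uint64
}

// Aggregator collects spans into the current 30-second window.
type Aggregator struct {
	mu   sync.Mutex
	data map[Key]*Agg
}

func NewAggregator() *Aggregator { return &Aggregator{data: make(map[Key]*Agg)} }

// Observe records one finished span.
func (a *Aggregator) Observe(k Key, latency time.Duration) {
	us := uint64(latency.Microseconds())
	a.mu.Lock()
	defer a.mu.Unlock()
	agg, ok := a.data[k]
	if !ok {
		agg = &Agg{MinMicros: us}
		a.data[k] = agg
	}
	agg.Count++
	agg.TotalMicros += us
	if us > agg.MaxMicros {
		agg.MaxMicros = us
	}
	if us < agg.MinMicros {
		agg.MinMicros = us
	}
}

// Flush swaps out the finished window so it can be written to the
// SpanMetrics table, and starts a fresh one; the caller runs it every 30 s.
func (a *Aggregator) Flush() map[Key]*Agg {
	a.mu.Lock()
	defer a.mu.Unlock()
	done := a.data
	a.data = make(map[Key]*Agg)
	return done
}

Each flushed window maps onto one batch of rows for the SpanMetrics table, which the minute- and hour-level materialized views then roll up further.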
ClickHouse table creation (simplified):
-- span_data
CREATE TABLE IF NOT EXISTS {database}.span_data_local ON CLUSTER '{cluster}' (
traceID FixedString(32),
spanID FixedString(16),
startTime DateTime64(6) CODEC(Delta, Default),
body String CODEC(ZSTD(3))
) ENGINE = MergeTree
ORDER BY (traceID, startTime, spanID)
PARTITION BY toStartOfTenMinutes(startTime)
TTL toDate(startTime) + INTERVAL '{TTL}' HOUR;
-- span_data_distributed
CREATE TABLE IF NOT EXISTS {database}.span_data_all ON CLUSTER '{cluster}'
AS {database}.span_data_local
ENGINE = Distributed('{cluster}', '{database}', span_data_local,
xxHash64(concat(traceID, spanID, toString(toDateTime(startTime,6)))));
Write workflow:
Periodically fetch ClickHouse cluster nodes.
Hash each batch to a target node and write it to that node's local table (writes bypass the Distributed table), as sketched below.
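A small Go sketch of this write path, under the assumptions that writers keep a current list of cluster nodes and use some ClickHouse client for the actual batched INSERT (the client call itself is left out); the row fields, hash choice, and package layout are illustrative.

package chwriter

import (
	"encoding/binary"
	"hash/fnv"
)

// Row is a minimal stand-in for one span_data_local row.
type Row struct {
	TraceID   string
	SpanID    string
	StartTime int64 // microseconds, matching DateTime64(6)
	Body      []byte
}

// shardKey mirrors the sharding expression of span_data_all: a hash over
// traceID + spanID + startTime (FNV-1a here instead of xxHash64, purely to
// keep the sketch dependency-free).
func shardKey(r Row) uint64 {
	h := fnv.New64a()
	h.Write([]byte(r.TraceID))
	h.Write([]byte(r.SpanID))
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(r.StartTime))
	h.Write(ts[:])
	return h.Sum64()
}

// dispatch groups a batch by target node so each group can be flushed with
// one INSERT into that node's span_data_local table, bypassing the
// Distributed table; nodes comes from the periodic cluster-node fetch.
func dispatch(nodes []string, rows []Row) map[string][]Row {
	byNode := make(map[string][]Row, len(nodes))
	for _, r := range rows {
		node := nodes[shardKey(r)%uint64(len(nodes))]
		byNode[node] = append(byNode[node], r)
	}
	return byNode
}

Writing straight to each node's local table keeps the Distributed table for reads only, a common ClickHouse practice for sustaining high write throughput.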
Results: ClickHouse ZSTD compression achieved a 12× reduction; the hot and cold clusters run on dozens of 16C 64G ESSD nodes with a write throughput of 250k rows/s per node. Compared with SLS-Trace, storage cost dropped by 66% and query latency improved from >800 ms to ~490 ms.
Future plans: Move raw span data to block storage (HDFS/OSS) and store only offsets in ClickHouse to further cut storage usage.
References:
SLS‑Trace solution: https://developer.aliyun.com/article/785854
OpenTelemetry collector exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/alibabacloudlogserviceexporter
Uptrace project: https://uptrace.dev/
Signoz project: https://signoz.io/
Uptrace schema design: https://github.com/uptrace/uptrace/tree/v0.2.16/pkg/bunapp/migrations
