Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry
This article describes how Bilibili moved its log system from an Elastic Stack‑based solution to a ClickHouse‑backed architecture built on OpenTelemetry. It covers the cost, stability, and scalability problems of the old system, the new components (Log‑Agent, Log‑Ingester, and a custom visualization platform), the resulting performance gains, and future directions.
Background
Bilibili has operated an Elastic Stack‑based log system (Billions) for over five years. The cluster grew to more than 500 machines and ingested over 700 TB of logs daily, but it ran into cost, stability, and scalability limits as the service grew.
Problems
High write‑throughput bottlenecks and CPU‑heavy tokenization in Elasticsearch caused latency and stability issues.
Storage cost and memory pressure required aggressive sampling and index closing.
Dynamic mapping, lifecycle management, and Kibana maintenance added operational overhead.
Custom JSON SDKs in Java and Go had performance and compatibility drawbacks.
New Architecture (Log Service 2.0)
The redesign replaces Elasticsearch with ClickHouse for storage, introduces a self‑built visualization platform, and adopts OpenTelemetry as a unified logging protocol.
Key Components
OTEL Logging SDK : High‑performance structured‑logging SDK for Go and Java implementing the OpenTelemetry logging model.
Log‑Agent : Deployed as an agent on physical hosts; it receives OTEL logs over a domain socket and performs low‑latency file collection and basic parsing.
Log‑Ingester : Consumes logs from Kafka, partitions them by time and metadata, and batches writes into ClickHouse.
ClickHouse : Columnar storage with high compression (ZSTD) and implicit‑column support for dynamic schema, delivering 10× write throughput and 2× query speed compared to Elasticsearch.
Log‑Query : Handles routing, load‑balancing, caching, rate‑limiting, and simplifies query syntax for end users.
BLS‑Discovery : Custom visualization platform offering Kibana‑like UI with zero learning curve.
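The Log‑Ingester step (group by time and metadata, then write in batches) can be sketched as follows. This is a minimal illustration, not Bilibili's implementation: the record fields `app` and `ts`, the hourly bucket, the row threshold, and the `flush` callback standing in for a ClickHouse insert are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical batch-size threshold; the article does not state the real value.
MAX_BATCH_ROWS = 3


def partition_key(record):
    """Group by (app, hour bucket); both fields are assumed for illustration."""
    ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return (record["app"], ts.strftime("%Y-%m-%d %H:00"))


class Batcher:
    def __init__(self, flush, max_rows=MAX_BATCH_ROWS):
        self._flush = flush            # callback standing in for a ClickHouse insert
        self._max_rows = max_rows
        self._pending = defaultdict(list)

    def add(self, record):
        key = partition_key(record)
        rows = self._pending[key]
        rows.append(record)
        # Flush a partition as soon as it reaches the row threshold.
        if len(rows) >= self._max_rows:
            self._flush(key, self._pending.pop(key))

    def close(self):
        """Flush whatever is left, e.g. on shutdown or a timer tick."""
        for key in list(self._pending):
            self._flush(key, self._pending.pop(key))


flushed = []
batcher = Batcher(lambda key, rows: flushed.append((key, len(rows))))
for i in range(4):
    batcher.add({"app": "web", "ts": 1_700_000_000 + i, "msg": f"m{i}"})
batcher.add({"app": "api", "ts": 1_700_000_000, "msg": "x"})
batcher.close()
```

Writing fewer, larger batches per partition suits ClickHouse's MergeTree engine, which creates a new data part for each insert. A production ingester would also flush on a timer so that quiet partitions do not wait indefinitely.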
Performance Highlights
ClickHouse achieved ten‑fold write throughput and reduced storage cost to one‑third of the previous system; structured‑field queries are twice as fast, with 99 % of queries completing within three seconds.
Query Gateway
The gateway abstracts underlying ClickHouse details, providing a simplified SQL‑like API, automatic query rewriting, and Lucene‑to‑SQL translation, so the storage engine can be swapped in the future without affecting users.
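To make the Lucene‑to‑SQL idea concrete, here is a deliberately small sketch that handles only `field:value` terms joined by explicit AND/OR. The column set `KNOWN_COLUMNS` and the Map column name `attrs` are hypothetical; the real gateway's grammar and schema are not described in the article.

```python
import re

# A token is either field:"quoted value" or any run of non-space characters.
TOKEN = re.compile(r'[\w.]+:"[^"]*"|\S+')
KNOWN_COLUMNS = {"level", "app_id"}  # hypothetical top-level columns


def quote(value):
    """Render a string literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def column_expr(field):
    if not re.fullmatch(r"[\w.]+", field):
        raise ValueError(f"invalid field name: {field!r}")
    # Unknown fields are assumed to live in a Map column named `attrs`.
    return field if field in KNOWN_COLUMNS else f"attrs[{quote(field)}]"


def lucene_to_where(query):
    parts = []
    expect_term = True
    for tok in TOKEN.findall(query):
        if tok in ("AND", "OR"):
            if expect_term:
                raise ValueError("operator without a preceding term")
            parts.append(tok)
            expect_term = True
            continue
        if not expect_term:
            raise ValueError("this sketch requires an explicit AND/OR between terms")
        field, sep, value = tok.partition(":")
        if not sep or not value:
            raise ValueError(f"expected field:value, got {tok!r}")
        parts.append(f"{column_expr(field)} = {quote(value.strip(chr(34)))}")
        expect_term = False
    if expect_term:
        raise ValueError("query is empty or ends with an operator")
    return " ".join(parts)


where = lucene_to_where('level:ERROR AND path:"/api/v1 x"')
```

Here `where` holds a WHERE‑clause fragment in which `level` is addressed as a plain column and `path` as a Map lookup. Because users only ever write the Lucene‑style form, the gateway is free to change how fields map to columns.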
Visualization Platform
The self‑built UI replicates Kibana ergonomics while adding features such as query auto‑completion, syntax highlighting, and instant SQL aggregation.
Log Alerting
Alert rules are defined with attributes like data source, time window, aggregation function, filter expression, and notification channel; over 5,000 alerts have been migrated from the ES‑based system.
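An alert rule with the attributes listed above might be modeled roughly like this. The field names, the `threshold` attribute, and the in‑memory evaluation are assumptions made for illustration; the article does not show the actual rule schema or how rules are executed against ClickHouse.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    data_source: str                  # which log stream to read
    window_seconds: int               # look-back window
    aggregation: str                  # only "count" is supported in this sketch
    filter: Callable[[dict], bool]    # filter expression as a predicate
    threshold: int                    # assumed attribute: fire when value >= threshold
    channel: str                      # notification channel name


def evaluate(rule, logs, now):
    """Return (fired, value) for the window ending at `now`."""
    if rule.aggregation != "count":
        raise ValueError("this sketch only implements count")
    start = now - rule.window_seconds
    value = sum(1 for r in logs if start <= r["ts"] <= now and rule.filter(r))
    return value >= rule.threshold, value


rule = AlertRule(
    data_source="app_log",
    window_seconds=60,
    aggregation="count",
    filter=lambda r: r["level"] == "ERROR",
    threshold=2,
    channel="webhook",
)
logs = [
    {"ts": 100, "level": "ERROR"},
    {"ts": 130, "level": "ERROR"},
    {"ts": 20, "level": "ERROR"},   # outside the 60-second window
    {"ts": 140, "level": "INFO"},   # inside the window but filtered out
]
fired, value = evaluate(rule, logs, now=150)
```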
OpenTelemetry Logging
OpenTelemetry is a language‑agnostic observability standard covering metrics, logs, and traces, and it includes a stable log data model; Bilibili implements Go and Java SDKs and integrates the protocol into Log‑Agent.
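For readers unfamiliar with the OpenTelemetry log data model, the sketch below shows a simplified record with a subset of its fields. It is written in Python for brevity and is not Bilibili's Go or Java SDK; the class layout and the example values are illustrative only.

```python
from dataclasses import dataclass, field
import time

# Lowest SeverityNumber of each range in the OpenTelemetry log data model.
SEVERITY_BASE = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}


@dataclass
class LogRecord:
    body: str
    severity_text: str = "INFO"
    timestamp_ns: int = field(default_factory=time.time_ns)
    trace_id: str = ""                               # links the log to a trace
    span_id: str = ""
    attributes: dict = field(default_factory=dict)   # per-record structured fields
    resource: dict = field(default_factory=dict)     # describes the emitting service

    @property
    def severity_number(self):
        return SEVERITY_BASE[self.severity_text]


record = LogRecord(
    body="payment failed",
    severity_text="ERROR",
    attributes={"order_id": "42"},
    resource={"service.name": "pay-gateway"},
)
```

Keeping structured fields in `attributes` rather than inside the message text is what lets the storage layer index and query them as columns.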
ClickHouse Map Optimizations
To handle dynamic schema, ClickHouse Map fields are stored as implicit columns (MapV2), enabling per‑key indexing, bloom‑filter support, and significant query speedups. Example DDL:
CREATE TABLE bloom_filter_map (
    id UInt32,
    map Map(String, String),
    INDEX map_index map TYPE tokenbf_v1(128, 3, 0) GRANULARITY 1
) ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 2;
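The three `tokenbf_v1` parameters are the filter size in bytes, the number of hash functions, and a random seed. The sketch below shows how such a token bloom filter lets a reader skip data that cannot contain a token. It is a conceptual model only: the SHA‑256‑based hashing is a stand‑in and is not the hash ClickHouse uses.

```python
import hashlib
import re


class TokenBloomFilter:
    def __init__(self, size_bytes=128, num_hashes=3, seed=0):
        self.num_bits = size_bytes * 8
        self.num_hashes = num_hashes
        self.seed = seed
        self.bits = bytearray(size_bytes)

    def _positions(self, token):
        # Derive num_hashes bit positions per token (stand-in hash scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{self.seed}:{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_text(self, text):
        # Tokens are runs of alphanumeric characters, as in a token-based index.
        for token in re.findall(r"[0-9A-Za-z]+", text):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, token):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(token))


# One filter per block of rows; a block whose filter rejects the token is skipped.
block = TokenBloomFilter()
block.add_text("request_id=abc123 status=timeout")
can_skip = not block.might_contain("abc123")
```

A bloom filter never reports a stored token as absent, so skipping is always safe; the trade‑off is occasional false positives, which cost an unnecessary read rather than a wrong result.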
Future Work
Log pattern extraction for better compression and post‑processing.
Integration with lake‑house architectures for long‑term, low‑cost storage and advanced analytics.
Full‑text search capabilities in ClickHouse to close the gap with Elasticsearch.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.