How Bilibili Scaled Its OLAP Platform with ClickHouse and Lakehouse Integration
At Bilibili, the OLAP platform evolved through three phases—consolidating data services onto ClickHouse, migrating text search to ClickHouse, and integrating a lake‑house architecture—delivering massive cost reductions, sub‑second query latency, and scalable analytics for billions of daily events.
Background
Two years ago Bilibili operated many independent OLAP engines (Apache Kylin, Elasticsearch, Druid, ClickHouse, Presto, etc.), resulting in high maintenance cost and unstable performance.
Phase 1 – Consolidation to ClickHouse
All data‑service workloads were migrated to a ClickHouse cluster, retiring Kylin and Druid. ClickHouse was chosen for its native file‑system storage, vectorized execution engine, rich built‑in functions, MergeTree engines, materialized views and indexes.
Typical workloads include user‑behavior analysis, tag selection and content analysis, processing billions of events daily. After migration, a 64‑node ClickHouse cluster (≈5 PB) achieves a P90 query latency of 4 seconds, compared with 10‑30 minutes on Spark.
To mitigate write amplification and MergeTree overhead, a Spark‑ClickHouse bulk‑load pipeline was built. Spark tasks generate ClickHouse data files locally; a two‑phase commit uploads the files to the ClickHouse cluster, moving most I/O off the OLAP nodes.
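The bulk-load flow can be pictured with a small simulation. This is only a sketch under stated assumptions, not Bilibili's pipeline: plain directories stand in for Spark executors and ClickHouse nodes, and the names build_part_locally, prepare, commit and bulk_load are hypothetical. It shows the shape of the idea: parts are built and sorted away from the serving nodes, copied to a staging area that queries never read, and published only after every copy has succeeded.

```python
import os
import shutil
import tempfile


def build_part_locally(workdir, part_name, rows):
    """Stand-in for a Spark task that writes one pre-sorted data part on its own disk."""
    part_dir = os.path.join(workdir, part_name)
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "data.txt"), "w") as f:
        for row in sorted(rows):
            f.write(f"{row}\n")
    return part_dir


def prepare(part_dirs, staging_dir):
    """Phase 1: copy every part into a staging area that queries never read."""
    staged = []
    for part_dir in part_dirs:
        target = os.path.join(staging_dir, os.path.basename(part_dir))
        shutil.copytree(part_dir, target)
        staged.append(target)
    return staged


def commit(staged, active_dir):
    """Phase 2: publish staged parts by renaming them into the active directory."""
    for path in staged:
        os.rename(path, os.path.join(active_dir, os.path.basename(path)))


def bulk_load(batches, staging_dir, active_dir):
    """Build parts locally, stage them all, then publish; abort cleanly if staging fails."""
    with tempfile.TemporaryDirectory() as workdir:
        parts = [
            build_part_locally(workdir, f"part_{i}", rows)
            for i, rows in enumerate(batches)
        ]
        try:
            staged = prepare(parts, staging_dir)
        except Exception:
            # Nothing has been published yet, so aborting only clears staging.
            for name in os.listdir(staging_dir):
                shutil.rmtree(os.path.join(staging_dir, name))
            raise
        commit(staged, active_dir)
    return sorted(os.listdir(active_dir))
```

Calling bulk_load([[3, 1, 2], [9, 8]], staging, active) with two empty directories on the same filesystem leaves two parts in the active directory and nothing in staging. The expensive sorting and file writing happen in the temporary work directory, which is the point of moving I/O off the OLAP nodes.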
Phase 2 – Text Search Migration to ClickHouse
Log‑based text search and search‑ranking workloads were moved from Elasticsearch to ClickHouse. For log search, ClickHouse primary‑key range pruning and token‑bloom‑filter indexes on the message field provide:
≈10× write speedup
≈⅓ storage cost
≈2× query speedup (P90 ≈ 3 s)
Search‑ranking still uses Elasticsearch for scoring, while exact‑match log queries run on ClickHouse.
Key techniques:
Primary‑key range pruning to limit the granules scanned.
Token‑bloom‑filter index on message for lightweight full‑text filtering.
On‑the‑fly computation for aggregations (bitmap, set operations, etc.).
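The first two techniques can be illustrated with a toy model in plain Python. It is a sketch of the general idea rather than ClickHouse's implementation: rows are grouped into small granules sorted by a timestamp key, a sparse list of first keys narrows the candidate granules for a time range, and a per-granule Bloom filter over message tokens lets the search skip granules that cannot contain the term. The granule size, filter size and all function names are assumptions chosen for readability.

```python
import bisect
import hashlib
import re

GRANULE_ROWS = 4     # tiny for illustration; real granules hold thousands of rows
BLOOM_BITS = 1024    # assumed filter size
BLOOM_HASHES = 3     # assumed number of hash functions


def tokens(message):
    """Split a message into tokens on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", message) if t]


def bloom_positions(token):
    """Derive BLOOM_HASHES bit positions for a token from one SHA-256 digest."""
    digest = hashlib.sha256(token.encode()).digest()
    return [
        int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
        for i in range(BLOOM_HASHES)
    ]


class Granule:
    def __init__(self, rows):
        self.rows = rows              # (timestamp, message) pairs, sorted by timestamp
        self.first_key = rows[0][0]
        self.bloom = 0                # bit set stored in a Python int
        for _, message in rows:
            for tok in tokens(message):
                for pos in bloom_positions(tok):
                    self.bloom |= 1 << pos

    def may_contain(self, token):
        """False means the token is definitely absent; True means it may be present."""
        return all((self.bloom >> pos) & 1 for pos in bloom_positions(token))


def build(rows):
    """Sort rows by key and cut them into fixed-size granules."""
    rows = sorted(rows)
    return [Granule(rows[i:i + GRANULE_ROWS]) for i in range(0, len(rows), GRANULE_ROWS)]


def search(granules, start, end, token):
    """Return (matching rows, number of granules actually scanned)."""
    first_keys = [g.first_key for g in granules]
    # Sparse-index pruning: keep one granule before the first key >= start,
    # because its tail may still hold rows inside the range.
    lo = max(bisect.bisect_left(first_keys, start) - 1, 0)
    hi = bisect.bisect_right(first_keys, end)
    scanned, hits = 0, []
    for g in granules[lo:hi]:
        if not g.may_contain(token):
            continue                  # skipped without reading any rows
        scanned += 1
        hits += [r for r in g.rows if start <= r[0] <= end and token in tokens(r[1])]
    return hits, scanned
```

A Bloom filter can report false positives but never false negatives, so a skipped granule is always safe to skip, while a scanned granule still has to be checked row by row. That trade-off is what keeps the index lightweight compared with a full inverted index.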
Phase 3 – Lake‑House Integration (Iceberg + Trino)
A lake‑house stack centered on Apache Iceberg was introduced. Spark and Flink write to Iceberg tables; a custom service named Magnus receives write events and triggers asynchronous optimizations (small‑file merging, data sorting, index creation). Trino serves queries on Iceberg, while Alluxio caches Iceberg metadata and indexes.
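One of the asynchronous optimizations, small-file merging, comes down to deciding which files to rewrite together. The sketch below shows one plausible planning step using first-fit-decreasing bin packing. The source does not describe Magnus's actual policy, so the DataFile type, the plan_compaction function and both size thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_mb: int


def plan_compaction(files, target_mb=512, small_mb=128):
    """Group files smaller than small_mb into rewrite tasks of at most target_mb each."""
    small = sorted(
        (f for f in files if f.size_mb < small_mb),
        key=lambda f: f.size_mb,
        reverse=True,
    )
    groups = []  # each entry is [total_size_mb, [files]]
    for f in small:
        for group in groups:
            if group[0] + f.size_mb <= target_mb:
                group[0] += f.size_mb
                group[1].append(f)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([f.size_mb, [f]])
    # Rewriting a single file on its own brings no benefit, so drop such groups.
    return [g[1] for g in groups if len(g[1]) > 1]
```

In a design like the one described, a planner of this kind would run after a write event arrives, and each returned group would become one background rewrite task, so the writers themselves never wait for compaction.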
Benefits:
ACID guarantees and near‑real‑time visibility for historical logs.
Lower storage cost compared with ClickHouse for long‑term data.
Metric service processes ~200 k queries per day with P90 ≈ 1.2 s.
Engine Selection Guidance
Three engines coexist:
Elasticsearch – used for search‑ranking where relevance scoring is required.
ClickHouse – preferred for sub‑second interactive analytical queries (e.g., user‑behavior analysis, log search).
Lake‑house (Iceberg + Trino) – chosen for cost‑effective, near‑real‑time analytics where sub‑second latency is not mandatory (e.g., metric service, long‑term log storage).
Selection is based on business type and latency requirements.
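The guidance above can be written down as a small decision function. This is an illustrative sketch only: the Workload fields and the order of the checks are assumptions that mirror the three bullet points, not a routing service described in the source.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    ELASTICSEARCH = "Elasticsearch"
    CLICKHOUSE = "ClickHouse"
    LAKEHOUSE = "Iceberg + Trino"


@dataclass(frozen=True)
class Workload:
    name: str
    needs_relevance_scoring: bool     # e.g. search ranking
    needs_interactive_latency: bool   # e.g. user-behavior analysis, log search


def choose_engine(workload):
    """Pick an engine from business type first, then latency requirement."""
    if workload.needs_relevance_scoring:
        return Engine.ELASTICSEARCH
    if workload.needs_interactive_latency:
        return Engine.CLICKHOUSE
    return Engine.LAKEHOUSE
```

For example, choose_engine(Workload("metric service", False, False)) returns Engine.LAKEHOUSE, matching the placement of the metric service on Iceberg and Trino.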
Conclusions
ClickHouse provides a powerful, versatile OLAP engine covering most interactive analytical scenarios with low latency and reduced cost. The lake‑house complements ClickHouse by offering cheaper storage, ACID guarantees, and near‑real‑time visibility for offline‑oriented workloads. Future work aims to unify compute layers, following trends in StarRocks and Doris.
dbaplus Community