Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture
Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.
In the previous article, Bilibili introduced its Iceberg‑based lakehouse architecture. This article continues the discussion by describing the evolution of Bilibili's data retrieval services, which is a key manifestation of the lakehouse practice.
The Data Platform Department provides a variety of data services (BI analysis, ABTest, user profiling, traffic analysis, etc.) that all rely on massive data extraction. As business grows, the extraction services face three major challenges: increasing demand with limited manpower, repeated construction of Lambda/Kappa‑style big‑data architectures, and high performance‑optimization costs due to the use of multiple engines (Elasticsearch, ClickHouse, HBase, MongoDB).
To address these problems, Bilibili has performed two major architectural upgrades, moving toward service‑oriented and platform‑oriented solutions.
Stone Age – Chimney‑style Development
Early retrieval services followed a four‑stage pipeline: data modeling (ODS/DWD/DWA layers via Hive, Spark, Flink), data storage (TiDB for metrics, ClickHouse for batch details, TaiShan KV for point queries), query interfaces (custom HTTP APIs), and data products (BI platforms, DMP, ABTest, UP‑owner insights). Two roles supported this pipeline: data‑warehouse engineers and application developers. As demand grew, the model suffered from heavy data‑modeling workload, duplicated architecture, and inconsistent data definitions.
Iron Age – Unified Services
During this phase, storage and computation were unified. A data-construction system (code-named Ark) built on Flink ingests both real-time (Kafka) and batch (Hive HCatalog) data and writes to four unified storage engines:
Elasticsearch for metric data with dynamic columns (millisecond response).
ClickHouse for wide‑table detail data (second‑level response).
TiDB for detail data supporting billion‑scale point queries (millisecond response).
InfluxDB for real‑time metrics (millisecond response).
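The division of labor among the four engines can be summarized as a routing decision on workload shape. Below is a minimal Python sketch of that routing; the `Dataset` type and `route_storage` function are illustrative names, not Ark's actual interfaces.

```python
# Hypothetical sketch of Ark-style storage routing: metric vs. detail
# data, real-time vs. batch ingestion, point-query vs. scan access.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    kind: str         # "metric" or "detail"
    realtime: bool    # ingested from Kafka (True) or Hive HCatalog (False)
    point_query: bool # billion-scale point lookups vs. wide-table scans

def route_storage(ds: Dataset) -> str:
    """Pick a unified storage engine, mirroring the division above."""
    if ds.kind == "metric":
        return "InfluxDB" if ds.realtime else "Elasticsearch"
    # Detail data: TiDB for point queries, ClickHouse for wide scans.
    return "TiDB" if ds.point_query else "ClickHouse"

print(route_storage(Dataset("play_log", "detail", False, True)))  # TiDB
```

In practice the routing would also weigh latency targets (millisecond vs. second-level) and dynamic-column needs, but the table-per-workload idea is the same.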
The data‑construction platform visualizes job configuration and runs on the internal AiFlow scheduler.
For querying, the Akuya SQL Engine (ASE) was built. It adopts a SQL‑DSL based on TiDB Parser and Apache Calcite, providing a unified JDBC driver. The query flow is:
Parse SQL to AST.
Validate metadata via AMS (internal metadata service).
Apply optimization strategies (e.g., parallel execution for multi‑metric queries).
Generate physical plans that translate to engine‑specific queries (e.g., ES DSL, JDBC for TiDB).
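The four steps above can be sketched as a small pipeline. This is a toy illustration only; the real Akuya engine uses TiDB Parser and Apache Calcite, and the function bodies here are stand-ins.

```python
# Toy sketch of the ASE query flow: parse -> validate -> optimize -> plan.
def parse(sql: str) -> dict:
    """Step 1: parse SQL into a (toy) AST; here we only pull the table."""
    table = sql.lower().split(" from ")[1].split()[0]
    return {"sql": sql, "table": table}

def validate(ast: dict, catalog: dict) -> dict:
    """Step 2: check metadata against AMS (modeled as a plain dict)."""
    if ast["table"] not in catalog:
        raise ValueError(f"unknown table {ast['table']}")
    ast["engine"] = catalog[ast["table"]]
    return ast

def optimize(ast: dict) -> dict:
    """Step 3: apply strategies, e.g. flag multi-metric queries
    so their aggregates can run in parallel."""
    ast["parallel"] = ast["sql"].count("(") > 1
    return ast

def plan(ast: dict) -> str:
    """Step 4: translate to an engine-specific physical query
    (ES DSL, JDBC for TiDB, ...); here just a tagged string."""
    return f"{ast['engine']}:{ast['sql']}"

catalog = {"business.metric": "elasticsearch"}
physical = plan(optimize(validate(
    parse("select dim1, sum(pv) from business.metric group by dim1"),
    catalog)))
```

The value of the design is that every downstream product speaks one SQL dialect through one JDBC driver, while step 4 hides the per-engine query languages.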
Example queries (preserved as code):
select dim1, dim2, pv, uv from business.metric where log_date = '20210310'

select dim1, sum(pv) from business.metric where dim1 is not null and dim2 is not null and log_date = '20210310' group by dim1

select month(), sum(pv) from business.metric where dim1 is not null and log_date > '20200101' and log_date <= '20201231' group by month()

select log_date, pv, year_to_year(pv) from business.metric where dim1 is not null and dim2 is not null and log_date >= '20210301' and log_date <= '20210310'

select log_date, uv, month_to_month(uv) from business.metric where dim1 is not null and dim2 is not null and log_date >= '20210301' and log_date <= '20210310'

select CTR(点击pv, 展现pv) from business.metric where dim1 is not null and dim2 is not null and log_date > '20200101' and log_date <= '20201231'

UDFs such as month(), year_to_year(), month_to_month(), and CTR() are provided for custom calculations.
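For intuition, here is what UDFs like these plausibly compute. The formulas below are common definitions of these metrics, not the documented Akuya semantics, so treat them as assumptions.

```python
# Assumed semantics for the CTR / year-over-year / month-over-month UDFs.
def ctr(click_pv: int, show_pv: int) -> float:
    """Click-through rate: clicks divided by impressions (0 if none)."""
    return click_pv / show_pv if show_pv else 0.0

def year_to_year(current: float, same_period_last_year: float) -> float:
    """Year-over-year growth ratio against the same period last year."""
    return (current - same_period_last_year) / same_period_last_year

def month_to_month(current: float, previous_month: float) -> float:
    """Month-over-month growth ratio against the previous month."""
    return (current - previous_month) / previous_month
```

Pushing these into the SQL layer means every product computes growth rates and CTR the same way, instead of re-implementing them per application.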
Mid‑term challenges after unifying storage and query included data duplication (offline data and storage engines both hold copies), high optimization cost for each engine, and limited open tools for data security, DQC, sampling, etc.
Industrial Age – Lakehouse Integration
Starting in 2021, Bilibili introduced a lakehouse architecture to reduce data‑out‑warehouse costs and provide PaaS capabilities. Core capabilities:
HDFS‑based lakehouse where data does not need to be extracted.
Unified metadata management via Iceberg and HCatalog, enabling seamless switching between Hive and Iceberg tables.
Data indexing and ACID transactions on Iceberg tables, delivering second‑level or even millisecond‑level query latency.
Platformized services such as data indexing, ad‑hoc analysis (via Trino), ETL with SQL read/write to Hive/Iceberg, metadata queries, and auto‑generated data APIs.
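One way the auto-generated data APIs might look: given a table plus its dimensions and metrics, the platform can emit a parameterized query function so product teams never hand-write SQL. The builder below is a hypothetical sketch; `build_data_api` and its signature are invented for illustration.

```python
# Hypothetical generator for a "data API": table metadata in,
# parameterized query function out.
def build_data_api(table: str, dims: list[str], metrics: list[str]):
    """Return a query function over the table, as a generated API might."""
    def query(log_date: str, **filters: str) -> str:
        where = [f"log_date = '{log_date}'"]
        where += [f"{k} = '{v}'" for k, v in filters.items()]
        cols = ", ".join(dims + metrics)
        return (f"select {cols} from {table} "
                f"where {' and '.join(where)} "
                f"group by {', '.join(dims)}")
    return query

pv_api = build_data_api("business.metric", ["dim1"], ["sum(pv)"])
```

A real implementation would bind parameters safely rather than interpolate strings, but the shape (metadata-driven, no hand-written SQL at call sites) is the point.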
Sample ETL statement (preserved as code):
insert overwrite table iceberg_rta.dm_growth_dwd_rta_action_search_click_deeplink_l_1d_d
select
search_time,
click_time,
request_id,
click_id,
remote_ip,
platform,
app_id,
account_id,
rta_device,
click_device,
start_device,
source_id
from b_dwm.dm_growth_dwd_rta_action_search_click_deeplink_l_1d_d
where log_date='20220210'
distribute by source_id sort by source_id

Current challenges focus on further reducing data duplication, simplifying performance tuning across engines, and expanding open tools for security and quality control.
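The distribute by / sort by clauses in the ETL statement above cluster rows by source_id, so each data file covers a narrow key range. Iceberg records per-file min/max column statistics, letting a point query skip files whose range excludes the predicate. A toy illustration, with invented file names and stats:

```python
# Toy min/max file skipping: the effect sorting by source_id enables.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_source_id: int  # per-file column stats, as Iceberg tracks them
    max_source_id: int

def prune(files: list[DataFile], source_id: int) -> list[DataFile]:
    """Keep only files whose [min, max] range can contain source_id."""
    return [f for f in files
            if f.min_source_id <= source_id <= f.max_source_id]

# Sorted data yields tight, non-overlapping ranges...
files = [
    DataFile("part-0.parquet", 1, 100),
    DataFile("part-1.parquet", 101, 200),
    DataFile("part-2.parquet", 201, 300),
]
# ...so a point query reads one file instead of three. Unsorted data
# would give overlapping ranges and defeat the pruning.
```

This is how an HDFS-resident table can approach the second- or millisecond-level latencies mentioned earlier without copying data out to a dedicated serving store.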
Looking ahead, the lakehouse platform aims to enhance collaborative development, lower the barrier for data generation and consumption, and continue improving the data retrieval experience.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.