Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture
Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.
In the previous article, Bilibili introduced its Iceberg‑based lakehouse architecture. This article continues the discussion by describing the evolution of Bilibili's data retrieval services, which is a key manifestation of the lakehouse practice.
The Data Platform Department provides a variety of data services (BI analysis, ABTest, user profiling, traffic analysis, etc.) that all rely on massive data extraction. As business grows, the extraction services face three major challenges: increasing demand with limited manpower, repeated construction of Lambda/Kappa‑style big‑data architectures, and high performance‑optimization costs due to the use of multiple engines (Elasticsearch, ClickHouse, HBase, MongoDB).
To address these problems, Bilibili has performed two major architectural upgrades, moving toward service‑oriented and platform‑oriented solutions.
Stone Age – Chimney‑style Development
Early retrieval services followed a four‑stage pipeline: data modeling (ODS/DWD/DWA layers via Hive, Spark, Flink), data storage (TiDB for metrics, ClickHouse for batch details, TaiShan KV for point queries), query interfaces (custom HTTP APIs), and data products (BI platforms, DMP, ABTest, UP‑owner insights). Two roles supported this pipeline: data‑warehouse engineers and application developers. As demand grew, the model suffered from heavy data‑modeling workload, duplicated architecture, and inconsistent data definitions.
Iron Age – Unified Services
During this phase, storage and computation were unified. A data-construction system (code-named Ark) built on Flink ingests both real-time (Kafka) and batch (Hive HCatalog) data and writes to four unified storage engines:
Elasticsearch for metric data with dynamic columns (millisecond response).
ClickHouse for wide‑table detail data (second‑level response).
TiDB for detail data supporting billion‑scale point queries (millisecond response).
InfluxDB for real‑time metrics (millisecond response).
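The division of labor among the four engines can be summarized as a routing decision on workload shape. Below is a minimal Python sketch of that routing; the `Dataset` type and `route_storage` function are illustrative names, not Ark's actual interfaces.

```python
# Hypothetical sketch of Ark-style storage routing: metric vs. detail
# data, real-time vs. batch ingestion, point-query vs. scan access.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    kind: str         # "metric" or "detail"
    realtime: bool    # ingested from Kafka (True) or Hive HCatalog (False)
    point_query: bool # billion-scale point lookups vs. wide-table scans

def route_storage(ds: Dataset) -> str:
    """Pick a unified storage engine, mirroring the division above."""
    if ds.kind == "metric":
        return "InfluxDB" if ds.realtime else "Elasticsearch"
    # Detail data: TiDB for point queries, ClickHouse for wide scans.
    return "TiDB" if ds.point_query else "ClickHouse"

print(route_storage(Dataset("play_log", "detail", False, True)))  # TiDB
```

In practice the routing would also weigh latency targets (millisecond vs. second-level) and dynamic-column needs, but the table-per-workload idea is the same.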
The data‑construction platform visualizes job configuration and runs on the internal AiFlow scheduler.
For querying, the Akuya SQL Engine (ASE) was built. It adopts a SQL‑DSL based on TiDB Parser and Apache Calcite, providing a unified JDBC driver. The query flow is:
Parse SQL to AST.
Validate metadata via AMS (internal metadata service).
Apply optimization strategies (e.g., parallel execution for multi‑metric queries).
Generate physical plans that translate to engine‑specific queries (e.g., ES DSL, JDBC for TiDB).
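The four steps above can be sketched as a small pipeline. This is a toy illustration only; the real Akuya engine uses TiDB Parser and Apache Calcite, and the function bodies here are stand-ins.

```python
# Toy sketch of the ASE query flow: parse -> validate -> optimize -> plan.
def parse(sql: str) -> dict:
    """Step 1: parse SQL into a (toy) AST; here we only pull the table."""
    table = sql.lower().split(" from ")[1].split()[0]
    return {"sql": sql, "table": table}

def validate(ast: dict, catalog: dict) -> dict:
    """Step 2: check metadata against AMS (modeled as a plain dict)."""
    if ast["table"] not in catalog:
        raise ValueError(f"unknown table {ast['table']}")
    ast["engine"] = catalog[ast["table"]]
    return ast

def optimize(ast: dict) -> dict:
    """Step 3: apply strategies, e.g. flag multi-metric queries
    so their aggregates can run in parallel."""
    ast["parallel"] = ast["sql"].count("(") > 1
    return ast

def plan(ast: dict) -> str:
    """Step 4: translate to an engine-specific physical query
    (ES DSL, JDBC for TiDB, ...); here just a tagged string."""
    return f"{ast['engine']}:{ast['sql']}"

catalog = {"business.metric": "elasticsearch"}
physical = plan(optimize(validate(
    parse("select dim1, sum(pv) from business.metric group by dim1"),
    catalog)))
```

The value of the design is that every downstream product speaks one SQL dialect through one JDBC driver, while step 4 hides the per-engine query languages.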
Example queries (preserved as code):
select dim1, dim2, pv, uv from business.metric where log_date = '20210310'

select dim1, sum(pv) from business.metric where dim1 is not null and dim2 is not null and log_date = '20210310' group by dim1

select month(), sum(pv) from business.metric where dim1 is not null and log_date > '20200101' and log_date <= '20201231' group by month()

select log_date, pv, year_to_year(pv) from business.metric where dim1 is not null and dim2 is not null and log_date >= '20210301' and log_date <= '20210310'

select log_date, uv, month_to_month(uv) from business.metric where dim1 is not null and dim2 is not null and log_date >= '20210301' and log_date <= '20210310'

select CTR(点击pv, 展现pv) from business.metric where dim1 is not null and dim2 is not null and log_date > '20200101' and log_date <= '20201231'

UDFs such as month(), year_to_year(), month_to_month(), and CTR() are provided for custom calculations.
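For intuition, here is what UDFs like these plausibly compute. The formulas below are common definitions of these metrics, not the documented Akuya semantics, so treat them as assumptions.

```python
# Assumed semantics for the CTR / year-over-year / month-over-month UDFs.
def ctr(click_pv: int, show_pv: int) -> float:
    """Click-through rate: clicks divided by impressions (0 if none)."""
    return click_pv / show_pv if show_pv else 0.0

def year_to_year(current: float, same_period_last_year: float) -> float:
    """Year-over-year growth ratio against the same period last year."""
    return (current - same_period_last_year) / same_period_last_year

def month_to_month(current: float, previous_month: float) -> float:
    """Month-over-month growth ratio against the previous month."""
    return (current - previous_month) / previous_month
```

Pushing these into the SQL layer means every product computes growth rates and CTR the same way, instead of re-implementing them per application.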
Mid‑term challenges after unifying storage and query included data duplication (offline data and storage engines both hold copies), high optimization cost for each engine, and limited open tools for data security, DQC, sampling, etc.
Industrial Age – Lakehouse Integration
Starting in 2021, Bilibili introduced a lakehouse architecture to reduce data‑out‑warehouse costs and provide PaaS capabilities. Core capabilities:
HDFS‑based lakehouse where data does not need to be extracted.
Unified metadata management via Iceberg and HCatalog, enabling seamless switching between Hive and Iceberg tables.
Data indexing and ACID transactions on Iceberg tables, delivering second‑level or even millisecond‑level query latency.
Platformized services such as data indexing, ad‑hoc analysis (via Trino), ETL with SQL read/write to Hive/Iceberg, metadata queries, and auto‑generated data APIs.
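One way the auto-generated data APIs might look: given a table plus its dimensions and metrics, the platform can emit a parameterized query function so product teams never hand-write SQL. The builder below is a hypothetical sketch; `build_data_api` and its signature are invented for illustration.

```python
# Hypothetical generator for a "data API": table metadata in,
# parameterized query function out.
def build_data_api(table: str, dims: list[str], metrics: list[str]):
    """Return a query function over the table, as a generated API might."""
    def query(log_date: str, **filters: str) -> str:
        where = [f"log_date = '{log_date}'"]
        where += [f"{k} = '{v}'" for k, v in filters.items()]
        cols = ", ".join(dims + metrics)
        return (f"select {cols} from {table} "
                f"where {' and '.join(where)} "
                f"group by {', '.join(dims)}")
    return query

pv_api = build_data_api("business.metric", ["dim1"], ["sum(pv)"])
```

A real implementation would bind parameters safely rather than interpolate strings, but the shape (metadata-driven, no hand-written SQL at call sites) is the point.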
Sample ETL statement (preserved as code):
insert overwrite table iceberg_rta.dm_growth_dwd_rta_action_search_click_deeplink_l_1d_d
select
search_time,
click_time,
request_id,
click_id,
remote_ip,
platform,
app_id,
account_id,
rta_device,
click_device,
start_device,
source_id
from b_dwm.dm_growth_dwd_rta_action_search_click_deeplink_l_1d_d
where log_date='20220210'
distribute by source_id sort by source_id

Current challenges focus on further reducing data duplication, simplifying performance tuning across engines, and expanding open tools for security and quality control.
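The distribute by / sort by clauses in the ETL statement above cluster rows by source_id, so each data file covers a narrow key range. Iceberg records per-file min/max column statistics, letting a point query skip files whose range excludes the predicate. A toy illustration, with invented file names and stats:

```python
# Toy min/max file skipping: the effect sorting by source_id enables.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_source_id: int  # per-file column stats, as Iceberg tracks them
    max_source_id: int

def prune(files: list[DataFile], source_id: int) -> list[DataFile]:
    """Keep only files whose [min, max] range can contain source_id."""
    return [f for f in files
            if f.min_source_id <= source_id <= f.max_source_id]

# Sorted data yields tight, non-overlapping ranges...
files = [
    DataFile("part-0.parquet", 1, 100),
    DataFile("part-1.parquet", 101, 200),
    DataFile("part-2.parquet", 201, 300),
]
# ...so a point query reads one file instead of three. Unsorted data
# would give overlapping ranges and defeat the pruning.
```

This is how an HDFS-resident table can approach the second- or millisecond-level latencies mentioned earlier without copying data out to a dedicated serving store.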
Looking ahead, the lakehouse platform aims to enhance collaborative development, lower the barrier for data generation and consumption, and continue improving the data retrieval experience.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.