How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround
This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.
Background
Bilibili’s search service must index massive video, comment, and other resource types, handling billions of daily queries. The original offline indexing pipeline relied on MySQL for data storage, which became a performance bottleneck as data volume grew.
Challenges
MySQL load spikes during schema changes, bulk imports, and full‑index scans, causing high latency and replication lag.
Separate pipelines for full (base) and incremental (delta) data increased maintenance complexity.
Repeated tokenization and vectorization of unchanged documents wasted CPU cycles.
Inconsistent data freshness between full and delta indexes.
Design Goals
Select a storage layer that provides high capacity, high throughput, low latency, and schema‑evolution friendliness without requiring online DDL.
Storage Selection
After evaluating relational databases, column‑family stores, and key‑value stores, the internal high‑performance distributed KV store Taishan was chosen. Taishan offers ordered key scans, Compare‑And‑Swap (CAS) for optimistic concurrency, and fast bulk export.
Architecture Overview
Taishan serves as a unified table‑style data layer. Each document is a row identified by doc_id; fields are stored in column families (CF). Keys follow the pattern doc_id:cf_name, enabling efficient row‑wise scans and cache‑friendly access.
Key Design
Keys are lexical strings of the form doc_id:cf. Because doc_id is stored as big‑endian int64, lexical order matches numeric order, allowing sequential scans.
Value Design
Values start with a varint length header, followed by optional metadata and the payload, which maximizes storage density.
Serialization Strategy
JSON Lines proved inefficient for large‑scale indexing. The pipeline switched to Protocol Buffers, defining a compact Video message:
message Video {
int64 id = 1;
string title = 2;
string uname = 3;
repeated float doc_embedding = 4;
}Protobuf reduces storage size, speeds up deserialization, and avoids JavaScript max‑safe‑integer issues.
Efficient Merging
Multiple protobuf buffers can be concatenated directly, or merged via MergeFrom, eliminating costly decode‑encode cycles:
Video v;
v.ParseFromString(buf1 + buf2);
// Equivalent to:
Video v1, v2;
v1.ParseFromString(buf1);
v2.ParseFromString(buf2);
v1.MergeFrom(v2);Data Ingestion Layers
T+1 Batch : Periodic full loads from Hive/TSV/CSV files into Taishan.
Real‑time : Immediate writes to Taishan followed by a lightweight change‑event message (e.g., {"id":613621262,"cf_changed":["eb"]}) for downstream processing.
Write‑Conflict Handling
Both full and incremental writes use CAS. A write succeeds only if the key does not exist or the existing value matches the expected precondition; on conflict the operation is retried.
Incremental Computation
Stable intermediate results such as tokenized titles and embeddings are stored in Taishan. They are recomputed only when source fields change, dramatically reducing CPU usage.
Distributed Build with Spark
The index construction was migrated from cron scripts on bare metal to a Spark job running on a Kubernetes cluster. The pipeline consists of:
Read data from Taishan (sst files) via the official JNI library.
Decode Taishan’s custom format.
Re‑partition large files into manageable Spark partitions.
Apply per‑document transformations (filtering, field mapping).
Encode content into FlatBuffer or Protobuf and build forward/inverted indexes.
Compress and package the final index files.
This approach achieves hour‑level build times, higher concurrency, and better resource utilization compared with the previous low‑utilization container deployment.
Results
The new architecture halves the indexing cycle, removes MySQL bottlenecks, simplifies maintenance, and provides a scalable foundation for future feature growth.
References
https://protobuf.dev/programming-guides/encoding/#varints
https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html
https://research.google/pubs/large-scale-incremental-processing-using-distributed-transactions-and-notifications/
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
