How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark
Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.
Background
Bilibili’s search service must index massive amounts of videos, comments, articles, and other resources while handling billions of daily queries. The original offline indexing relied on a long, manual, and resource‑intensive workflow that combined data from MySQL, Hive, and various third‑party sources.
Data Sources and Indexing Workflow
Three primary data sources feed the video index:
Data A – third‑party database with binlog.
Data B – exported full snapshots plus API and streaming data.
Data C – Hive tables.
Both full (base) and incremental (delta) indexes are produced: full indexes are generated daily as files, while incremental indexes are streamed as messages containing the complete fields of each changed document.
Challenges
Performance: Growing data volume caused MySQL load spikes during schema changes, bulk imports, and full‑index scans, leading to replication lag and delayed index updates.
Maintenance Cost: Diverse data sources required complex, duplicated ingestion logic; full and incremental pipelines diverged, making debugging and iteration difficult.
Resource Consumption: Re‑tokenizing and rebuilding indexes for unchanged data wasted CPU and storage.
Design Goals
Consolidate all index‑related data into a single, high‑capacity storage layer.
Eliminate performance bottlenecks and simplify the ingestion pipeline.
Cache stable intermediate results (e.g., tokenization, embeddings) to avoid recomputation.
Storage Selection – Taishan KV
Taishan is Bilibili’s internal distributed KV store offering horizontal scalability, ordered keys, CAS (compare‑and‑swap) for optimistic locking, and efficient bulk export to object storage.
Architecture Overview
On top of Taishan, a table‑model abstraction is built:
Row: one video identified by its ID.
Column (Column Family): groups related fields (e.g., title, uname).
Cell: a KV entry representing a field value.
Key format is videoID:columnFamily, enabling sequential scans by ID and high cache locality.
Key and Value Design
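Keys are ordered byte strings: the video ID is encoded as a big‑endian int64, so lexicographic byte order matches numeric order (for non‑negative IDs) and scans return rows in ID order. Values start with a varint header indicating the length of metadata, followed by the actual payload.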
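A minimal sketch of this key and value layout, to make the ordering argument concrete. The function names are hypothetical, and the exact value layout (varint length, then metadata bytes, then payload) is an assumption; the article only states that a varint header gives the metadata length.

```python
import struct

def encode_key(video_id, column_family):
    # Fixed-width big-endian unsigned 64-bit ID, then ':' and the column family.
    # Fixed width plus big-endian order makes byte-wise sorting match numeric sorting.
    return struct.pack(">Q", video_id) + b":" + column_family.encode("utf-8")

def encode_varint(n):
    # Base-128 varint: 7 payload bits per byte, high bit set on every byte but the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos=0):
    # Returns (decoded integer, position of the first byte after the varint).
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def encode_value(metadata, payload):
    # Assumed layout: [varint len(metadata)] [metadata] [payload]
    return encode_varint(len(metadata)) + metadata + payload

def decode_value(value):
    meta_len, pos = decode_varint(value)
    return value[pos:pos + meta_len], value[pos + meta_len:]

# Keys sort in numeric ID order even though they are compared as raw bytes.
ids = [300, 2, 70000]
keys_sorted = sorted(encode_key(i, "title") for i in ids)
```

Decimal string keys would not have this property ("10" sorts before "9"), which is one reason to prefer a fixed-width binary encoding for scan-heavy workloads.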
Serialization – Protobuf over JSON
Original pipelines stored raw data as JSON Lines, which incurred high storage overhead and slow parsing. Switching to Protocol Buffers reduced size, improved deserialization speed, and provided type safety.
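message Video {
  int64 id = 1;
  string title = 2;
  string uname = 3;
  repeated float doc_embedding = 4;
}
Serialized protobuf messages of the same type can be concatenated directly to merge them without full deserialization: when the result is parsed, the last value wins for singular scalar fields, and repeated fields are concatenated.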
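The merge-by-concatenation property can be illustrated without the protobuf library. The sketch below hand-encodes fields 1 to 3 of the Video message in the protobuf wire format and parses the concatenation of a full record and a partial update. The helper names are hypothetical, and the decoder only handles singular varint and length-delimited fields, so it is an illustration rather than a general parser (repeated fields such as doc_embedding would be appended, not overwritten).

```python
def varint(n):
    # Encode a non-negative integer as a base-128 varint.
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_varint(buf, pos):
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def encode_int_field(number, value):
    # Tag = (field_number << 3) | wire_type; wire type 0 is varint.
    return varint((number << 3) | 0) + varint(value)

def encode_str_field(number, text):
    # Wire type 2 is length-delimited: tag, byte length, bytes.
    data = text.encode("utf-8")
    return varint((number << 3) | 2) + varint(len(data)) + data

def decode(buf):
    # Returns {field_number: value}; a later occurrence of a singular
    # field overwrites an earlier one ("last one wins").
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        fields[number] = value
    return fields

old = (encode_int_field(1, 613621262)
       + encode_str_field(2, "old title")
       + encode_str_field(3, "alice"))
patch = encode_str_field(2, "new title")   # partial update: title only
merged = decode(old + patch)               # plain byte concatenation
```

Here `merged` keeps the id and uname from the original record and takes the title from the patch, without the writer ever having parsed `old`.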
Data Ingestion Layers
All source data (full snapshots, incremental streams, T+1 updates) are written into Taishan tables. T+1 data is batch‑loaded, while real‑time streams write first to Taishan then emit a lightweight change message (e.g., {"id":613621262,"cf_changed":["eb"]}) for downstream consumers.
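The write-then-notify order for real-time streams might look like the following sketch. The `Ingestor` class, the in-memory dict standing in for Taishan, and the list standing in for the message bus are all assumptions for illustration; only the message shape comes from the example above.

```python
import json

class Ingestor:
    def __init__(self):
        self.store = {}   # stand-in for Taishan: (video_id, column_family) -> bytes
        self.queue = []   # stand-in for the downstream message bus

    def write_realtime(self, video_id, changes):
        # 1) Persist first, so a consumer that receives the notification
        #    can always read the data it refers to.
        for column_family, value in changes.items():
            self.store[(video_id, column_family)] = value
        # 2) Then emit a lightweight message: the ID and the changed
        #    column families only, never the values themselves.
        message = json.dumps(
            {"id": video_id, "cf_changed": sorted(changes)},
            separators=(",", ":"),
        )
        self.queue.append(message)
        return message

ingestor = Ingestor()
msg = ingestor.write_realtime(613621262, {"eb": b"\x01\x02"})
```

Keeping values out of the message keeps the stream small and makes the KV store the single source of truth for consumers.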
Conflict Handling
Write‑write conflicts are resolved using CAS operations. For full‑vs‑incremental writes, CAS ensures that a new value is written only if the cell is empty. For concurrent incremental writers, CAS checks that the existing value matches the expected old value before committing; otherwise the operation retries.
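Both CAS patterns can be sketched against a small in-memory store. `CasStore` and its method names are hypothetical stand-ins, not Taishan's client API; a lock simulates the atomicity the real store would provide.

```python
import threading

class CasStore:
    """In-memory stand-in for a KV store with compare-and-swap."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Write `new` only if the current value equals `expected`
        # (None means "the cell is empty").
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

def full_write(store, key, value):
    # Full (bulk) load: fill the cell only if it is empty, so a snapshot
    # value never overwrites something an incremental writer already stored.
    return store.compare_and_swap(key, None, value)

def incremental_update(store, key, update_fn, max_retries=10):
    # Optimistic loop: read, compute, commit only if nobody changed the
    # cell in between; otherwise re-read and retry.
    for _ in range(max_retries):
        old = store.get(key)
        if store.compare_and_swap(key, old, update_fn(old)):
            return True
    return False

store = CasStore()
incremental_update(store, "42:title", lambda old: b"fresh")
snapshot_applied = full_write(store, "42:title", b"stale")   # rejected
empty_cell_filled = full_write(store, "43:title", b"snap")   # accepted
```

The retry limit is an assumption of this sketch; a production writer would also need a policy for what to do when retries are exhausted.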
Incremental Computation
Stable intermediate results such as tokenized titles and embeddings are stored in Taishan. When a video’s attributes change, only the affected computations are re‑run, and the new results are written back using CAS to avoid race conditions.
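A minimal sketch of selective recomputation, assuming a hypothetical dependency map from each derived column to the source column it is computed from. Whitespace splitting stands in for real tokenization, and the CAS write-back is omitted for brevity.

```python
# Hypothetical map: derived column -> (source column, compute function).
DERIVED = {
    "title_tokens": ("title", lambda text: text.lower().split()),
    "uname_tokens": ("uname", lambda text: text.lower().split()),
}

def recompute(row, changed_columns, calls):
    # Re-run only the derived columns whose source column changed;
    # everything else keeps its cached value. `calls` records which
    # computations actually ran, for inspection.
    for derived, (source, compute) in DERIVED.items():
        if source in changed_columns:
            row[derived] = compute(row[source])
            calls.append(derived)
    return row

row = {
    "title": "New Title",
    "uname": "Alice",
    "title_tokens": ["old", "title"],   # stale: title was just edited
    "uname_tokens": ["alice"],          # still valid, must not be recomputed
}
calls = []
recompute(row, {"title"}, calls)
```

After the call, only the title tokens have been recomputed; the cached uname tokens are untouched.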
Data Export Layer
Taishan periodically backs up full data to object storage, allowing fast full‑index extraction by scanning rows and selecting configured columns. Incremental indexes are built by consuming the change stream, looking up the full record in Taishan, and emitting the delta.
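The two export paths can be sketched as follows. The cell dictionary stands in for a Taishan backup, and `INDEXED_COLUMNS` is a hypothetical export configuration; neither reflects the real export format.

```python
import json
from itertools import groupby

INDEXED_COLUMNS = ("title", "uname")   # hypothetical: columns the index needs

def full_export(cells):
    # cells: {(video_id, column_family): value}.
    # Scan in key order, group cells by row, keep only configured columns.
    ordered = sorted(cells.items())
    for video_id, group in groupby(ordered, key=lambda item: item[0][0]):
        doc = {cf: value for (_, cf), value in group if cf in INDEXED_COLUMNS}
        yield {"id": video_id, **doc}

def build_delta(cells, change_message):
    # Consume one change message, look up the full record, and emit a
    # delta document that carries all indexed fields of that video.
    video_id = json.loads(change_message)["id"]
    doc = {cf: cells[(video_id, cf)]
           for cf in INDEXED_COLUMNS if (video_id, cf) in cells}
    return {"id": video_id, **doc}

cells = {
    (2, "title"): "b", (2, "uname"): "u2",
    (1, "title"): "a", (1, "uname"): "u1", (1, "eb"): "vector-bytes",
}
base_docs = list(full_export(cells))
delta_doc = build_delta(cells, '{"id":2,"cf_changed":["title"]}')
```

Because the delta is rebuilt from the stored record rather than from the message, full and incremental documents go through the same column selection.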
Distributed Build with Spark
The indexing job was migrated from cron scripts on bare metal to a Spark‑based pipeline running on a Kubernetes cluster. The workflow includes:
Read exported Taishan SST files via a JNI wrapper.
Decode the custom format.
Re‑partition large files into manageable Spark partitions.
Apply per‑video processing (filtering, field mapping).
Encode data into FlatBuffer/Protobuf and build the index.
Compress and package the final index files.
This approach dramatically increased concurrency, reduced build time from days to hours, and improved resource utilization.
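The shape of the workflow (re-partition, per-video processing in parallel, encode, compress) can be illustrated in plain Python. This is not Spark code: a thread pool stands in for Spark executors, JSON Lines stands in for the FlatBuffer/Protobuf encoding, and the filter rule and field mapping are hypothetical.

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor

def repartition(records, size):
    # Split one large input into fixed-size partitions.
    return [records[i:i + size] for i in range(0, len(records), size)]

def process_partition(partition):
    out = []
    for video in partition:
        if video.get("deleted"):      # filtering (hypothetical rule)
            continue
        out.append({"id": video["id"], "t": video["title"]})   # field mapping
    return out

def build_index(records, partition_size=2, workers=4):
    partitions = repartition(records, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in partition order, so output is deterministic.
        processed = list(pool.map(process_partition, partitions))
    docs = [doc for part in processed for doc in part]
    encoded = "\n".join(json.dumps(d, sort_keys=True) for d in docs)
    return gzip.compress(encoded.encode("utf-8"))   # compress and package

records = [
    {"id": 1, "title": "a"},
    {"id": 2, "title": "b", "deleted": True},
    {"id": 3, "title": "c"},
    {"id": 4, "title": "d"},
    {"id": 5, "title": "e"},
]
blob = build_index(records)
lines = gzip.decompress(blob).decode("utf-8").splitlines()
```

Since each partition is processed independently, throughput scales with the number of workers, which is the property the Spark migration relies on.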
Results and Outlook
By unifying storage, adopting protobuf, and leveraging Spark, Bilibili cut the indexing cycle by more than half, eliminated major performance bottlenecks, and simplified future feature iteration. The new architecture positions the search service for continued scaling as data volume grows.
References
https://protobuf.dev/programming-guides/encoding/#varints
https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html
https://research.google/pubs/large-scale-incremental-processing-using-distributed-transactions-and-notifications/
https://protobuf.dev/programming-guides/encoding/#last-one-wins
