How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark
This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.
Background
Bilibili's search service must index massive amounts of videos, comments, and other media, handling billions of daily queries. The original offline indexing workflow relied on manual, MySQL‑centric processes that became increasingly complex and resource‑intensive as data sources multiplied.
Index Data Production Process
Both full (base) and incremental (delta) indexes are generated. Full indexes are produced daily as files containing all data up to a specific point, while incremental indexes are streamed as messages, each representing a complete document record.
Multiple third‑party data sources (e.g., MySQL binlog, exported snapshots, Hive tables) are merged with the internal video MySQL before indexing, leading to a tangled and hard‑to‑maintain pipeline.
Challenges
Performance: Growing MySQL load from schema changes, bulk imports, and full‑index scans caused bottlenecks and replication lag.
Maintenance Cost: Diverse data origins required bespoke stitching logic, making iteration slow and error‑prone.
Resource Consumption: Re‑tokenizing and rebuilding indexes for unchanged data wasted CPU and storage.
Design Goals
Unify all index‑related data into a single, scalable storage layer that supports high capacity, high throughput, low latency, and flexible schema evolution without online DDL.
Storage Selection
After evaluating relational databases, column‑family stores, and KV systems, the team chose Bilibili's internal distributed KV store Taishan as the foundation, wrapping it with a row‑oriented table model.
Taishan offers ordered keys, Compare‑And‑Swap (CAS) for optimistic locking, and efficient bulk export to object storage.
Architecture Design
On top of Taishan, a data storage layer (table model) and a unified import/export layer were built, decoupling raw sources from indexing logic.
Both offline and near‑line jobs now read from the same unified source, eliminating inconsistencies between full and incremental pipelines.
Data Storage Layer
The table model maps each document to a row identified by its ID. Columns (column families) store related fields, and cells correspond to KV entries.
Key design uses "ID:CF" format, enabling sequential row scans with good cache locality. Values start with a varint header followed by metadata and the actual payload.
Serialization
JSON Lines proved inefficient in space and parsing speed. The team switched to Protocol Buffers, defining a
message Video { int64 id = 1; string title = 2; string uname = 3; repeated float doc_embedding = 4; }schema, gaining compact storage, faster deserialization, and type safety.
Protobuf buffers can be concatenated directly for merging, avoiding costly decode‑encode cycles.
Change Data Stream
Taishan can export binlog entries, but raw binlogs generate excessive traffic and lack full document values. A custom write layer emits lightweight change messages like {"id":613621262,"cf_changed":["eb"]}, allowing downstream consumers to fetch the latest full record from the primary node.
Data Ingestion Layer
All source data (full, incremental, T+1 snapshots) are written into Taishan tables. T+1 data are batch‑loaded, while real‑time streams write first to Taishan then emit change messages, preserving ordering and consistency.
Write conflicts are resolved using CAS: full‑vs‑incremental writes require the key to be absent, and concurrent incremental writes require the existing value to match before updating, with retries on failure.
Incremental Computation
Stable intermediate results such as tokenized titles and embeddings are stored in Taishan. They are recomputed only when the source fields change, dramatically reducing redundant processing.
Data Export Layer
Export jobs read from Taishan, selecting configured columns for full or incremental index builds. Full exports are performed daily by scanning Taishan backups (minutes‑scale), while incremental exports consume the change stream and fetch missing fields on demand.
Distributed Build
The indexing pipeline migrated from cron‑based scripts on physical machines to Spark jobs running on a Kubernetes cluster. The Spark workflow includes reading Taishan SST files, decoding, partitioning large files, applying per‑document transformations, encoding to FlatBuffers/Protobuf, building forward and inverted indexes, and finally compressing the results.
This approach boosts concurrency, utilizes resources efficiently, and reduces the end‑to‑end build time from days to hours.
Conclusion & Outlook
By consolidating data in a distributed KV store, adopting protobuf serialization, and leveraging Spark for parallel builds, Bilibili cut its index construction cycle by more than half, eliminated major performance bottlenecks, and created a flexible foundation for future scaling.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
