Big Data 25 min read

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Open Source Tech Hub

Oct 31, 2024

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Background

Bilibili’s search service must index massive amounts of videos, comments, articles, and other resources, handling billions of daily queries. The original offline indexing relied on a long, manual, and resource‑intensive workflow that combined data from MySQL, Hive, and various third‑party sources.

Data Sources and Indexing Workflow

Three primary data sources feed the video index:

Data A – third‑party database with binlog.

Data B – exported full snapshots plus API and streaming data.

Data C – Hive tables.

Both full (base) and incremental (delta) indexes are produced: full indexes are generated daily as files, while incremental indexes are streamed as messages containing the complete fields of each changed document.

Challenges

Performance : Growing data volume caused MySQL load spikes during schema changes, bulk imports, and full‑index scans, leading to replication lag and delayed index updates.

Maintenance Cost : Diverse data sources required complex, duplicated ingestion logic; full and incremental pipelines diverged, making debugging and iteration difficult.

Resource Consumption : Re‑tokenizing and rebuilding indexes for unchanged data wasted CPU and storage.

Design Goals

Consolidate all index‑related data into a single, high‑capacity storage layer.

Eliminate performance bottlenecks and simplify the ingestion pipeline.

Cache stable intermediate results (e.g., tokenization, embeddings) to avoid recomputation.

Storage Selection – Taishan KV

Taishan is Bilibili’s internal distributed KV store offering horizontal scalability, ordered keys, CAS (compare‑and‑swap) for optimistic locking, and efficient bulk export to object storage.

Architecture Overview

On top of Taishan, a table‑model abstraction is built:

Row : one video identified by its ID.

Column (Column Family) : groups related fields (e.g., title, uname).

Cell : a KV entry representing a field value.

Key format is videoID:columnFamily, enabling sequential scans by ID and high cache locality.

Key and Value Design

Keys are ordered strings (big‑endian int64 IDs) so scans follow numeric order. Values start with a varint header indicating the length of metadata, followed by the actual payload.

Serialization – Protobuf over JSON

Original pipelines stored raw data as JSON Lines, which incurred high storage overhead and slow parsing. Switching to Protocol Buffers reduced size, improved deserialization speed, and provided type safety.

message Video {
  int64 id = 1;
  string title = 2;
  string uname = 3;
  repeated float doc_embedding = 4;
}

Protobuf buffers can be concatenated directly for efficient merging, avoiding full deserialization.

Data Ingestion Layers

All source data (full snapshots, incremental streams, T+1 updates) are written into Taishan tables. T+1 data is batch‑loaded, while real‑time streams write first to Taishan then emit a lightweight change message (e.g., {"id":613621262,"cf_changed":["eb"]}) for downstream consumers.

Conflict Handling

Write‑write conflicts are resolved using CAS operations. For full‑vs‑incremental writes, CAS ensures that a new value is written only if the cell is empty. For concurrent incremental writers, CAS checks that the existing value matches the expected old value before committing; otherwise the operation retries.

Incremental Computation

Stable intermediate results such as tokenized titles and embeddings are stored in Taishan. When a video’s attributes change, only the affected computations are re‑run, and the new results are written back using CAS to avoid race conditions.

Data Export Layer

Taishan periodically backs up full data to object storage, allowing fast full‑index extraction by scanning rows and selecting configured columns. Incremental indexes are built by consuming the change stream, looking up the full record in Taishan, and emitting the delta.

Distributed Build with Spark

The indexing job was migrated from cron scripts on bare metal to a Spark‑based pipeline running on a Kubernetes cluster. The workflow includes:

Read exported Taishan SST files via a JNI wrapper.

Decode the custom format.

Re‑partition large files into manageable Spark partitions.

Apply per‑video processing (filtering, field mapping).

Encode data into FlatBuffer/Protobuf and build the index.

Compress and package the final index files.

This approach dramatically increased concurrency, reduced build time from days to hours, and improved resource utilization.

Results and Outlook

By unifying storage, adopting protobuf, and leveraging Spark, Bilibili cut the indexing cycle by more than half, eliminated major performance bottlenecks, and simplified future feature iteration. The new architecture positions the search service for continued scaling as data volume grows.

References

https://protobuf.dev/programming-guides/encoding/#varints

https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html

https://research.google/pubs/large-scale-incremental-processing-using-distributed-transactions-and-notifications/

https://protobuf.dev/programming-guides/encoding/#last-one-wins

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Indexing protobuf Distributed storage Spark Search KV store

Written by

Open Source Tech Hub

Sharing cutting-edge internet technologies and practical AI resources.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.