Industry Insights 24 min read

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Architect

Sep 24, 2024

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

Background

Bilibili’s search service must index massive video, comment, and other resource types, handling billions of daily queries. The original offline indexing pipeline relied on MySQL for data storage, which became a performance bottleneck as data volume grew.

Challenges

MySQL load spikes during schema changes, bulk imports, and full‑index scans, causing high latency and replication lag.

Separate pipelines for full (base) and incremental (delta) data increased maintenance complexity.

Repeated tokenization and vectorization of unchanged documents wasted CPU cycles.

Inconsistent data freshness between full and delta indexes.

Design Goals

Select a storage layer that provides high capacity, high throughput, low latency, and schema‑evolution friendliness without requiring online DDL.

Storage Selection

After evaluating relational databases, column‑family stores, and key‑value stores, the internal high‑performance distributed KV store Taishan was chosen. Taishan offers ordered key scans, Compare‑And‑Swap (CAS) for optimistic concurrency, and fast bulk export.

Architecture Overview

Taishan serves as a unified table‑style data layer. Each document is a row identified by doc_id; fields are stored in column families (CF). Keys follow the pattern doc_id:cf_name, enabling efficient row‑wise scans and cache‑friendly access.

Key Design

Keys are lexical strings of the form doc_id:cf. Because doc_id is stored as big‑endian int64, lexical order matches numeric order, allowing sequential scans.

Value Design

Values start with a varint length header, followed by optional metadata and the payload, which maximizes storage density.

Serialization Strategy

JSON Lines proved inefficient for large‑scale indexing. The pipeline switched to Protocol Buffers, defining a compact Video message:

message Video {
    int64 id = 1;
    string title = 2;
    string uname = 3;
    repeated float doc_embedding = 4;
}

Protobuf reduces storage size, speeds up deserialization, and avoids JavaScript max‑safe‑integer issues.

Efficient Merging

Multiple protobuf buffers can be concatenated directly, or merged via MergeFrom, eliminating costly decode‑encode cycles:

Video v;
v.ParseFromString(buf1 + buf2);
// Equivalent to:
Video v1, v2;
v1.ParseFromString(buf1);
v2.ParseFromString(buf2);
v1.MergeFrom(v2);

Data Ingestion Layers

T+1 Batch : Periodic full loads from Hive/TSV/CSV files into Taishan.

Real‑time : Immediate writes to Taishan followed by a lightweight change‑event message (e.g., {"id":613621262,"cf_changed":["eb"]}) for downstream processing.

Write‑Conflict Handling

Both full and incremental writes use CAS. A write succeeds only if the key does not exist or the existing value matches the expected precondition; on conflict the operation is retried.

Incremental Computation

Stable intermediate results such as tokenized titles and embeddings are stored in Taishan. They are recomputed only when source fields change, dramatically reducing CPU usage.

Distributed Build with Spark

The index construction was migrated from cron scripts on bare metal to a Spark job running on a Kubernetes cluster. The pipeline consists of:

Read data from Taishan (sst files) via the official JNI library.

Decode Taishan’s custom format.

Re‑partition large files into manageable Spark partitions.

Apply per‑document transformations (filtering, field mapping).

Encode content into FlatBuffer or Protobuf and build forward/inverted indexes.

Compress and package the final index files.

This approach achieves hour‑level build times, higher concurrency, and better resource utilization compared with the previous low‑utilization container deployment.

Results

The new architecture halves the indexing cycle, removes MySQL bottlenecks, simplifies maintenance, and provides a scalable foundation for future feature growth.

References

https://protobuf.dev/programming-guides/encoding/#varints

https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html

https://research.google/pubs/large-scale-incremental-processing-using-distributed-transactions-and-notifications/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Big Data Indexing protobuf storage Spark Search

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.