
How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu Geek Talk

Business Background

Baidu indexes massive web content and must extract semantic, quality, and safety signals at trillion‑scale. The rapid growth of image and video data and the adoption of large models dramatically increase compute demand, creating severe cost and efficiency challenges.

Key Ideas

Cost optimization: Expand the compute pool by leveraging idle offline resources and an elastic scheduling system; lower per‑request cost through model‑level optimizations, GPU‑specific tuning, and Baidu’s Kunlun chips.

Efficiency optimization: Separate real‑time and offline pipelines; accelerate feature updates with a unified batch‑compute platform; streamline model engineering via a unified service framework.

Technical Solution

Overall Architecture

The system consists of four core components: Model Service Framework, Model Service Platform, Compute Scheduling System, and Batch Compute Platform.

Model Service Framework

Python is used for rapid development, but the Global Interpreter Lock (GIL) limits CPU parallelism. The framework mitigates this by combining multi‑process execution, asynchronous coroutines, and explicit CPU/GPU separation across three process types: RPC, DAG, and Model processes.

RPC process uses Baidu‑customized BRPC (https://github.com/apache/brpc) with multi‑process and coroutine support, achieving a more than 5× performance improvement.

DAG process executes directed‑acyclic‑graph (DAG) tasks asynchronously, fully utilizing CPU cores.

Model process performs GPU inference, supporting PyTorch, PaddlePaddle, and other engines. Tensor data are transferred via shared memory to avoid copies.
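The shared‑memory handoff between the DAG and Model processes can be illustrated with a minimal sketch using Python's standard multiprocessing.shared_memory module. The function names (publish_tensor, infer_in_place, read_tensor) are hypothetical and not taken from Baidu's framework, the doubling step merely stands in for GPU inference, and both sides run in one process here to keep the example short. The point is that only the block name and element count need to cross the process boundary, not the payload.

```python
import struct
from multiprocessing import shared_memory

FLOAT_SIZE = struct.calcsize("f")  # bytes per float32 element


def publish_tensor(values):
    """Producer side (e.g. a DAG process): write float32 values into a new block."""
    shm = shared_memory.SharedMemory(create=True, size=FLOAT_SIZE * len(values))
    struct.pack_into(f"{len(values)}f", shm.buf, 0, *values)
    return shm  # the caller owns the block and must close() and unlink() it


def infer_in_place(shm_name, count):
    """Consumer side (e.g. a Model process): attach by name, overwrite inputs with outputs."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        inputs = struct.unpack_from(f"{count}f", shm.buf, 0)
        outputs = [x * 2.0 for x in inputs]  # placeholder for real GPU inference
        struct.pack_into(f"{count}f", shm.buf, 0, *outputs)
    finally:
        shm.close()  # detach only; the producer unlinks the block


def read_tensor(shm, count):
    """Read the float32 values currently stored in the block."""
    return list(struct.unpack_from(f"{count}f", shm.buf, 0))
```

In a real service the consumer would live in a separate process and would typically wrap shm.buf in an array type rather than unpacking into Python floats; struct is used here only to stay within the standard library.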

Inference Optimizations

Dynamic batching and multi‑Stream execution increase GPU throughput.

Poros inference engine (https://github.com/PaddlePaddle/FastDeploy/tree/develop/poros) integrates TorchScript, graph optimization, TensorRT, and vLLM, enabling high‑performance inference with minimal code changes.

Model quantization (FP16/INT8/INT4) and compression (distillation, pruning) further boost throughput.
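To make the dynamic batching idea concrete, here is a minimal asyncio sketch. It is an illustration under assumptions, not the framework's actual implementation: the DynamicBatcher class, its parameters (max_batch_size, max_wait_s), and the demo function are all hypothetical. Requests are queued, and a background loop flushes a batch once it is full or once the first request in it has waited long enough. Multi‑stream execution is not covered.

```python
import asyncio


class DynamicBatcher:
    """Hypothetical helper that groups concurrent requests into one model call.

    Create it inside a running event loop so the queue binds to that loop.
    """

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.005):
        self._batch_fn = batch_fn  # takes a list of inputs, returns a list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()
        self.batch_sizes = []  # sizes of flushed batches, kept for inspection

    async def infer(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        """Background loop: flush when the batch is full or the wait budget is spent."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self._batch_fn([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                if not future.done():  # the caller may have been cancelled
                    future.set_result(output)


async def demo():
    """Send six concurrent requests through a batcher capped at four per batch."""
    batcher = DynamicBatcher(lambda xs: [x * x for x in xs], max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(i) for i in range(6)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batcher.batch_sizes
```

The wait budget trades latency for throughput: a larger max_wait_s gives the GPU fuller batches at the cost of holding early requests longer.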

Compute Scheduling System

All requests pass through a unified FeatureGateway that performs flow control and routing. Offline jobs are submitted to SmartScheduler, which discovers idle heterogeneous resources across Baidu’s PaaS, deploys operators, and updates gateway policies.
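The article does not describe how FeatureGateway implements flow control, so the following is only a sketch of one common approach: a per‑operator token bucket combined with round‑robin routing. The TokenBucket and Gateway classes, the set_policy method, and the backend names are hypothetical. Time is passed in explicitly so the behavior is easy to reason about.

```python
import itertools


class TokenBucket:
    """Token-bucket rate limiter; 'now' is supplied by the caller in seconds."""

    def __init__(self, rate_qps, burst, now=0.0):
        self.rate = rate_qps
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    """Hypothetical gateway: per-operator rate limit plus round-robin routing."""

    def __init__(self):
        self._limits = {}
        self._backends = {}

    def set_policy(self, operator, rate_qps, burst, backends):
        # A scheduler would call this whenever capacity or placement changes.
        self._limits[operator] = TokenBucket(rate_qps, burst)
        self._backends[operator] = itertools.cycle(backends)

    def route(self, operator, now):
        """Return a backend name, or None if the request is unknown or throttled."""
        bucket = self._limits.get(operator)
        if bucket is None or not bucket.allow(now):
            return None
        return next(self._backends[operator])
```

Under this design, a scheduler only has to call set_policy to change both the admitted rate and the set of backends for an operator, which matches the description of policies being pushed to the gateway.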

Key challenges include operator capacity bottlenecks, traffic distribution shifts, heterogeneous hardware allocation, and priority handling. The solution adopts a two‑stage scheduler:

Traffic scheduling normalizes QPS demands, assigns operator capacity based on priority, and updates gateway routing.

Resource scheduling computes the capacity gaps, fills them from hardware queues in order of cost‑effectiveness, pre‑assigns sub‑services, and fine‑tunes allocations to avoid waste.
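The two stages can be sketched as a pair of small functions. This is a simplified illustration under assumptions, not SmartScheduler's algorithm: the Demand and HardwareQueue types, their fields, and the greedy cost‑per‑QPS ordering are all hypothetical, and QPS normalization, sub‑service pre‑assignment, and the final fine‑tuning pass are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Demand:
    job: str
    operator: str
    qps: float
    priority: int  # lower number means higher priority (an assumption)


@dataclass
class HardwareQueue:
    name: str
    idle_instances: int
    qps_per_instance: float
    cost_per_instance: float


def schedule_traffic(demands, capacity):
    """Stage 1: grant existing operator capacity by priority and report the gaps."""
    remaining = dict(capacity)  # operator -> QPS still unassigned
    grants, gaps = {}, {}
    for d in sorted(demands, key=lambda d: d.priority):
        granted = min(d.qps, remaining.get(d.operator, 0.0))
        remaining[d.operator] = remaining.get(d.operator, 0.0) - granted
        grants[(d.job, d.operator)] = granted
        if d.qps > granted:
            gaps[d.operator] = gaps.get(d.operator, 0.0) + (d.qps - granted)
    return grants, gaps


def schedule_resources(gap_qps, queues):
    """Stage 2: cover one operator's gap, cheapest cost per QPS first."""
    plan = {}
    for q in sorted(queues, key=lambda q: q.cost_per_instance / q.qps_per_instance):
        if gap_qps <= 0:
            break
        take = min(math.ceil(gap_qps / q.qps_per_instance), q.idle_instances)
        if take:
            plan[q.name] = take
            gap_qps -= take * q.qps_per_instance
    return plan, max(gap_qps, 0.0)  # second value is the QPS still unmet
```

Because instances come in whole units, the greedy fill can over‑provision on the last queue it touches; a fine‑tuning step like the one the article mentions would be the place to trim that excess.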

Batch Compute Platform

To overcome Scan throughput limits on Baidu’s Table storage, the platform separates OLTP and OLAP workloads and builds an HTAP storage layer.

Dedicated OLAP store reduces read‑write interference.

HTAP store, built on RocksDB and Baidu’s AFS, uses incremental snapshots and a column‑wise layout for hot fields.

HTAP SDK provides a unified API for accessing both Table and OLAP stores.
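As a rough mental model of the storage split described above, the toy class below keeps a row store for point reads and writes, a column store for scans over hot fields, and an incremental snapshot step that copies only changed rows between them. Everything here is an in‑memory stand‑in with hypothetical names (HtapClient, put, get, snapshot, scan); the real system is built on RocksDB and AFS, and its SDK surface is not documented in the article.

```python
class HtapClient:
    """Toy unified client: row store for OLTP access, column store for OLAP scans."""

    def __init__(self, hot_columns):
        self._rows = {}  # row store: key -> {column: value}
        self._hot = {c: {} for c in hot_columns}  # column store: column -> {key: value}
        self._dirty = set()  # keys written since the last snapshot

    def put(self, key, row):
        """OLTP write: merge columns into the row and mark it dirty."""
        self._rows.setdefault(key, {}).update(row)
        self._dirty.add(key)

    def get(self, key):
        """OLTP point read, served from the row store."""
        return dict(self._rows.get(key, {}))

    def snapshot(self):
        """Incremental snapshot: copy only rows changed since the previous call."""
        for key in self._dirty:
            row = self._rows[key]
            for column, values in self._hot.items():
                if column in row:
                    values[key] = row[column]
        synced = len(self._dirty)
        self._dirty.clear()
        return synced

    def scan(self, column):
        """OLAP scan over one hot column; never touches the row store."""
        return sorted(self._hot[column].items())
```

Scans see data only as of the last snapshot, which is the price of keeping analytical reads from interfering with online writes.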

Task generation supports three modes:

Configuration‑driven: High‑level UI configuration for common tasks.

KQL: A custom SQL‑like language that allows user‑defined functions.

Offline framework: Full‑featured framework where users implement custom logic and submit a deployment package.

All tasks are ultimately executed on the offline framework; KQL queries are parsed into executable jobs, and framework‑based tasks undergo DevOps gating before deployment.

Code Example

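Function classify = {
    def classify(cbytes, ids):
        # Interpret cbytes as a little-endian integer and test it one byte at a time.
        unique_ids = set(ids)
        packed = int.from_bytes(cbytes, byteorder='little', signed=False)
        # Stop once no set bits remain; any leftover high-order zero bytes are skipped.
        while packed != 0:
            low_byte = packed & 0xFF
            if low_byte in unique_ids:
                return True
            packed >>= 8  # drop the byte just checked
        return False
}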
declare ids = [2, 8];
select * from my_table
convert by json outlet by row filter by function@classify(@cf0:types, @ids);
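The Python body of the user‑defined function can be exercised on its own, outside KQL. The standalone copy below reads the field as a run of one‑byte IDs, which is an interpretation of the code rather than a documented format, and the sample byte strings are made up for illustration.

```python
def classify(cbytes: bytes, ids) -> bool:
    """Return True if any byte of cbytes (scanned low to high) is one of ids."""
    unique_ids = set(ids)
    packed = int.from_bytes(cbytes, byteorder="little", signed=False)
    while packed != 0:
        if (packed & 0xFF) in unique_ids:
            return True
        packed >>= 8  # drop the byte just checked
    return False


# Hypothetical sample rows: each byte is treated as one type ID.
matches = classify(bytes([5, 8, 1]), [2, 8])     # contains 8
no_match = classify(bytes([5, 1]), [2, 8])       # contains neither 2 nor 8
zero_only = classify(bytes([0]), [0])            # loop never runs for an all-zero value
```

One consequence of the loop condition is worth knowing: an ID of 0 can never match when it is the only or the highest remaining byte, because the loop exits as soon as the remaining value is zero.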

Conclusion

The system now serves dozens of search scenarios (image, video, etc.), supports hundreds of operators, and processes hundreds of billions of compute calls daily, enabling trillion‑scale content updates. With the rise of large AI models, Baidu plans to further explore model‑driven innovations.
