Artificial Intelligence 18 min read

How Baidu Scales Content Understanding to Trillion‑Level Data: Architecture, Cost & Efficiency Strategies

Baidu processes trillions of web items by building a deep‑content‑understanding pipeline that tackles massive compute cost and latency through elastic resource pooling, Python‑based model‑service frameworks, multi‑stage scheduling, HTAP storage, and batch‑compute optimizations, enabling real‑time and offline AI services at web scale.

Architect

Dec 21, 2023

Background

Baidu indexes the entire Internet and must extract multi‑dimensional signals—semantic, quality, safety—from a trillion‑scale corpus to support filtering, semantic indexing, and downstream services. The primary challenges are the enormous compute cost (driven by massive data volume, rising multimedia content, and large‑model inference) and the need for high engineering efficiency.

Cost and Efficiency Strategies

Cost optimization expands the compute pool through procurement and elastic scheduling, and reduces unit cost by improving service performance (model structure tuning, GPU/CPU optimizations, Baidu’s Kunlun chip).

Efficiency optimization accelerates model engineering and offline task throughput via a unified model‑service framework, a model‑service platform, a compute‑scheduling system, and a batch‑compute platform.

Model Service Framework

Algorithms are packaged with a Python‑based framework. To bypass the Global Interpreter Lock, Baidu uses a multi‑process + asynchronous coroutine architecture with three process types:

RPC process (based on BRPC). The Python implementation is tuned for multi‑process and coroutine, achieving >5× performance gain.

DAG process executes directed‑acyclic‑graph (DAG) workflows. Multiple DAG processes and async coroutines fully utilize CPU cores.

Model process runs inference on GPU (or Kunlun), isolates memory via shared GPU memory, and supports PyTorch, Paddle, and custom optimizations.

Inference acceleration techniques include:

Dynamic batching and multi‑stream execution.

Poros engine (open‑source at https://github.com/PaddlePaddle/FastDeploy/tree/develop/poros) that combines TorchScript, graph optimizations, TensorRT, and vLLM.

Quantization (FP16, INT8/INT4) and model compression (distillation, pruning) for higher throughput.

Compute Scheduling System

All requests pass through a unified FeatureGateway that enforces flow control and routing. The SmartScheduler discovers idle heterogeneous resources across internal PaaS platforms and deploys operators (“算子”) based on metadata from the Model Service Platform.

Two‑stage scheduling separates traffic allocation from resource allocation:

Traffic scheduling : Adjust a normalization coefficient using runtime metrics, map required QPS to NormalizedQps , sort jobs by priority, and assign operator capacity.

Resource scheduling : Convert capacity gaps into instance counts, fit workloads to appropriate hardware queues, sort by scarcity and cost‑effectiveness, pre‑assign resources to sub‑services, and fine‑tune assignments to avoid waste.

Batch Compute Platform & HTAP Storage

Offline feature computation suffers from Table scan bottlenecks caused by mixed OLTP/OLAP workloads, scan amplification, and high scaling cost. Baidu’s solution separates OLAP and OLTP storage:

OLAP storage built on RocksDB + AFS, using incremental snapshots, row‑partitioning, and column‑level merging to reduce I/O amplification.

HTAP SDK provides a unified API for accessing both the original Table and the new OLAP store, allowing simultaneous OLTP and OLAP tasks.

Task Generation & Execution

Three development modes simplify offline task creation:

Configuration‑based : Highly abstracted tasks generated via a web UI.

KQL : A SQL‑like language with user‑defined functions (similar to Spark UDF). Example:

Function classify = {
    def classify(cbytes, ids):
        unique_ids = set(ids)
        classify = int.from_bytes(cbytes, byteorder='little', signed=False)
        while classify != 0:
            tmp = classify & 0xFF
            if tmp in unique_ids:
                return True
            classify = classify >> 8
        return False
}

declare ids = [2, 8];
select * from my_table
convert by json outlet by row filter by function@classify(@cf0:types, @ids);

Offline framework : Full‑featured SDK for custom logic, compiled into deployment packages.

Generated tasks are dispatched to MapReduce or FaaS platforms. KQL tasks undergo parsing before scheduling, while framework‑based tasks follow automated DevOps checks. The scheduler continuously requests quota from the gateway and adapts request rates based on instance count and failure metrics.

Results

The system now supports over a dozen business lines (image search, video search, etc.), hundreds of operators, and billions of daily inference calls, updating trillion‑scale content features routinely. With the rise of large‑model AI, Baidu plans to integrate more generative capabilities and continue evolving the architecture.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI CloudNative Scheduling HTAP ModelServing BatchProcessing LargeScale

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.