
PaddleBox and FeaBox: GPU‑Based Large‑Scale Sparse Model Training and Integrated Feature Extraction Frameworks at Baidu

The article introduces PaddleBox and FeaBox, two GPU‑driven frameworks designed for massive sparse DNN training and unified feature extraction, detailing their architecture, performance advantages, hardware‑software co‑design challenges, and successful deployment across Baidu's advertising systems.

DataFunSummit

This article presents PaddleBox, Baidu's first GPU‑only large‑scale sparse DNN training framework, and FeaBox, an integrated GPU‑based feature extraction system, both built on the GPUBox platform to achieve high performance, low cost, and high stability for trillion‑parameter models.

It outlines the evolution of Baidu's CTR modeling, from early LR models to continuous‑value DNNs and distributed CPU parameter servers, highlighting the limitations of CPU‑centric solutions and the need for GPU acceleration.

The authors describe the three‑layer heterogeneous parameter server architecture (SSD, memory, and distributed GPU sparse servers) that enables single‑machine training of 10 TB models and multi‑machine scaling. It addresses storage, performance, and communication challenges with NVLink and NVSwitch interconnects and custom GPU sparse parameter servers.
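To make the three‑tier idea concrete, here is a minimal sketch of a sparse parameter table whose rows move between a small "GPU" tier, a larger "memory" tier, and an "SSD" tier. It is not PaddleBox code: the class name `TieredSparseTable`, the LRU eviction policy, promotion on access, zero initialization of unseen features, and the plain SGD update are all assumptions chosen for illustration, and ordinary Python dictionaries stand in for real GPU memory and SSD files.

```python
from collections import OrderedDict


class TieredSparseTable:
    """Toy three-tier store for sparse embedding rows: gpu -> mem -> ssd.

    Hypothetical illustration only; dicts stand in for real devices.
    """

    def __init__(self, gpu_capacity, mem_capacity, dim=4):
        self.dim = dim
        self.gpu = OrderedDict()  # hottest rows, smallest tier (LRU order)
        self.mem = OrderedDict()  # warm rows (LRU order)
        self.ssd = {}             # cold rows; unbounded in this sketch
        self.gpu_capacity = gpu_capacity
        self.mem_capacity = mem_capacity
        self.hits = {"gpu": 0, "mem": 0, "ssd": 0, "new": 0}

    def _put(self, tier, capacity, key, row, spill):
        # Insert a row and, if the tier overflows, hand the least
        # recently used row to the next tier down via `spill`.
        tier[key] = row
        tier.move_to_end(key)
        if len(tier) > capacity:
            old_key, old_row = tier.popitem(last=False)
            spill(old_key, old_row)

    def _to_mem(self, key, row):
        self._put(self.mem, self.mem_capacity, key, row, self.ssd.__setitem__)

    def _to_gpu(self, key, row):
        self._put(self.gpu, self.gpu_capacity, key, row, self._to_mem)

    def pull(self, key):
        """Return the row for `key`, promoting it to the GPU tier."""
        if key in self.gpu:
            self.hits["gpu"] += 1
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.mem:
            self.hits["mem"] += 1
            row = self.mem.pop(key)
        elif key in self.ssd:
            self.hits["ssd"] += 1
            row = self.ssd.pop(key)
        else:
            self.hits["new"] += 1
            row = [0.0] * self.dim  # assumed init for unseen features
        self._to_gpu(key, row)
        return row

    def push(self, key, grad, lr=0.1):
        """Apply a plain SGD step to the row for `key` (assumed optimizer)."""
        row = self.pull(key)
        for i, g in enumerate(grad):
            row[i] -= lr * g


table = TieredSparseTable(gpu_capacity=2, mem_capacity=2, dim=2)
for feature_id in (1, 2, 3, 4, 5, 1):
    table.pull(feature_id)
print(table.hits)
```

The point of the sketch is the data movement: each row lives in exactly one tier, frequently touched rows stay in the fastest tier, and the full table can exceed the capacity of the upper tiers because cold rows spill downward.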

FeaBox is introduced as a one‑stop feature processing pipeline that combines CPU and GPU operators, employs dynamic GPU memory pooling, and leverages DAG‑based heterogeneous scheduling to dramatically reduce I/O, cut storage to zero, and accelerate feature research from days to minutes.
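As a rough illustration of DAG‑based heterogeneous scheduling, the sketch below tags each feature operator with a device, resolves dependencies with Python's standard `graphlib.TopologicalSorter`, and records which operators become runnable together on each device. The operator names, the `run_feature_dag` function, and the `(device, deps, fn)` layout are hypothetical and not FeaBox's API. Everything runs sequentially on the CPU here; a real scheduler would dispatch each wave's CPU and GPU operators concurrently.

```python
from graphlib import TopologicalSorter


def run_feature_dag(ops, inputs):
    """Run operators in dependency order.

    ops:    name -> (device, deps, fn), device is "cpu" or "gpu" (a label only)
    inputs: name -> value for raw inputs that operators depend on
    Returns (results, schedule); schedule lists, per wave, the operators
    that were runnable together on each device.
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps, _) in ops.items()})
    sorter.prepare()
    results = dict(inputs)  # intermediates stay in memory, never written out
    schedule = []
    while sorter.is_active():
        wave = {"cpu": [], "gpu": []}
        for name in sorted(sorter.get_ready()):
            if name in ops:
                device, deps, fn = ops[name]
                results[name] = fn(*(results[d] for d in deps))
                wave[device].append(name)
            elif name not in results:
                raise KeyError(f"missing input: {name}")
            sorter.done(name)
        if wave["cpu"] or wave["gpu"]:
            schedule.append(wave)
    return results, schedule


# Hypothetical pipeline: parse "user_id,click" log lines, then derive features.
OPS = {
    "parse": ("cpu", ["raw_log"],
              lambda log: [tuple(int(x) for x in line.split(",")) for line in log]),
    "bucketize": ("gpu", ["parse"], lambda rows: [uid % 8 for uid, _ in rows]),
    "labels": ("cpu", ["parse"], lambda rows: [click for _, click in rows]),
    "samples": ("gpu", ["bucketize", "labels"],
                lambda buckets, labels: list(zip(buckets, labels))),
}

results, schedule = run_feature_dag(OPS, {"raw_log": ["7,1", "12,0", "7,1"]})
for wave in schedule:
    print(wave)
```

In this toy pipeline, `bucketize` and `labels` both depend only on `parse`, so they land in the same wave on different devices. That independence is what a heterogeneous scheduler can exploit to keep CPU and GPU busy at the same time.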

Hardware co‑design with Baidu's XMan supercomputers (versions 1.0‑4.0) is discussed, showing how flexible CPU‑GPU configurations, high‑bandwidth interconnects, and optimized SSD layouts support the frameworks' demanding workloads.

Finally, the article reports successful deployment of PaddleBox and FeaBox across Baidu's advertising platforms (Fengchao and Baicheng), delivering 5‑40× cost efficiency improvements, supporting diverse models (CTR, CVR, recall), and expanding AI capabilities beyond advertising.

Tags: Feature Extraction, GPU, distributed training, AI infrastructure, PaddleBox, sparse models, FeaBox