How PGLBox Achieves 27× Faster GPU‑Powered Large‑Scale Graph Learning

PGLBox, Baidu’s GPU‑based large‑scale graph training framework, delivers up to 27× speedup over CPU clusters by fully GPU‑accelerating storage, sampling, and training, supporting billions of nodes, advanced GNN algorithms, multi‑level storage, and seamless integration of massive pretrained models.

Baidu Geek Talk

Background

Graph neural networks (GNNs) are deep‑learning models that operate directly on graph‑structured data. Traditional large‑scale graph training systems run on CPU clusters with separate parameter servers. When a graph contains billions of nodes and edges, this design incurs heavy inter‑machine communication, scales poorly, and delivers unstable performance.

PGLBox Overview

PGLBox is a GPU‑centric framework for training massive graph models. It is integrated with the PaddlePaddle deep‑learning platform and inherits the flexible Graph4Rec API. The system can handle graphs with hundreds of billions of nodes and edges while keeping the programming model simple.

Architectural Innovations

Full‑GPU pipeline: Graph storage, random‑walk generation, neighbor sampling, and model training are all executed on GPUs, eliminating costly CPU‑GPU data transfers.

Multi‑level storage hierarchy: The graph topology resides entirely in GPU memory; node attributes use a two‑level hierarchy (GPU memory + host memory); model parameters use a three‑level hierarchy (GPU memory + host memory + NVMe SSD). This design expands the feasible graph size by an order of magnitude.

Intelligent communication: The framework detects NVLink and non‑full‑mesh network topologies and inserts relay nodes to reduce cross‑machine traffic.

Balanced training: The pass size is adjusted dynamically to even out the memory footprint across training steps, lowering peak GPU memory usage and enabling larger graphs on a single machine.
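
The two‑level attribute storage described above can be pictured as a small fast tier backed by a larger slow tier. The following is a minimal Python sketch under that assumption, not PGLBox's implementation: the class `TwoLevelStore`, its LRU eviction policy, and the plain dictionaries standing in for GPU and host memory are all hypothetical, chosen only to show how a lookup falls through from the fast tier to the slow one.

```python
from collections import OrderedDict


class TwoLevelStore:
    """Toy two-tier attribute store (hypothetical; not the PGLBox API).

    `fast` stands in for GPU memory and holds at most `fast_capacity`
    entries; `slow` stands in for host memory and holds every entry.
    """

    def __init__(self, data, fast_capacity):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.slow = dict(data)        # complete copy of all node attributes
        self.fast = OrderedDict()     # small cache, kept in LRU order
        self.fast_capacity = fast_capacity
        self.hits = 0
        self.misses = 0

    def get(self, node_id):
        # Fast path: the attribute is already in the small tier.
        if node_id in self.fast:
            self.fast.move_to_end(node_id)   # mark as most recently used
            self.hits += 1
            return self.fast[node_id]
        # Slow path: fetch from the large tier (KeyError if unknown),
        # then promote the entry into the small tier.
        value = self.slow[node_id]
        self.misses += 1
        self.fast[node_id] = value
        if len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)    # evict least recently used
        return value


# Ten nodes with 4-dimensional attributes; only two fit in the fast tier.
store = TwoLevelStore({i: [float(i)] * 4 for i in range(10)}, fast_capacity=2)
for node in [1, 2, 1, 3, 2]:
    store.get(node)
print(store.hits, store.misses)
```

LRU is only one possible promotion policy; a real system could equally pin frequently sampled nodes in the fast tier ahead of time. The sketch is meant to convey the lookup path, not the placement strategy.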

Performance

Compared with conventional MPI‑based CPU distributed solutions, PGLBox achieves roughly 27× higher training throughput. The pipeline architecture maximizes utilization of heterogeneous hardware (GPU compute, NVLink, PCIe) and the intelligent communication layer mitigates network bottlenecks.

Algorithmic Support

PGLBox bundles a wide range of GNN algorithms and adds support for large‑scale pretrained models such as ERNIE (language) and ERNIE‑ViL (cross‑modal). These models can be loaded together with massive graph structures, enabling end‑to‑end learning over heterogeneous node features (text, images, user profiles, geolocation) and discrete identifiers (user ID, item ID) via a GPU‑accelerated parameter server.
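
One way to picture how discrete identifiers and content features can feed the same model is to look up a learned embedding per ID and concatenate it with a dense feature vector. The Python sketch below illustrates that idea under stated assumptions: `SparseEmbeddingTable` is a hypothetical stand‑in for a sparse parameter server, `toy_text_features` is a placeholder for a pretrained encoder such as ERNIE (it only counts characters), and the ID `"user:42"` is invented for the example. None of these names come from PGLBox.

```python
import random


class SparseEmbeddingTable:
    """Toy stand-in for a sparse parameter server (hypothetical)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rows = {}                     # ID -> embedding vector
        self.rng = random.Random(seed)

    def pull(self, key):
        # Rows are created lazily, so only IDs that actually occur
        # in the data consume memory.
        if key not in self.rows:
            self.rows[key] = [
                self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self.rows[key]


def toy_text_features(text, dim):
    """Placeholder for a pretrained text encoder: bucketed character counts."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def node_input(table, node_id, text, text_dim):
    """Concatenate the ID embedding with the dense text features."""
    return table.pull(node_id) + toy_text_features(text, text_dim)


table = SparseEmbeddingTable(dim=4)
x = node_input(table, "user:42", "likes hiking", text_dim=8)
print(len(x))  # 4 embedding dims + 8 text dims = 12
```

Concatenation is the simplest fusion choice; summation or a learned projection would work with the same lookup structure.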

Open‑Source Repository

The full source code is available at https://github.com/PaddlePaddle/PGL/tree/main/apps/PGLBox. Users can clone the repository, contribute patches, and report issues through the standard GitHub workflow.
