GPU Performance Optimization Practices for Tencent PCG Recommendation Model Training Framework
This article presents a comprehensive overview of Tencent PCG's GPU‑based recommendation model training framework, detailing why GPU adoption is essential, the hardware and software challenges faced, the multi‑level data architecture, pipeline design, and a series of network, storage, and compute optimizations, followed by future directions.
The presentation introduces the need for a GPU‑accelerated recommendation model training framework at Tencent PCG, explaining that CPU‑based training suffered from limited network bandwidth, unstable connections, heterogeneous CPU models, and shared container I/O, making parameter‑server architectures unsuitable for large‑scale models.
To overcome these issues, the team adopted a single‑machine multi‑GPU design, focusing on hardware selection, data pipeline separation, and support for both XPU and GPU devices. The software stack primarily uses TensorFlow with the ability to switch to PyTorch, while maintaining compatibility with existing parameter‑server solutions for seamless migration.
The data architecture is organized into a four‑level hierarchy: SSD storage, host memory, GPU HBM, and a three‑tier cache system. Data is partitioned into groups, passes, and batches, enabling efficient SSD‑to‑host‑memory and host‑memory‑to‑GPU transfers. A compressed sparse row (CSR) based format and dynamic embedding tables reduce memory footprint and improve cache utilization.
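As an illustration of why a CSR‑style layout suits sparse recommendation features, the following is a minimal sketch, not the framework's actual format. It assumes each sample carries a variable‑length list of integer feature IDs; the function names `to_csr` and `sample_ids` and the example batch are hypothetical.

```python
from typing import List, Tuple


def to_csr(samples: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten variable-length ID lists into CSR-style (values, offsets)."""
    values: List[int] = []
    offsets: List[int] = [0]  # offsets[i] is where sample i starts in values
    for ids in samples:
        values.extend(ids)
        offsets.append(len(values))  # end of sample i, start of sample i + 1
    return values, offsets


def sample_ids(values: List[int], offsets: List[int], i: int) -> List[int]:
    """Recover the IDs of sample i from the two flat arrays."""
    return values[offsets[i]:offsets[i + 1]]


# Hypothetical batch: three samples with 2, 0, and 3 sparse feature IDs.
batch = [[101, 7], [], [42, 42, 9]]
values, offsets = to_csr(batch)
print(values)   # [101, 7, 42, 42, 9]
print(offsets)  # [0, 2, 2, 5]
```

Two flat arrays need no padding for short or empty samples and can be copied from host memory to the GPU as contiguous buffers, which is one plausible reason such a layout reduces memory footprint.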
The training pipeline separates data download, preprocessing, and computation. Large data groups are downloaded in advance, preprocessed into CSR batches, and loaded into GPU memory one whole pass at a time. Because the stages run concurrently, they form a multi‑level pipeline that keeps network, disk, and GPU bandwidth busy at the same time.
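The overlap between stages can be sketched with threads and bounded queues. This is a simplified illustration under stated assumptions, not the framework's implementation: `run_pipeline`, `stage`, and the three `fake_*` stand‑in functions are hypothetical, each stage runs in a single thread, and error handling is omitted.

```python
import queue
import threading
from typing import Any, Callable, List

_DONE = object()  # sentinel that tells a stage no more work is coming


def stage(fn: Callable[[Any], Any], inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(groups: List[str], download, preprocess, train, depth: int = 2) -> List[Any]:
    names_q: queue.Queue = queue.Queue()
    raw_q: queue.Queue = queue.Queue(maxsize=depth)    # bounded: download stays at most `depth` ahead
    ready_q: queue.Queue = queue.Queue(maxsize=depth)  # bounded: preprocess stays at most `depth` ahead
    for g in groups:
        names_q.put(g)
    names_q.put(_DONE)

    workers = [
        threading.Thread(target=stage, args=(download, names_q, raw_q)),
        threading.Thread(target=stage, args=(preprocess, raw_q, ready_q)),
    ]
    for w in workers:
        w.start()

    results = []
    while True:  # the calling thread plays the role of the compute stage
        item = ready_q.get()
        if item is _DONE:
            break
        results.append(train(item))
    for w in workers:
        w.join()
    return results


# Stand-in stage functions (assumptions for the demo only).
def fake_download(group: str) -> List[str]:
    return [f"{group}/record{i}" for i in range(4)]


def fake_preprocess(records: List[str]) -> List[List[str]]:
    return [records[i:i + 2] for i in range(0, len(records), 2)]  # batches of 2


def fake_train(batches: List[List[str]]) -> int:
    return sum(len(b) for b in batches)  # pretend "training" counts samples


results = run_pipeline(["g0", "g1", "g2"], fake_download, fake_preprocess, fake_train)
print(results)  # [4, 4, 4]
```

The bounded queues provide back‑pressure: while the compute stage works on one group, the next group is already being preprocessed and the one after that downloaded, but no stage runs arbitrarily far ahead of the others.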
On the network and I/O side, the team uses DPDK and BBR for network acceleration, DMA offloading, SSD parallelism via LVM, direct I/O, and Parquet columnar storage. Compute optimizations include unified hash tables with key re‑hashing, kernel merging, mixed‑precision training, dynamic embeddings, multiple hash tables, and asymmetric hash structures. Storage optimizations include INT quantization and asynchronous I/O.
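One possible reading of "unified hash tables with key re‑hashing" is that several logical embedding tables share a single physical hash table, with the table ID folded into each key so that identical feature IDs from different tables do not overwrite each other. The sketch below illustrates only that idea and is an assumption, not the team's actual design: `mix64`, `unified_key`, and `UnifiedEmbeddingTable` are hypothetical names, and a Python `dict` stands in for a GPU hash table.

```python
from typing import Dict, List

MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """64-bit mixer (the splitmix64 finalizer); a bijection on 64-bit ints."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


def unified_key(table_id: int, feature_id: int) -> int:
    """Fold the logical table ID into the feature ID, then re-hash."""
    return mix64((feature_id & MASK64) ^ mix64(table_id))


class UnifiedEmbeddingTable:
    """One physical table serving many logical tables; rows are created on first use."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.rows: Dict[int, List[float]] = {}

    def lookup(self, table_id: int, feature_id: int) -> List[float]:
        key = unified_key(table_id, feature_id)
        if key not in self.rows:
            self.rows[key] = [0.0] * self.dim  # dynamic insert with zero init
        return self.rows[key]


table = UnifiedEmbeddingTable(dim=4)
user_row = table.lookup(table_id=0, feature_id=42)
item_row = table.lookup(table_id=1, feature_id=42)  # same ID, different logical table
user_row[0] = 1.0  # updating one logical table leaves the other untouched
```

With this scheme the same feature ID in two different tables always maps to two different keys, because `mix64` is a bijection. Two different (table, feature) pairs can still collide in principle, so a production design would need a policy for that case. A single table also means a single lookup kernel per batch instead of one per feature table, which fits naturally with the kernel merging mentioned above.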
Future work includes training on alternative GPUs (A10/T4/P4), integrating large language models with recommendation systems, exploring hybrid PS‑GPU architectures for even larger models (up to petabyte scale), and achieving higher performance on lower‑cost hardware configurations.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
