
HybridBackend Accelerates GPU-Based Recommendation Model Training for Ximalaya AI Cloud

Ximalaya AI Cloud adopted the open-source HybridBackend framework to overcome its sparse-data bottlenecks. Columnar Parquet reads and hybrid-parallel GPU training raised GPU utilization more than threefold, cut recommendation-model training time by more than half, and now power all TensorFlow and DeepRec production models.

Ximalaya Technology Team

Ximalaya AI Cloud leverages the open‑source HybridBackend framework to achieve efficient GPU training for its recommendation models, which power key app features such as Hot Topics, "You May Like", Private FM, homepage feeds, discovery page, and Daily Must‑Listen.

The shift from CPU to GPU training exposed serious resource-utilization problems. Sparse data stored in the classic LibSVM format had to be downloaded as large string blobs from remote object storage (OSS), parsed into feature vectors, and fed to embedding tables, consuming excessive network bandwidth and CPU cycles. Distributed training with Keras + Horovod suffered from unstable speedups and degraded model metrics, while a custom parameter-server (PS) solution introduced frequent I/O through its custom ps-pull/ps-push operators, becoming a new bottleneck.
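To see why the LibSVM path was CPU-heavy, consider what parsing one sample entails. The sketch below (plain Python; the function name and field layout are illustrative, not Ximalaya's actual code) shows the per-record string splitting and type conversion that must happen for every sample before an embedding lookup can run:

```python
# Minimal sketch of row-based LibSVM parsing: every sample requires string
# splitting and numeric conversion on the CPU before features reach the GPU.

def parse_libsvm_line(line: str):
    """Parse one LibSVM-format line: '<label> <index>:<value> ...'."""
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for token in parts[1:]:
        idx, val = token.split(":")
        indices.append(int(idx))   # sparse feature index fed to the embedding table
        values.append(float(val))  # feature value
    return label, indices, values

label, indices, values = parse_libsvm_line("1 3:0.5 10:1.0 42:2.5")
```

Multiplied across billions of samples per epoch, this per-record string work is exactly the load that columnar formats avoid.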

HybridBackend, an open‑source framework released at ICDE 2022 and installable via pip, provides deep optimizations for sparse data access, sparse computation, and distributed training. It offers a simple API (hb.data.Dataset) and is compatible with TensorFlow, DeepRec, and other training stacks.

For sparse data access, HybridBackend supports columnar formats such as Parquet, enabling selective column parsing and parallel reads (num_parallel_reads, num_parallel_parser_calls). This reduces network traffic, lowers CPU load, and boosts GPU utilization by more than 3×, dramatically shortening training cycles.
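The benefit of selective column parsing can be illustrated with a toy comparison (plain Python, not the hb.data.Dataset API itself; column names are invented): row-oriented storage forces every field of every record to be deserialized, while a columnar layout lets the reader touch only the columns the model actually consumes.

```python
# Toy illustration of row-based vs columnar access. HybridBackend does this
# at the Parquet level, with parallel readers and parsers.

rows = [
    {"user_id": 1, "item_id": 7, "label": 1, "raw_text": "large unused blob"},
    {"user_id": 2, "item_id": 9, "label": 0, "raw_text": "another unused blob"},
]

# Row-oriented read: every field of every record is materialized.
row_read = [(r["user_id"], r["item_id"], r["label"]) for r in rows]

# Columnar layout: each column is stored contiguously, so a reader can
# deserialize only the columns it needs and never touch "raw_text".
columns = {key: [r[key] for r in rows] for key in rows[0]}
needed = ("user_id", "item_id", "label")
col_read = list(zip(*(columns[c] for c in needed)))

assert row_read == col_read  # same training data, less data parsed
```

In the real pipeline the skipped column is the dominant cost, which is where the bandwidth and CPU savings come from.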

HybridBackend also introduces a hybrid parallel training mode where each GPU holds all dense parameters and a partition of sparse parameters, communicating via NCCL over NVLink instead of traditional RPC‑based PS. This architecture improves training speed and GPU utilization, especially for large‑scale recommendation models.
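The placement described above can be sketched in a few lines of plain Python (a toy model of the idea, not HybridBackend's implementation; the modulo-sharding rule and all names are illustrative): dense parameters are replicated on every GPU, while the sparse embedding table is partitioned by feature ID, and lookups route each ID to its owning shard.

```python
# Toy sketch of hybrid parallelism: dense parameters replicated per GPU,
# sparse embedding rows sharded across GPUs by feature ID.

NUM_GPUS = 4

# Dense parameters: every GPU holds a full copy (data parallelism).
dense_weights = [0.1, 0.2, 0.3]
dense_replicas = [list(dense_weights) for _ in range(NUM_GPUS)]

def owner(feature_id: int) -> int:
    """Modulo sharding: which GPU's partition stores this embedding row."""
    return feature_id % NUM_GPUS

# Sparse embedding table: each GPU holds only its shard (model parallelism).
embedding_table = {fid: [float(fid)] * 2 for fid in range(10)}
shards = [dict() for _ in range(NUM_GPUS)]
for fid, row in embedding_table.items():
    shards[owner(fid)][fid] = row

# A batch lookup routes each ID to its owning shard (an NCCL all-to-all
# over NVLink in the real system), then gathers the rows back.
batch_ids = [3, 8, 5]
gathered = [shards[owner(fid)][fid] for fid in batch_ids]
```

Because shard-to-shard exchange happens over NVLink via NCCL rather than RPC to remote parameter servers, the lookup path avoids the I/O bottleneck of the PS design.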

After full integration, single‑node multi‑GPU training saw average GPU utilization increase by over 1.4× and overall training time drop by more than 50%. The HybridBackend‑based solution has been rolled out to all TensorFlow and DeepRec models in production.

Future work includes operator‑level optimizations for embedding look‑up, adding PyTorch support for NLP recommendation scenarios, and scaling the system to handle ultra‑large distributed training with billions of samples and feature dimensions.

recommendation system · distributed training · AI Cloud · GPU training · HybridBackend · sparse data optimization
Written by

Ximalaya Technology Team

Official account of Ximalaya's technology team, sharing distilled technical experience and insights to grow together.
