Design and Application of Xiaohongshu Heterogeneous Training and Inference Engine
This article presents an overview of Xiaohongshu's heterogeneous training and inference engine, covering the challenges of model engineering, the design of the elastic heterogeneous engine, a future-oriented HPC training framework, AI compilation techniques, and an outlook on scalability and performance.
Today we share the design and application of Xiaohongshu's heterogeneous training and inference engine.
The main content includes five parts:
Challenges faced by Xiaohongshu model engineering
Design and implementation of a heterogeneous elastic engine
Future‑oriented HPC training framework
AI compilation technology
Outlook
1. Challenges of Xiaohongshu Model Engineering
In recent years Xiaohongshu's business has grown rapidly, with very high daily note exposure, interaction UV (unique visitors), and search volume.
From a model‑engineering perspective, the main challenges are:
Models keep growing more complex, and the data volumes they consume grow with them.
Compute requirements for training and inference grow accordingly.
Model application scenarios expand beyond traditional search, advertising and recommendation to e‑commerce, live streaming and other new services.
To cope with rapid business growth we need engine technology that reduces cost and provides iteration space for the business.
2. Design and Practice of the Heterogeneous Elastic Engine
2.1 First‑generation Training Framework
The first‑generation framework Larc is based on a PS‑worker design and includes several core techniques:
Support for ultra‑large‑scale sparse features.
Embedding table implementation using conflict‑free hashing, ensuring each ID is fully learned.
High‑performance Lookup Table operator with extensive OP fusion for better performance.
Support for multiple optimizer parameters.
The framework treats each GPU as a worker node. Because model types are diverse and cloud providers offer many GPU models, the core design places compute‑intensive operators on the GPU while optimizing lookup operations asynchronously.
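The conflict-free embedding table from the list above can be sketched as follows. The idea is that, unlike bucketed hashing where distinct sparse IDs can collide and share one row, every ID gets its own dedicated embedding row, so each ID is fully learned. This is a minimal illustrative sketch, not Larc's actual implementation; all names are assumptions.

```python
import numpy as np

class ConflictFreeEmbeddingTable:
    """Each raw sparse ID maps to its own row; there is no modulo
    bucketing, so two distinct IDs can never collide on one row."""

    def __init__(self, dim, init_scale=0.01, seed=0):
        self.dim = dim
        self.init_scale = init_scale
        self.rng = np.random.default_rng(seed)
        self.rows = {}  # raw ID -> embedding row

    def lookup(self, ids):
        out = np.empty((len(ids), self.dim), dtype=np.float32)
        for i, sid in enumerate(ids):
            if sid not in self.rows:
                # Allocate a fresh row the first time an ID is seen.
                self.rows[sid] = (self.init_scale *
                                  self.rng.standard_normal(self.dim)).astype(np.float32)
            out[i] = self.rows[sid]
        return out

table = ConflictFreeEmbeddingTable(dim=8)
v1 = table.lookup([10**12 + 7])   # very large sparse ID, no bucketing needed
v2 = table.lookup([10**12 + 7])
assert np.array_equal(v1, v2)     # the same ID always returns the same row
```

A bucketed table would compute `id % num_buckets` and risk collisions; the dictionary here trades memory for the guarantee that each ID owns its parameters.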
2.2 Heterogeneous GPU Training Framework
The training process is abstracted into ten steps; after evaluating candidate split points, we chose to cut the graph after embedding aggregation (scheme 2), yielding three stages.
Implementation details:
Sparse sub‑graph (sample parsing, feature extraction, embedding lookup and aggregation) runs on CPU workers.
Dense sub‑graph (dense network computation) runs on GPU workers.
Pairwise Send/Receive ops enable concurrent tensor communication, while message queues decouple CPU and GPU workers for asynchronous execution.
A global dynamic dispatch queue stores only small meta‑information to match N CPU workers with M GPU workers efficiently.
Mixed-precision handling reduces the bandwidth of inputs sent to the GPU; when CPU cores on GPU machines would otherwise sit idle, a local CPU worker is co-deployed on the GPU node.
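The decoupling described above can be sketched with threads and a shared queue: N CPU "sparse" workers produce batches and enqueue only small meta records, while M GPU "dense" workers pull from the queue independently, so the CPU-to-GPU worker ratio can be tuned per model. This is an illustrative sketch under assumed names, not the engine's actual code.

```python
import queue
import threading

dispatch_q = queue.Queue(maxsize=64)  # global queue holding only small meta-info
results = []
N_CPU, M_GPU, BATCHES = 4, 2, 20

def cpu_worker(wid):
    for b in range(BATCHES // N_CPU):
        # ... sample parsing, feature extraction, embedding lookup/aggregation ...
        meta = {"producer": wid, "batch": b, "buf_key": f"buf-{wid}-{b}"}
        dispatch_q.put(meta)          # only meta travels through the queue

def gpu_worker(wid):
    while True:
        meta = dispatch_q.get()
        if meta is None:              # poison pill: shut down
            break
        # ... fetch tensors referenced by meta, run the dense sub-graph ...
        results.append((wid, meta["buf_key"]))
        dispatch_q.task_done()

cpus = [threading.Thread(target=cpu_worker, args=(i,)) for i in range(N_CPU)]
gpus = [threading.Thread(target=gpu_worker, args=(i,)) for i in range(M_GPU)]
for t in cpus + gpus:
    t.start()
for t in cpus:
    t.join()
dispatch_q.join()                     # wait until every batch is consumed
for _ in gpus:
    dispatch_q.put(None)
for t in gpus:
    t.join()

assert len(results) == BATCHES        # every batch was dispatched to some GPU worker
```

Because the queue carries only meta-information (keys into shared buffers rather than tensors), matching N producers with M consumers stays cheap even at high throughput.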
2.3 First‑generation GPU Inference Architecture
The inference pipeline consists of three main steps: feature extraction, TensorFlow serving, and multi‑target fusion (ValueModel) computation. Similar to training, GPU utilization is low because CPU usage is high and batch sizes are small.
2.4 Heterogeneous GPU Inference Architecture
To address low GPU utilization, high latency from module splitting, and small batch size issues, we designed a new inference engine that:
Splits the TensorFlow graph across CPU and GPU workers, adding a communication hop but enabling parallelism between the stages.
Applies a data‑packet‑size‑aware dynamic parallelism strategy to reduce tail latency.
Uses zero‑copy optimization for serialization/deserialization stages.
Implements auto‑batching that aggregates multiple requests within a time window into a larger batch, improving GPU compute unit usage.
Redesigns stateful services (e.g., initial ranking) with three‑request protocols to decouple computation from memory and manage TTL on GPU nodes.
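The auto-batching step above can be sketched as a time-window collector: requests arriving within a short window are merged into one batch, up to a cap, so the GPU sees fewer, larger calls. Parameter names and values here are illustrative assumptions, not the engine's actual configuration.

```python
import queue
import time

def collect_batch(req_q, max_batch=32, window_ms=2.0):
    """Block for one request, then keep merging requests that arrive
    within `window_ms`, up to `max_batch` items."""
    try:
        first = req_q.get(timeout=1.0)
    except queue.Empty:
        return []
    batch = [first]
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                     # window expired: ship what we have
        try:
            batch.append(req_q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(40):
    q.put({"req_id": i})
b1 = collect_batch(q)
b2 = collect_batch(q)
assert len(b1) == 32 and len(b2) == 8  # 40 requests -> one full + one partial batch
```

The window length trades tail latency against batch size: a longer window yields larger batches and better GPU occupancy, at the cost of a bounded wait for the first request in each window.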
3. Future‑Oriented HPC Training Framework
Business growth and rapid algorithm iteration will lead to larger datasets and more complex models, demanding higher training throughput.
Existing PS‑worker based frameworks suffer from scaling inefficiencies and asynchronous convergence issues, and current GPUs (A10, A30) cannot efficiently train massive dense models.
Our next‑generation HPC framework draws inspiration from Baidu AIBox and NVIDIA HugeCTR, featuring:
Pass‑level aggregation with de‑duplication to reduce ID count.
Embedding swapping with pipelined parallelism to keep GPU compute busy.
Incremental swap‑in/out using locality between passes.
Table Fusion that aggregates embedding dimensions to reduce operator count.
The architecture offloads IO-heavy tasks (sample parsing, feature processing) to a CPU cluster, while GPU workers handle dense-graph computation, supported by a two-layer parameter server (HBM-PS and DRAM-PS).
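The pass-level aggregation and incremental swap-in described above can be sketched in a few lines: all sparse IDs in a pass are de-duplicated before their embeddings are moved into GPU HBM, and IDs already resident from the previous pass are skipped. The function and variable names are illustrative assumptions.

```python
def plan_swap_in(pass_batches, resident):
    """De-duplicate IDs across a whole pass, then fetch only the IDs
    not already resident in HBM from the previous pass."""
    unique_ids = set()
    for batch in pass_batches:        # pass-level aggregation with de-duplication
        unique_ids.update(batch)
    to_fetch = unique_ids - resident  # locality between passes: skip resident IDs
    return unique_ids, to_fetch

pass1 = [[1, 2, 3], [2, 3, 4], [4, 5]]   # 8 raw ID occurrences
pass2 = [[3, 4, 5], [5, 6]]

resident = set()
uniq1, fetch1 = plan_swap_in(pass1, resident)
resident = uniq1                          # pass 1 leaves its IDs in HBM
uniq2, fetch2 = plan_swap_in(pass2, resident)

assert fetch1 == {1, 2, 3, 4, 5}          # 8 occurrences reduced to 5 unique fetches
assert fetch2 == {6}                      # only the genuinely new ID is swapped in
```

In the real system the swap-in would overlap with dense computation on the previous pass (the pipelined parallelism mentioned above); here only the de-duplication and incremental-fetch bookkeeping is shown.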
4. AI Compilation Technology
AI compilation aims to obtain high‑performance models from any level of abstraction. The stack consists of a front‑end (Fourier graph optimizer, TensorFlow Grappler), a middle‑end, and back‑ends such as XLA, TVM, TensorRT.
Front-end optimizations include rule-based sub-graph matching and rewriting, and replacing TensorFlow's CPU MatMul with the high-efficiency MLAS library used by ONNX Runtime.
For the back‑end we selected XLA, solving two problems:
Static Batching to provide a fixed input shape for XLA in inference.
Switching from JIT to AOT for better online stability and support for heterogeneous hardware.
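The Static Batching idea above can be sketched as shape padding: since XLA compiles one executable per input shape, each variable-size request batch is padded up to a fixed batch size before the compiled function runs, and the padded rows are dropped from the output. `FIXED_BATCH`, the padding scheme, and the stand-in model below are illustrative assumptions.

```python
import numpy as np

FIXED_BATCH = 64

def pad_to_static(x):
    """Pad a variable-size batch up to FIXED_BATCH rows with zeros."""
    n = x.shape[0]
    assert n <= FIXED_BATCH
    pad = np.zeros((FIXED_BATCH - n,) + x.shape[1:], dtype=x.dtype)
    return np.concatenate([x, pad], axis=0), n

def run_compiled(x_fixed):
    # Stand-in for the AOT-compiled dense model: the input shape is always
    # the same, so a single compiled executable serves every request.
    assert x_fixed.shape[0] == FIXED_BATCH
    return x_fixed.sum(axis=1)

requests = np.random.rand(17, 8).astype(np.float32)  # 17 live requests
x_fixed, real_n = pad_to_static(requests)
scores = run_compiled(x_fixed)[:real_n]              # strip the padded rows
assert scores.shape == (17,)
```

Padding wastes some compute on dummy rows, but it avoids XLA recompilation on every new batch size, which is what makes ahead-of-time compilation viable for online serving.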
5. Summary and Outlook
The heterogeneous training and inference engine delivers:
High performance, with CPU and GPU utilization between 65% and 95%.
Flexibility to choose optimal compute‑splitting strategies per model and device.
High ROI on iteration, allowing resource requests based on region and model consumption.
Scalability for mixed‑workload tidal deployments.
Future work will focus on further AI compilation improvements, HPC synchronous training, heterogeneous parameter servers, and more flexible elastic training engines.
We welcome like‑minded engineers to join the team.
Speaker: Zeng Mingkun, Head of Xiaohongshu Training and Inference Engine
Editor: Lou Zhengyu
Proofreader: Li Yao
Community: DataFun