Design and Application of Xiaohongshu Heterogeneous Training and Inference Engine
This article presents an overview of Xiaohongshu's heterogeneous training and inference engine, covering the challenges of model engineering, the design of the elastic heterogeneous engine, a future-oriented HPC training framework, AI compilation techniques, and an outlook on scalability and performance.
Today we share the design and application of Xiaohongshu's heterogeneous training and inference engine.
The main content includes five parts:
Challenges faced by Xiaohongshu model engineering
Design and implementation of a heterogeneous elastic engine
Future‑oriented HPC training framework
AI compilation technology
Outlook
1. Challenges of Xiaohongshu Model Engineering
In recent years Xiaohongshu's business has grown rapidly, with very high daily note exposure, interaction UV (unique visitors), and search volume.
From a model‑engineering perspective, the main challenges are:
Models keep growing more complex, and the data volumes they consume grow with them.
Compute requirements for training and inference grow accordingly.
Model application scenarios expand beyond traditional search, advertising and recommendation to e‑commerce, live streaming and other new services.
To cope with rapid business growth we need engine technology that reduces cost and provides iteration space for the business.
2. Design and Practice of the Heterogeneous Elastic Engine
2.1 First‑generation Training Framework
The first‑generation framework Larc is based on a PS‑worker design and includes several core techniques:
Support for ultra‑large‑scale sparse features.
Embedding table implementation using conflict‑free hashing, ensuring each ID is fully learned.
High‑performance Lookup Table operator with extensive OP fusion for better performance.
Support for multiple optimizer parameters.
The framework treats each GPU as a worker node. Because model types are diverse and cloud providers offer many GPU models, the core design places compute‑intensive operators on the GPU while optimizing lookup operations asynchronously.
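The conflict-free embedding table from the list above can be sketched as follows. The idea is that, unlike bucketed hashing where distinct sparse IDs can collide and share one row, every ID gets its own dedicated embedding row, so each ID is fully learned. This is a minimal illustrative sketch, not Larc's actual implementation; all names are assumptions.

```python
import numpy as np

class ConflictFreeEmbeddingTable:
    """Each raw sparse ID maps to its own row; there is no modulo
    bucketing, so two distinct IDs can never collide on one row."""

    def __init__(self, dim, init_scale=0.01, seed=0):
        self.dim = dim
        self.init_scale = init_scale
        self.rng = np.random.default_rng(seed)
        self.rows = {}  # raw ID -> embedding row

    def lookup(self, ids):
        out = np.empty((len(ids), self.dim), dtype=np.float32)
        for i, sid in enumerate(ids):
            if sid not in self.rows:
                # Allocate a fresh row the first time an ID is seen.
                self.rows[sid] = (self.init_scale *
                                  self.rng.standard_normal(self.dim)).astype(np.float32)
            out[i] = self.rows[sid]
        return out

table = ConflictFreeEmbeddingTable(dim=8)
v1 = table.lookup([10**12 + 7])   # very large sparse ID, no bucketing needed
v2 = table.lookup([10**12 + 7])
assert np.array_equal(v1, v2)     # the same ID always returns the same row
```

A bucketed table would compute `id % num_buckets` and risk collisions; the dictionary here trades memory for the guarantee that each ID owns its parameters.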
2.2 Heterogeneous GPU Training Framework
The training process is abstracted into ten steps; after evaluating candidate split points, we chose to cut the graph after embedding aggregation (scheme 2), yielding three stages.
Implementation details:
Sparse sub‑graph (sample parsing, feature extraction, embedding lookup and aggregation) runs on CPU workers.
Dense sub‑graph (dense network computation) runs on GPU workers.
Pairwise Send/Receive ops enable concurrent tensor communication, while message queues decouple CPU and GPU workers for asynchronous execution.
A global dynamic dispatch queue stores only small meta‑information to match N CPU workers with M GPU workers efficiently.
Mixed-precision handling reduces the bandwidth of inputs sent to the GPU; when CPU cores on GPU machines would otherwise sit idle, a local CPU worker is co-deployed on the GPU node.
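The decoupling described above can be sketched with threads and a shared queue: N CPU "sparse" workers produce batches and enqueue only small meta records, while M GPU "dense" workers pull from the queue independently, so the CPU-to-GPU worker ratio can be tuned per model. This is an illustrative sketch under assumed names, not the engine's actual code.

```python
import queue
import threading

dispatch_q = queue.Queue(maxsize=64)  # global queue holding only small meta-info
results = []
N_CPU, M_GPU, BATCHES = 4, 2, 20

def cpu_worker(wid):
    for b in range(BATCHES // N_CPU):
        # ... sample parsing, feature extraction, embedding lookup/aggregation ...
        meta = {"producer": wid, "batch": b, "buf_key": f"buf-{wid}-{b}"}
        dispatch_q.put(meta)          # only meta travels through the queue

def gpu_worker(wid):
    while True:
        meta = dispatch_q.get()
        if meta is None:              # poison pill: shut down
            break
        # ... fetch tensors referenced by meta, run the dense sub-graph ...
        results.append((wid, meta["buf_key"]))
        dispatch_q.task_done()

cpus = [threading.Thread(target=cpu_worker, args=(i,)) for i in range(N_CPU)]
gpus = [threading.Thread(target=gpu_worker, args=(i,)) for i in range(M_GPU)]
for t in cpus + gpus:
    t.start()
for t in cpus:
    t.join()
dispatch_q.join()                     # wait until every batch is consumed
for _ in gpus:
    dispatch_q.put(None)
for t in gpus:
    t.join()

assert len(results) == BATCHES        # every batch was dispatched to some GPU worker
```

Because the queue carries only meta-information (keys into shared buffers rather than tensors), matching N producers with M consumers stays cheap even at high throughput.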
2.3 First‑generation GPU Inference Architecture
The inference pipeline consists of three main steps: feature extraction, TensorFlow serving, and multi‑target fusion (ValueModel) computation. Similar to training, GPU utilization is low because CPU usage is high and batch sizes are small.
2.4 Heterogeneous GPU Inference Architecture
To address low GPU utilization, high latency from module splitting, and small batch size issues, we designed a new inference engine that:
Splits the TensorFlow graph across CPU and GPU workers, adding a communication hop but enabling parallelism between the stages.
Applies a data‑packet‑size‑aware dynamic parallelism strategy to reduce tail latency.
Uses zero‑copy optimization for serialization/deserialization stages.
Implements auto‑batching that aggregates multiple requests within a time window into a larger batch, improving GPU compute unit usage.
Redesigns stateful services (e.g., initial ranking) with three‑request protocols to decouple computation from memory and manage TTL on GPU nodes.
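The auto-batching step above can be sketched as a time-window collector: requests arriving within a short window are merged into one batch, up to a cap, so the GPU sees fewer, larger calls. Parameter names and values here are illustrative assumptions, not the engine's actual configuration.

```python
import queue
import time

def collect_batch(req_q, max_batch=32, window_ms=2.0):
    """Block for one request, then keep merging requests that arrive
    within `window_ms`, up to `max_batch` items."""
    try:
        first = req_q.get(timeout=1.0)
    except queue.Empty:
        return []
    batch = [first]
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                     # window expired: ship what we have
        try:
            batch.append(req_q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(40):
    q.put({"req_id": i})
b1 = collect_batch(q)
b2 = collect_batch(q)
assert len(b1) == 32 and len(b2) == 8  # 40 requests -> one full + one partial batch
```

The window length trades tail latency against batch size: a longer window yields larger batches and better GPU occupancy, at the cost of a bounded wait for the first request in each window.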
3. Future‑Oriented HPC Training Framework
Business growth and rapid algorithm iteration will lead to larger datasets and more complex models, demanding higher training throughput.
Existing PS‑worker based frameworks suffer from scaling inefficiencies and asynchronous convergence issues, and current GPUs (A10, A30) cannot efficiently train massive dense models.
Our next‑generation HPC framework draws inspiration from Baidu AIBox and NVIDIA HugeCTR, featuring:
Pass‑level aggregation with de‑duplication to reduce ID count.
Embedding swapping with pipelined parallelism to keep GPU compute busy.
Incremental swap‑in/out using locality between passes.
Table Fusion that aggregates embedding dimensions to reduce operator count.
The architecture offloads IO-heavy tasks (sample parsing, feature processing) to a CPU cluster, while GPU workers handle dense-graph computation, supported by a two-layer parameter server (HBM-PS and DRAM-PS).
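The pass-level aggregation and incremental swap-in described above can be sketched in a few lines: all sparse IDs in a pass are de-duplicated before their embeddings are moved into GPU HBM, and IDs already resident from the previous pass are skipped. The function and variable names are illustrative assumptions.

```python
def plan_swap_in(pass_batches, resident):
    """De-duplicate IDs across a whole pass, then fetch only the IDs
    not already resident in HBM from the previous pass."""
    unique_ids = set()
    for batch in pass_batches:        # pass-level aggregation with de-duplication
        unique_ids.update(batch)
    to_fetch = unique_ids - resident  # locality between passes: skip resident IDs
    return unique_ids, to_fetch

pass1 = [[1, 2, 3], [2, 3, 4], [4, 5]]   # 8 raw ID occurrences
pass2 = [[3, 4, 5], [5, 6]]

resident = set()
uniq1, fetch1 = plan_swap_in(pass1, resident)
resident = uniq1                          # pass 1 leaves its IDs in HBM
uniq2, fetch2 = plan_swap_in(pass2, resident)

assert fetch1 == {1, 2, 3, 4, 5}          # 8 occurrences reduced to 5 unique fetches
assert fetch2 == {6}                      # only the genuinely new ID is swapped in
```

In the real system the swap-in would overlap with dense computation on the previous pass (the pipelined parallelism mentioned above); here only the de-duplication and incremental-fetch bookkeeping is shown.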
4. AI Compilation Technology
AI compilation aims to obtain high‑performance models from any level of abstraction. The stack consists of a front‑end (Fourier graph optimizer, TensorFlow Grappler), a middle‑end, and back‑ends such as XLA, TVM, TensorRT.
Front-end optimizations include rule-based sub-graph matching and rewriting, and replacing TensorFlow's CPU MatMul with the high-efficiency MLAS library used by ONNX Runtime.
For the back‑end we selected XLA, solving two problems:
Static Batching to provide a fixed input shape for XLA in inference.
Switching from JIT to AOT for better online stability and support for heterogeneous hardware.
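The Static Batching idea above can be sketched as shape padding: since XLA compiles one executable per input shape, each variable-size request batch is padded up to a fixed batch size before the compiled function runs, and the padded rows are dropped from the output. `FIXED_BATCH`, the padding scheme, and the stand-in model below are illustrative assumptions.

```python
import numpy as np

FIXED_BATCH = 64

def pad_to_static(x):
    """Pad a variable-size batch up to FIXED_BATCH rows with zeros."""
    n = x.shape[0]
    assert n <= FIXED_BATCH
    pad = np.zeros((FIXED_BATCH - n,) + x.shape[1:], dtype=x.dtype)
    return np.concatenate([x, pad], axis=0), n

def run_compiled(x_fixed):
    # Stand-in for the AOT-compiled dense model: the input shape is always
    # the same, so a single compiled executable serves every request.
    assert x_fixed.shape[0] == FIXED_BATCH
    return x_fixed.sum(axis=1)

requests = np.random.rand(17, 8).astype(np.float32)  # 17 live requests
x_fixed, real_n = pad_to_static(requests)
scores = run_compiled(x_fixed)[:real_n]              # strip the padded rows
assert scores.shape == (17,)
```

Padding wastes some compute on dummy rows, but it avoids XLA recompilation on every new batch size, which is what makes ahead-of-time compilation viable for online serving.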
5. Summary and Outlook
The heterogeneous training and inference engine delivers:
High performance, with CPU and GPU utilization between 65% and 95%.
Flexibility to choose optimal compute‑splitting strategies per model and device.
High ROI on iteration, allowing resource requests based on region and model consumption.
Scalability for mixed‑workload tidal deployments.
Future work will focus on further AI compilation improvements, HPC synchronous training, heterogeneous parameter servers, and more flexible elastic training engines.
We welcome like‑minded engineers to join the team.
Speaker: Zeng Mingkun, Head of Xiaohongshu Training and Inference Engine
Editor: Lou Zhengyu
Proofreader: Li Yao
Community: DataFun