
Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Xiaohongshu Tech REDtech

At QCon Shanghai 2024, the Xiaohongshu large model team presented their self-developed RLHF (Reinforcement Learning from Human Feedback) system for training multimodal large language models (MLLMs). The talk covered the challenges posed by long-text and multimodal inputs and by the complexity of PPO training, and described how the AGI team addressed them with heterogeneous and homogeneous network architectures and end-to-end training-inference optimizations, demonstrating performance gains over open-source frameworks.

The article begins with an introduction to reinforcement learning principles, defining state as the input prompt, action as the model's response, and reward as the score from a reward model. It explains the two-stage RLHF process: reward model training and PPO-based policy optimization, detailing the actor, critic, reward, and reference models involved in PPO.
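The interplay of these models can be made concrete with a small sketch. The function below is not the team's actual code (all names are illustrative); it shapes per-token PPO rewards the way RLHF pipelines commonly do, with a KL penalty against the reference model at every token and the reward-model score added on the final token of the response:

```python
def kl_shaped_rewards(logp_actor, logp_ref, reward_score, kl_coef=0.1):
    """Per-token PPO rewards for RLHF (illustrative sketch).

    logp_actor / logp_ref: per-token log-probabilities of the sampled
    response under the actor and the frozen reference model.
    reward_score: sequence-level score from the reward model.
    """
    # Penalize drift from the reference model at every token.
    rewards = [-kl_coef * (la - lr) for la, lr in zip(logp_actor, logp_ref)]
    # The reward model scores the whole response; by convention that
    # scalar is credited to the last generated token.
    rewards[-1] += reward_score
    return rewards
```

The critic then learns value estimates against these shaped rewards, while the actor is updated with the clipped PPO objective.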

System design sections discuss the overall architecture: Megatron-Core for training and Ray for scheduling, with vLLM as the inference engine, while log-probability computation stays in Megatron to keep training and inference consistent. The heterogeneous network architecture uses forward offload to reuse GPU memory, reducing required devices from 4× to 2×, and separates the actor and critic clusters for asynchronous training, achieving over 50% throughput improvement over baselines such as trlX and OpenRLHF.
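The separated-cluster idea can be sketched in plain Python, with threads standing in for the two clusters; `generate`, `score`, `train_actor`, and `train_critic` are placeholders for the real distributed jobs, not the team's APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def run_async_ppo_step(prompts, generate, score, train_actor, train_critic):
    """Illustrative PPO step with separated actor/critic resources.

    Once rollouts are generated and scored, the actor and critic updates
    no longer depend on each other, so they can run concurrently on
    their own clusters instead of serially on a shared one.
    """
    rollouts = [generate(p) for p in prompts]        # inference engine (e.g. vLLM)
    experience = [(r, score(r)) for r in rollouts]   # reward-model scoring
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_actor = pool.submit(train_actor, experience)
        f_critic = pool.submit(train_critic, experience)
        return f_actor.result(), f_critic.result()
```

In the real system the two updates would be Ray tasks on disjoint GPU groups; the point of the sketch is only that the actor and critic phases overlap rather than alternate.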

Further scaling is addressed by a homogeneous network architecture that offloads parameters to enable cluster reuse, cutting cluster size to that of a single SFT training job. The article then details performance optimizations: data prefetch, TP/PP/CP/SP parallelism, recompute for memory saving, dynamic batch sizing, load‑balanced vLLM engines, and pipeline parallelism that overlaps generate and forward stages to reduce make‑experience latency.
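Of these, dynamic batch sizing is easy to illustrate. A minimal sketch (not the team's implementation) packs variable-length sequences into batches under a token budget, so short responses are grouped densely instead of padding every batch to a fixed sequence count:

```python
def dynamic_batches(seq_lens, max_tokens):
    """Greedy token-budget batching (illustrative sketch).

    seq_lens: length in tokens of each sequence, in arrival order.
    max_tokens: per-batch token budget.
    Returns batches as lists of sequence indices.
    """
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(seq_lens):
        # Flush the current batch when adding this sequence would
        # exceed the budget (a lone oversized sequence still gets a batch).
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

With a 1024-token budget, two 512-token responses share a batch while a 1024-token one travels alone, which keeps GPU utilization steady as response lengths vary during PPO.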

Specialized optimizations for multimodal LLMs include using LLM tensor‑pipeline parallelism as data parallelism for the visual module, multi‑reuse of image features across vLLM, actor, and critic, and prefetching frozen visual models to overlap with training. Consistency between training and inference is enforced by using the same framework, local offload of reward‑model serving, and aligned NCCL groups and BLAS settings to avoid reward‑model score drift.
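The multi-reuse idea hinges on the visual module being frozen during PPO, so its output for a given image never changes. A hypothetical cache (names illustrative) makes the pattern clear: the encoder runs once per image, and rollout generation, the actor forward pass, and the critic forward pass all read the same features:

```python
class VisualFeatureCache:
    """Illustrative sketch of image-feature multi-reuse.

    Because the visual encoder is frozen, features computed once can be
    shared by every consumer (inference engine, actor, critic) instead
    of being recomputed per model.
    """
    def __init__(self, encode):
        self._encode = encode   # the frozen visual encoder (placeholder)
        self._cache = {}
        self.encoder_calls = 0  # for observing how often we actually encode

    def features(self, image_key, image):
        if image_key not in self._cache:
            self._cache[image_key] = self._encode(image)
            self.encoder_calls += 1
        return self._cache[image_key]
```

In practice the "cache key" would be an image hash and the features would live in GPU or pinned host memory, but the saving is the same: one encoder pass instead of three per image per step.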

To improve sampling efficiency, the team adopts Medusa speculative sampling with companion training to keep heads updated, yielding over 50% speed‑up in generation without precision loss. Additional sections cover observed gains in general capabilities (5‑20% improvement on zero‑shot benchmarks), process‑reward models (PRM) for better interpretability and a further ~5% boost, hyper‑parameter tuning tips, and future work centered on the MATRIX unified framework and algorithmic scaling‑law exploration.
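The "no precision loss" claim follows from how Medusa-style verification works under greedy decoding, which a small sketch can show (this is a simplification of the general scheme, not the team's code): the extra heads propose several next tokens at once, a single base-model forward pass checks them, and tokens are accepted only up to the first mismatch, so the output is identical to ordinary decoding:

```python
def accept_medusa_draft(draft_tokens, base_argmax):
    """Greedy-acceptance sketch of Medusa-style verification.

    draft_tokens: tokens proposed by the speculative heads.
    base_argmax: what the base model would have chosen at each of
    those positions (obtained from one batched forward pass).
    Accepted tokens are exactly those the base model agrees with.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, base_argmax):
        if drafted != verified:
            break  # base model disagrees; stop accepting here
        accepted.append(drafted)
    return accepted
```

When most drafts are accepted, several tokens are emitted per base-model pass, which is where the reported generation speed-up comes from; the companion training keeps the heads' acceptance rate high as the actor's weights move during PPO.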

Tags: performance, system optimization, RLHF, distributed training, PPO, Medusa, multimodal LLM, PRM
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
