Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)
Sol‑RL, a framework from researchers at NVIDIA, Hong Kong University, and MIT, integrates NVFP4 inference for large‑scale rollout exploration with BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1, and SD3.5‑L models.
When reinforcement‑learning‑based fine‑tuning of large‑scale diffusion models increases rollout size to improve preference alignment, inference cost becomes the primary bottleneck. Sol‑RL (Speed‑of‑light RL) addresses this by using a two‑stage "FP4 explore, BF16 train" pipeline that achieves up to 4.64× faster convergence at equivalent reward levels.
Research Background
In the post‑training phase of text‑to‑image models, reinforcement learning (RL) has proven effective for aligning generated images with human preferences. Prior work shows that expanding the rollout budget—generating many candidate images per prompt and selecting the n most and least preferred samples for contrastive learning—significantly improves alignment because it provides stronger gradient signals.
However, scaling rollout shifts the training bottleneck from parameter updates to the massive cost of generating candidate samples, especially for large diffusion models such as FLUX.1 and SD3.5‑L that require multiple inference passes. Directly applying low‑bit (FP4) quantized inference to these rollouts degrades training stability and final quality, so the key question is not whether FP4 can be used, but how it should be integrated into the training pipeline.
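As a minimal illustration of the selection step described above (function name, pool contents, and rewards are hypothetical, not the paper's implementation), ranking a scored candidate pool and keeping the n best and n worst samples looks like:

```python
def select_contrastive_pairs(scored_pool, n):
    """Pick the n most- and n least-preferred candidates from one prompt's
    rollout pool. scored_pool: list of (seed_id, reward) pairs."""
    ranked = sorted(scored_pool, key=lambda x: x[1], reverse=True)
    best = ranked[:n]    # highest-reward samples: positive contrastive signal
    worst = ranked[-n:]  # lowest-reward samples: negative contrastive signal
    return best, worst

# Toy pool of five candidates for a single prompt (illustrative rewards).
pool = [(0, 0.62), (1, 0.91), (2, 0.15), (3, 0.77), (4, 0.48)]
best, worst = select_contrastive_pairs(pool, n=1)
# best → [(1, 0.91)], worst → [(2, 0.15)]
```

The larger the pool, the sharper the contrast between `best` and `worst` — which is exactly why scaling rollout helps, and why its generation cost dominates.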
Core Innovation
Sol‑RL decouples rollout exploration and high‑fidelity generation into two distinct precision stages. Instead of using low‑precision samples throughout training, Sol‑RL lets NVFP4 rollout handle high‑throughput exploration, quickly filtering a large pool of initial noise seeds to retain only those with the highest and lowest reward scores. The selected seeds are then regenerated with BF16 precision for accurate sample generation and subsequent policy optimization.
The authors demonstrate that FP4 samples, while biased at the pixel level, preserve the relative reward ordering of BF16 samples for the same noise seed, making FP4 suitable for large‑scale candidate selection without harming the contrastive signal.
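This ordering claim can be checked with a rank correlation. A minimal sketch (pure Python, with illustrative rewards rather than measured ones) computes Spearman's ρ between FP4 and BF16 reward scores over the same noise seeds:

```python
def rank(values):
    """Map each value to its rank (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(a, b):
    """Spearman rank correlation; rho = 1 means identical ordering."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-seed rewards: the FP4 scores are biased downward,
# but their ordering over seeds matches BF16 exactly, so rho = 1.0.
fp4_rewards  = [0.50, 0.88, 0.12, 0.71]
bf16_rewards = [0.55, 0.93, 0.10, 0.76]
print(spearman(fp4_rewards, bf16_rewards))  # 1.0
```

A high ρ is all the selection step needs: absolute reward values can drift under quantization as long as the ranking — and hence the chosen top/bottom seeds — stays the same.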
Method Overview
The Sol‑RL workflow consists of two phases:
Phase 1: an NVFP4 rollout with reduced sampling steps rapidly creates a massive candidate pool, ranks candidates by reward, and extracts the top‑ and bottom‑scoring noise seeds.
Phase 2: BF16 precision regenerates high‑fidelity images from the retained seeds, and only these high‑quality samples are used for RL policy updates.
This design concentrates expensive BF16 computation on a small, high‑impact subset of candidates, while FP4 handles the bulk of exploration, improving both efficiency and stability.
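Put together, one training step's rollout could be sketched as follows — a hypothetical skeleton with stub models; `sol_rl_rollout`, `StubModel`, and the toy reward function are illustrative names, not the authors' API:

```python
from dataclasses import dataclass

@dataclass
class StubModel:
    """Stand-in for a diffusion sampler; a real model would denoise the
    seed into an image. Here generation just returns (seed, precision_tag)."""
    tag: str
    def generate(self, prompt, seed):
        return (seed, self.tag)

def sol_rl_rollout(prompt, seeds, fp4_model, bf16_model, reward_fn, n):
    # Phase 1: NVFP4 exploration — score the whole seed pool cheaply
    # (in the paper, with reduced sampling steps as well).
    scored = [(s, reward_fn(fp4_model.generate(prompt, s))) for s in seeds]
    scored.sort(key=lambda x: x[1], reverse=True)
    keep = [s for s, _ in scored[:n]] + [s for s, _ in scored[-n:]]
    # Phase 2: BF16 regeneration — high-fidelity samples from only the
    # retained top/bottom seeds; these feed the policy update.
    return [bf16_model.generate(prompt, s) for s in keep]

fp4, bf16 = StubModel("fp4"), StubModel("bf16")
# Toy reward: the seed value itself, so seed 7 ranks best and 0 worst.
samples = sol_rl_rollout("a prompt", range(8), fp4, bf16,
                         reward_fn=lambda img: img[0], n=1)
# samples → [(7, 'bf16'), (0, 'bf16')]
```

Note that every sample returned for the policy update carries the BF16 tag: the FP4 model's outputs are used only to rank seeds and are then discarded.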
Experimental Results
Across SANA, FLUX.1, and SD3.5‑L models, Sol‑RL consistently outperforms baseline methods. Under the same GPU‑hour budget, it reaches equivalent reward levels up to 4.64× faster and achieves higher alignment quality within a fixed wall‑clock time.
Time‑breakdown analysis shows that, compared with full‑precision rollout scaling, Sol‑RL accelerates the rollout stage by up to 2.41× and reduces training iteration time by up to 1.62×. The dual‑stage design mitigates the compute bottleneck of BF16‑only scaling and adds only ~2% overhead relative to an all‑FP4 low‑precision baseline.
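As a back‑of‑the‑envelope illustration of how the stage speedups compose (the rollout fraction below is an assumption for illustration, not a figure reported in the paper), an Amdahl‑style estimate:

```python
def iteration_speedup(rollout_frac, rollout_speedup):
    """Amdahl-style estimate: only the rollout fraction of an iteration
    is accelerated; the remainder (policy update, etc.) is unchanged."""
    return 1.0 / ((1.0 - rollout_frac) + rollout_frac / rollout_speedup)

# Assumed: rollout accounts for ~65% of a BF16-only iteration
# (an illustrative figure, not one reported in the paper).
print(round(iteration_speedup(0.65, 2.41), 2))  # 1.61
```

Under that assumed 65% rollout share, a 2.41× rollout speedup yields roughly a 1.61× iteration speedup — consistent in magnitude with the reported up‑to‑1.62× figure.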
Conclusion and Outlook
Sol‑RL demonstrates that assigning low‑precision FP4 to the exploration phase and high‑precision BF16 to the optimization phase yields a practical, scalable solution for large‑scale rollout in diffusion model RL fine‑tuning. This redefines the role of FP4 from a mere inference accelerator to an effective exploration proxy, offering a realistic path for researchers and engineers working on post‑training, preference alignment, low‑bit quantization, and system‑level optimization of generative models.
Paper: "FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling" (arXiv:2604.06916). Project page: https://nvlabs.github.io/Sana/Sol-RL/. Code: https://github.com/NVlabs/Sana/.