How Skrull Boosts Long-Context Fine‑Tuning Speed Up to 7.5×
The Skrull system, accepted at NeurIPS 2025, dynamically schedules long and short sequences during each training iteration, overlapping communication and computation to achieve up to 7.54× speedup for long‑context fine‑tuning of large language models while maintaining stability through load‑balancing and rollback mechanisms.
Research Background
Long‑context fine‑tuning (Long‑SFT) is essential for extending large language models to handle very long texts. Training datasets for Long‑SFT often exhibit a long‑tail or bimodal length distribution: the majority of sequences are short, while a minority are extremely long. This heterogeneity creates severe efficiency bottlenecks for existing context‑parallel training pipelines because attention FLOPs grow quadratically with sequence length while memory usage grows linearly.
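To see the scale of that gap, here is a quick back‑of‑the‑envelope comparison (illustrative numbers, not measurements from the paper):

```python
# Illustrative numbers only: self-attention FLOPs scale with L^2,
# while KV-cache / activation memory scales with L.
short_len, long_len = 4_096, 131_072          # tokens

flops_ratio = (long_len / short_len) ** 2     # quadratic in length
memory_ratio = long_len / short_len           # linear in length

print(f"FLOPs ratio:  {flops_ratio:,.0f}x")   # 1,024x the compute
print(f"Memory ratio: {memory_ratio:,.0f}x")  # only 32x the memory
```

A sequence 32× longer costs roughly 1,024× the attention compute but only about 32× the memory, which is why a handful of long stragglers can dominate an entire iteration.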
Method Overview (Skrull)
Skrull addresses these bottlenecks by dynamically partitioning the sequences of each training iteration into two independent groups:
Distributed‑compute group: processed with the standard context‑parallel pipeline across multiple GPUs, exchanging KV‑cache as usual.
Local‑compute group: assigned entirely to a single GPU so that the full sequence resides locally, eliminating inter‑GPU communication for that group.
Because the two groups have no data dependency, their communication and computation phases can overlap, improving overall throughput.
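A minimal sketch of that overlap in a PyTorch setting, assuming the distributed group's KV exchange runs as an asynchronous collective; attention_with_remote_kv is a hypothetical placeholder, not an API from the paper or from PyTorch:

```python
import torch.distributed as dist

def overlapped_step(model, local_group, kv_send_buf, kv_recv_buf, dist_group):
    # Kick off the KV exchange for the distributed-compute group
    # without blocking: async_op=True returns a work handle.
    work = dist.all_to_all_single(kv_recv_buf, kv_send_buf, async_op=True)

    # While those tensors are in flight, run the local-compute group.
    # Each sequence here lives entirely on this GPU, so its forward
    # pass needs no inter-GPU communication.
    local_out = [model(seq) for seq in local_group]

    # Only now block on the collective, then finish the distributed
    # group's attention using the received KV shards.
    work.wait()
    dist_out = attention_with_remote_kv(dist_group, kv_recv_buf)  # hypothetical helper
    return local_out, dist_out
```

Because the local group never touches the interconnect, its compute fills the window that would otherwise be spent waiting on the collective.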
Load Balancing and Safety Mechanisms
Skrull monitors the actual FLOPs executed on each GPU to assess load imbalance. A heuristic scheduler uses these statistics to decide which sequences should belong to the local‑compute group while respecting a hard memory constraint called BucketSize. BucketSize is the maximum total token length that a single GPU can hold for the local group, derived from the linear relationship between sequence length and memory consumption.
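A plain greedy packing illustrates how the BucketSize constraint might be enforced; the function name and the shortest‑first policy are illustrative assumptions, since the paper's heuristic also weighs the measured per‑GPU FLOPs:

```python
def schedule_local_group(seq_lens, bucket_size):
    """Greedily keep sequences on the local path while their total
    token count fits under BucketSize; everything else goes to the
    distributed context-parallel path."""
    local, distributed, used = [], [], 0
    # Shortest-first packing keeps many cheap sequences local and
    # reserves the context-parallel path for the long tail.
    for length in sorted(seq_lens):
        if used + length <= bucket_size:
            local.append(length)
            used += length
        else:
            distributed.append(length)
    return local, distributed

local, distributed = schedule_local_group([512, 1024, 96_000, 2048, 300],
                                          bucket_size=4096)
# local -> [300, 512, 1024, 2048]; distributed -> [96000]
```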
If assigning a batch would exceed BucketSize, Skrull triggers a rollback that discards the offending batch, preventing out‑of‑memory crashes. The scheduler also operates at the global‑batch level: it sorts sequences by length and interleaves long and short examples so that each micro‑batch receives a balanced mix, further smoothing the FLOP distribution.
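As a sketch of that balancing step, the snippet below uses a standard longest‑first (LPT) greedy heuristic, with token totals standing in for FLOPs; this illustrates the idea rather than the paper's exact algorithm:

```python
import heapq

def balance(seq_lens, num_micro_batches):
    """Deal sequences, longest first, into whichever micro-batch is
    currently lightest, so token totals stay roughly even."""
    heap = [(0, i) for i in range(num_micro_batches)]  # (token total, batch index)
    micro_batches = [[] for _ in range(num_micro_batches)]
    for length in sorted(seq_lens, reverse=True):
        total, i = heapq.heappop(heap)                 # lightest batch so far
        micro_batches[i].append(length)
        heapq.heappush(heap, (total + length, i))
    return micro_batches

print(balance([9000, 1000, 8000, 3000, 5000, 4000], num_micro_batches=2))
# [[9000, 4000, 3000], [8000, 5000, 1000]] -- token totals 16000 vs 14000
```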
Experimental Evaluation
Experiments were conducted on Qwen‑0.5B and Qwen‑7B models using three representative datasets covering long‑tail and bimodal length distributions. Skrull was compared against DeepSpeed ZeRO‑2 and a baseline that simply sorts batches by length.
Average speedup over baselines: 3.76×
Peak speedup observed: 7.54×
Additional studies varied BatchSize and BucketSize, evaluated larger model sizes, and verified compatibility with LoRA fine‑tuning. Ablation experiments showed that both the heuristic scheduling and the rollback mechanism contribute substantially to the performance gains.
Paper Information
Title: Skrull: Towards Efficient Long Context Fine‑tuning through Dynamic Data Scheduling
Authors: Hongtao Xu, Wenting Shen, Yuanxin Wei, Ang Wang, Guo Runfan, Tianxing Wang, Yong Li, Mingzhen Li, Weile Jia
ArXiv link: https://arxiv.org/abs/2505.19609
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba's leading cloud infrastructure, big‑data and AI engineering capabilities, scenario‑based algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native suite of big‑data and AI capabilities, boosting AI development efficiency, enabling large‑scale AI deployment across industries, and driving business value.