Dynamic Difficulty-Adaptive Training Gains Momentum: Huawei’s EDCO Cited at ICML 2026
EDCO, Huawei’s entropy‑based dynamic curriculum method, continuously selects the most uncertain samples for domain‑specific LLM fine‑tuning, achieving higher accuracy and more stable gradients across communication, medical, and legal tasks while cutting entropy‑estimation cost by over 80 %.
Background
In domain‑specific LLM fine‑tuning, data scarcity and high acquisition cost make indiscriminate scaling ineffective. Traditional static curricula (easy‑to‑hard) or random sampling ignore the model’s evolving competence.
EDCO Overview
Huawei GTS’s AI Data team proposes Entropy‑based Dynamic Curriculum Orchestration (EDCO). The method estimates the inference entropy of each training sample with the current model, selects the highest‑entropy samples as the next curriculum, and repeats the loop.
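The estimate, select, train loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Huawei's implementation: the names `select_top_n`, `run_dynamic_curriculum`, `estimate_entropy`, and `train_on` are hypothetical, and the two callables stand in for whatever model-specific entropy estimator and fine-tuning step are actually used.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def select_top_n(samples: Sequence[S], entropies: Sequence[float], n: int) -> List[S]:
    """Return the n samples with the highest entropy (ties keep pool order)."""
    order = sorted(range(len(samples)), key=lambda i: entropies[i], reverse=True)
    return [samples[i] for i in order[:n]]


def run_dynamic_curriculum(
    pool: Sequence[S],
    estimate_entropy: Callable[[S], float],  # hypothetical: scores a sample with the current model
    train_on: Callable[[List[S]], None],     # hypothetical: one training interval on the curriculum
    n: int,
    num_intervals: int,
) -> List[List[S]]:
    """Re-score the whole pool each interval and train on the top-n samples."""
    history: List[List[S]] = []
    for _ in range(num_intervals):
        # Entropies are recomputed every interval because the model has changed.
        entropies = [estimate_entropy(s) for s in pool]
        curriculum = select_top_n(pool, entropies, n)
        train_on(curriculum)
        history.append(curriculum)
    return history
```

Because the pool is re-scored after every interval, samples the model has already learned fall out of the curriculum on their own, without a fixed easy-to-hard ordering.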
Key components:
Entropy as difficulty signal – higher inference entropy indicates that the model is uncertain, so the sample provides a stronger learning signal.
Prefix‑entropy approximation – quick‑answer prompting, followed by computing conditional entropy on only the first few generated tokens, reduces per‑sample cost from 2.24 s to 0.37 s (≈ 83.5 % saving).
Dynamic top‑N selection – at each interval, sample entropies are re‑estimated with the current model and the top‑N highest‑entropy samples form the next training batch.
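The prefix‑entropy idea can be illustrated with a short Python sketch. It assumes the model's next‑token probability distributions for the generated answer are already available as plain lists. The function names, the natural‑log base, and averaging over the first `k` positions are illustrative assumptions, since the article does not specify the prefix length or the aggregation used.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prefix_entropy(step_distributions: Sequence[Sequence[float]], k: int) -> float:
    """Mean token entropy over the first k generated positions.

    Only the prefix is scored, so the full answer never has to be decoded.
    Averaging (rather than summing) is an assumption made for this sketch.
    """
    prefix = step_distributions[:k]
    if not prefix:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in prefix) / len(prefix)


# Toy example over a 4-token vocabulary: the first step is maximally
# uncertain, the second is fully confident, the third is ignored (k=2).
steps = [
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
score = prefix_entropy(steps, k=2)  # equals log(4) / 2
```

Scores like this one would then be ranked across the pool, with the top‑N samples forming the next batch.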
Experimental Setup
Experiments cover three domains (communication, medical, legal) using two backbone models (Qwen‑3‑4B and Llama‑3.2‑3B) and two fine‑tuning paradigms (SFT and RLFT). In the communication domain, two tasks are defined: Wireless (network‑optimization) and Datacom (multi‑vendor log analysis).
Results
RLFT results: on Datacom, EDCO achieves 46.96 % accuracy, surpassing random sampling (40.43 %) and the PPL‑based curriculum (44.78 %). On Wireless, EDCO reaches 38.70 %, also above the baselines.
SFT results: Wireless 33.7 %, Datacom 36.3 %; MedQA 36.7 %; JEC‑QA 17.4 % – all highest among compared methods.
Compared with the Dynamic‑PPL and SEC baselines on Datacom, EDCO reaches 47.0 % versus 41.3 % and 34.78 %, respectively, highlighting the importance of the entropy signal.
Gradient analysis on MedQA (Qwen‑3‑1.7B) shows that EDCO‑selected batches have a gradient direction consistency of 0.92 (vs 0.82 for random sampling), an average inference entropy of 1.51 (vs 1.23), and an RL gradient norm of 3.77 (vs 2.62), indicating stronger and less conflicting learning signals.
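The article does not define how gradient direction consistency is computed. One common way to quantify agreement between per‑sample gradients is their mean pairwise cosine similarity, sketched below in Python. This is an assumed stand‑in for illustration only, and the function names are hypothetical. A value near 1 means the gradients point the same way, while values near 0 indicate unrelated or conflicting updates.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(grads: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of gradients."""
    pairs = list(combinations(grads, 2))
    if not pairs:
        raise ValueError("need at least two gradient vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```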
Mechanism Insight
EDCO maintains higher inference entropy throughout training, preventing the premature confidence collapse seen with static curricula. Sample turnover analysis shows that about 3000 new samples enter the curriculum after the first interval; later intervals keep adding previously unseen high‑entropy samples while retaining difficult examples that the model has not yet mastered.
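Turnover between two consecutive curricula can be measured with simple set operations. The sketch below assumes each sample has a hashable ID. The function name `curriculum_turnover` is hypothetical and only illustrates the bookkeeping, not the authors' analysis code.

```python
from typing import Hashable, Iterable, Set, Tuple


def curriculum_turnover(
    previous: Iterable[Hashable], current: Iterable[Hashable]
) -> Tuple[Set[Hashable], Set[Hashable], Set[Hashable]]:
    """Split the current curriculum into newly entered and retained samples,
    and report which samples from the previous curriculum were dropped."""
    prev_set, cur_set = set(previous), set(current)
    entered = cur_set - prev_set    # newly selected high-entropy samples
    retained = cur_set & prev_set   # lingering difficult samples
    dropped = prev_set - cur_set    # samples that left the curriculum
    return entered, retained, dropped
```

Tracking the size of `entered` per interval gives the kind of turnover count the article reports, while `retained` captures the lingering hard examples.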
Efficiency
Prefix‑entropy estimation reduces per‑sample cost by 83.5 % (from 2.24 s to 0.37 s); on 8 GPUs the per‑sample time drops further to 0.04 s, making the dynamic curriculum practical for large pools.
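The reported saving follows directly from the two per‑sample timings (2.24 s and 0.37 s), and the same figures give a rough sense of what one re‑scoring pass over a pool costs. The short Python sketch below reproduces the arithmetic. The pool size of 10,000 samples is a hypothetical value chosen for illustration, not a figure from the article.

```python
def relative_saving(baseline_s: float, reduced_s: float) -> float:
    """Fraction of the baseline per-sample cost that is saved."""
    return (baseline_s - reduced_s) / baseline_s


def pass_time_seconds(pool_size: int, per_sample_s: float) -> float:
    """Total time for one entropy-estimation pass over the pool."""
    return pool_size * per_sample_s


saving = relative_saving(2.24, 0.37)           # about 0.835, i.e. 83.5 %

hypothetical_pool = 10_000                     # assumed pool size, not from the article
full_pass = pass_time_seconds(hypothetical_pool, 2.24)      # full-answer entropy, one GPU
prefix_pass = pass_time_seconds(hypothetical_pool, 0.37)    # prefix entropy, one GPU
parallel_pass = pass_time_seconds(hypothetical_pool, 0.04)  # reported 8-GPU per-sample time
```

Under these assumptions a single pass falls from over six hours to under seven minutes, which is what makes re‑scoring the pool at every interval affordable.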
Conclusion
EDCO demonstrates that data value is a function of the model’s current state. By driving curriculum with inference entropy and keeping overhead low, it improves fine‑tuning performance across multiple domains without altering model architecture or training objectives, and works with both SFT and RLFT.