Unlocking Unusual Concept Combinations in Generative AI with IMBA Loss
The paper identifies imbalanced concept distributions as the main obstacle to combining arbitrary concepts in text‑to‑image/video generation, proposes the token‑level IMBA Distance and a lightweight IMBA Loss that adaptively re‑weights training tokens, and demonstrates through extensive experiments and a new Inert‑CompBench benchmark that this loss markedly improves compositional ability without extra data.
Motivation
State‑of‑the‑art diffusion models (e.g., Stable Diffusion 3, DALL·E 3) often fail to generate images that correctly combine arbitrary concepts, especially when the desired relation is rare or counter‑intuitive. Typical failure modes include missing concepts, attribute leakage, and generated images that are inconsistent with their prompts.
Factors Influencing Concept Combination
Large‑scale experiments on a high‑quality dataset of 31M text‑image pairs show that, once model capacity and data volume reach a sufficient scale, the imbalance of concept frequencies in the training set becomes the dominant bottleneck. With model size and total data held fixed, models trained on a balanced concept distribution consistently achieve higher compositional performance.
Adaptive Concept‑Balancing Pre‑training Loss (IMBA Loss)
The authors introduce IMBA Distance, defined as the L_γ norm of the difference between the ground‑truth noise vector ε_gt and the model's unconditional predicted noise ε_pred at the token level:

    IMBA_Distance(t) = ‖ ε_gt(t) − ε_pred(t) ‖_γ

Because the distance is larger for under‑represented (tail) concepts, it serves as a precise, token‑wise measure of concept imbalance. During training the distance is used as a dynamic weight for each token, yielding the IMBA Loss:
    loss = diffusion_loss * (1 + λ * IMBA_Distance)

Only a few lines of code are required to compute the per‑token weight and add it to the standard diffusion objective; a sketch follows below. The resulting loss improves the model's ability to generate novel concept combinations in both the pre‑training and fine‑tuning stages and generalises to video diffusion models.
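Below is a minimal PyTorch sketch of this weighting, assuming an ε‑prediction diffusion objective over tensors of shape (batch, tokens, channels). The function name imba_loss, the tensor layout, the detached weight, and the hyper‑parameters lam and gamma are illustrative assumptions, not the authors' released code:

    import torch

    def imba_loss(eps_gt, eps_cond, eps_uncond, lam=1.0, gamma=2.0):
        # Token-level IMBA Distance: L_gamma norm of the gap between the
        # ground-truth noise and the *unconditional* prediction, taken
        # over the channel dimension. Larger for rare (tail) concepts.
        imba_dist = (eps_gt - eps_uncond).abs().pow(gamma).sum(-1).pow(1.0 / gamma)

        # Detach so the weight only re-scales the loss and no gradient
        # flows back through the unconditional branch.
        weight = (1.0 + lam * imba_dist).detach()                 # (B, T)

        # Standard per-token diffusion loss (MSE on the predicted noise).
        diffusion_loss = (eps_cond - eps_gt).pow(2).mean(-1)      # (B, T)

        # Up-weight tokens carrying under-represented concepts.
        return (weight * diffusion_loss).mean()

    # Illustrative usage with random tensors standing in for real noise terms.
    B, T, D = 4, 77, 64
    eps_gt = torch.randn(B, T, D)
    eps_cond = torch.randn(B, T, D, requires_grad=True)
    eps_uncond = torch.randn(B, T, D)
    imba_loss(eps_gt, eps_cond, eps_uncond, lam=0.5).backward()

Detaching the weight is consistent with using IMBA Distance purely as a measurement: the unconditional prediction supplies the imbalance signal, but the weighting term itself does not push it toward the ground‑truth noise.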
Inert‑CompBench: Benchmark for Tail (Lazy) Concepts
Statistical analysis of failure cases reveals that low‑frequency (tail) concepts cause the majority of compositional errors; the authors refer to these as “lazy concepts.” Using a controlled construction procedure (Algorithm 2), they build Inert‑CompBench, a benchmark that evaluates model performance specifically on such tail concepts, complementing existing compositionality benchmarks.
Key Experimental Findings
When model size and data scale are fixed, a balanced concept distribution yields a ~15 % boost in compositional accuracy compared to the original long‑tail distribution.
IMBA Distance is empirically larger for tail concepts across both synthetic and real text‑to‑image evaluations.
Integrating IMBA Loss into the diffusion objective improves zero‑shot compositional generation without any additional data; the improvement persists after fine‑tuning on downstream tasks.
The benchmark Inert‑CompBench exposes a systematic drop in performance on lazy concepts for baseline models, while IMBA‑trained models close the gap.
Conclusion
The study demonstrates that concept‑distribution imbalance is the primary obstacle to arbitrary composition in generative models. By introducing a lightweight, adaptive IMBA Loss that re‑weights token‑level training signals according to IMBA Distance, the authors achieve substantial compositional gains without extra data. The newly proposed Inert‑CompBench provides a focused evaluation suite for future work on rare‑concept composition.