How ACCORD Breaks Concept Coupling in Custom Text‑to‑Image Generation
The ACCORD framework formalizes the concept‑coupling issue in text‑to‑image diffusion models as a statistical dependency problem and resolves it with two plug‑and‑play regularization losses, dramatically improving fidelity and text control without altering model architecture.
Introduction
Custom text‑to‑image generation aims to teach diffusion models specific private concepts—such as a personal pet or a unique product—using only a few reference images. Existing methods often suffer from "concept coupling", where the target concept becomes unintentionally bound to surrounding context in the limited training images.
Root Cause and Quantification
The authors define a Conditional Dependence Coefficient that measures the joint probability of a custom target (e.g., a red backpack) appearing together with an unrelated context element (e.g., a girl), relative to the product of their independent probabilities. A coefficient for the target–context pair that is significantly higher than the one for the parent-concept–context pair indicates unwanted statistical dependence.
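The coefficient can be estimated empirically from occurrence counts over a batch of generated images. A minimal sketch, assuming a detection-count setup (the function name and counting scheme are illustrative, not the paper's code):

```python
# Conditional Dependence Coefficient sketch:
# D(a, b) = P(a, b) / (P(a) * P(b)), where 1.0 indicates independence
# and values well above 1.0 indicate concept coupling.

def dependence_coefficient(n_joint, n_a, n_b, n_total):
    """Estimate D(a, b) from occurrence counts over n_total generated
    images: n_joint images contain both concepts, n_a / n_b each one."""
    p_joint = n_joint / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_joint / (p_a * p_b)

# Illustrative numbers: the custom "red backpack" co-occurs with "girl"
# in 40 of 100 samples, far above the 25 predicted by independence
# (0.5 * 0.5 * 100), while the parent concept "backpack" sits near chance.
coupled = dependence_coefficient(n_joint=40, n_a=50, n_b=50, n_total=100)
parent = dependence_coefficient(n_joint=26, n_a=50, n_b=50, n_total=100)
```

Here `coupled` evaluates to 1.6 versus 1.04 for `parent`, which is exactly the gap the coefficient is designed to expose.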
Analysis reveals two distinct sources of this bias:
Denoising Dependence Discrepancy: The bias accumulates across the iterative denoising steps of the diffusion process.
Prior Dependence Discrepancy: Fine‑tuning shifts the learned representation of the custom concept, disrupting its original dependency network.
Proposed Regularization Losses
DDLoss (Denoising Decoupling Loss)
DDLoss penalizes changes in the conditional dependence between adjacent denoising timesteps, effectively reminding the model not to increase the binding between the custom target and unrelated concepts at any step.
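The step-to-step penalty can be sketched as follows, assuming a scalar dependence estimate per denoising timestep (e.g., derived from cross-attention overlap between the target and context tokens); the function name and the use of squared differences are assumptions, not the authors' implementation:

```python
import numpy as np

# DDLoss-style sketch: penalize changes in the target-context dependence
# between adjacent denoising timesteps, so the binding cannot silently
# accumulate over the iterative denoising trajectory.

def dd_loss(dep_scores):
    """dep_scores: one dependence estimate per denoising timestep.
    Returns the mean squared discrepancy between adjacent timesteps."""
    deps = np.asarray(dep_scores, dtype=float)
    diffs = deps[1:] - deps[:-1]  # change across adjacent steps
    return float(np.mean(diffs ** 2))

# A flat dependence trajectory incurs no penalty; a rising one does.
flat = dd_loss([0.3, 0.3, 0.3, 0.3])
rising = dd_loss([0.1, 0.3, 0.5, 0.7])
```

`flat` is 0.0 while `rising` is positive, matching the intuition that only drift in the dependence across steps is penalized, not its absolute level.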
PDLoss (Prior Decoupling Loss)
PDLoss leverages CLIP’s semantic space to align the cosine similarity between the custom target and generic text concepts with the similarity between the parent concept and the same texts, correcting the shifted prior dependencies.
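The prior alignment reduces to matching two similarity profiles in CLIP's text-embedding space. A minimal sketch with raw embedding vectors (the helper names, the MSE formulation, and the choice of generic concepts are assumptions for illustration):

```python
import numpy as np

# PDLoss-style sketch: the custom target's cosine-similarity profile
# against a set of generic text concepts should match the parent
# concept's profile, correcting prior dependencies shifted by fine-tuning.

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pd_loss(target_emb, parent_emb, generic_embs):
    """Mean squared error between the similarity profiles of the custom
    target and its parent concept over generic concept embeddings
    (e.g. CLIP text features for "dog", "street", "painting")."""
    sims_target = np.array([cosine_sim(target_emb, g) for g in generic_embs])
    sims_parent = np.array([cosine_sim(parent_emb, g) for g in generic_embs])
    return float(np.mean((sims_target - sims_parent) ** 2))
```

If the fine-tuned target embedding drifts away from its parent concept, its similarity profile diverges and the loss rises; when the profiles agree, the loss is zero.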
arXiv: https://arxiv.org/abs/2503.01122
Github: https://github.com/antgroup/ACCORD
Experimental Results
Both losses are lightweight and architecture‑agnostic: they require no extra regularization datasets and can be seamlessly attached to existing fine‑tuning pipelines. Evaluations on DreamBench (object customization), StyleBench (style customization), and FFHQ (face customization) show that ACCORD consistently mitigates concept coupling while substantially improving text controllability and preserving subject fidelity, breaking the traditional trade‑off between fidelity and control.
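Attaching the regularizers to an existing pipeline amounts to adding two weighted terms to the standard objective. A trivial sketch; the weight names and default values are illustrative, not the paper's hyperparameters:

```python
# Plug-and-play sketch: the usual diffusion denoising loss plus the two
# decoupling regularizers, each scaled by an illustrative weight.

def total_loss(denoising_loss, dd_term, pd_term, lam_dd=0.1, lam_pd=0.1):
    """Combine the base fine-tuning loss with DDLoss- and PDLoss-style
    terms; no architectural change is required."""
    return denoising_loss + lam_dd * dd_term + lam_pd * pd_term
```

Because the regularizers only add scalar terms to the objective, any optimizer and any fine-tuning method (full, LoRA, textual inversion) can remain unchanged.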
Conclusion
ACCORD demonstrates that introducing statistically grounded regularization provides a clear and rigorous path to enable custom generation that both remembers specific objects and retains creative flexibility.
