Why Data, Not Architecture, Drives Locality in Diffusion Models
A recent MIT‑Toyota study shows that the locality observed in image diffusion models emerges from the statistical structure of the training data rather than from architectural biases. A simple linear denoiser replicates this behavior, which changes how we should think about model design.
What Diffusion Models Do
Text‑to‑image diffusion models generate realistic images by iteratively denoising random noise. The conventional view attributes their success to the architectural inductive biases of the convolutional U‑Net, chiefly locality and shift equivariance:
Locality: Convolutional kernels process small overlapping patches, which builds in the assumption that nearby pixels are more strongly correlated than distant ones.
Shift equivariance: Translating an object in the input produces a corresponding translation in the output.
The “Perfect” Denoiser
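In the diffusion framework, the theoretically optimal denoiser returns the expected clean image given the noisy input. When that expectation is taken over a finite training set, the result is a softmax‑weighted average of training images that collapses to the nearest training image as the noise level falls. Such a denoiser therefore behaves like a nearest‑neighbor memory lookup: it reproduces training images but cannot generate novel content.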
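The memorizing behavior can be illustrated with a minimal NumPy sketch. It assumes additive Gaussian noise with a known standard deviation `sigma` and flattened images; under those assumptions the mean‑squared‑error‑optimal denoiser for a finite training set is a softmax‑weighted average of the training images. The function name `optimal_denoiser` and the random toy data are illustrative and do not come from the paper.

```python
import numpy as np

def optimal_denoiser(noisy, train, sigma):
    """Posterior-mean denoiser under the empirical training distribution.

    noisy: (d,) noisy image, assumed to be clean + sigma * Gaussian noise
    train: (n, d) flattened training images
    """
    # Squared distance from the noisy input to every training image.
    sq_dist = np.sum((train - noisy) ** 2, axis=1)
    # Softmax weights over training images; subtract the max for stability.
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted average of the training images.
    return weights @ train

# Toy data (assumption): 50 random 16-pixel "images".
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 16))
noisy = train[3] + 0.05 * rng.normal(size=16)

low_noise = optimal_denoiser(noisy, train, sigma=0.05)    # near-lookup regime
high_noise = optimal_denoiser(noisy, train, sigma=100.0)  # near-uniform weights
print("distance to train[3] at low noise:", np.linalg.norm(low_noise - train[3]))
print("distance to dataset mean at high noise:",
      np.linalg.norm(high_noise - train.mean(axis=0)))
```

At low noise the weights concentrate on a single training image, which is the lookup behavior described above; at very high noise they flatten and the output approaches the dataset mean.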
Linear Denoiser as a Near‑Parameter‑Free Model
The authors construct a linear denoiser that learns only the pairwise pixel correlations across the entire training set, without any built‑in locality assumptions. This model acts as a statistical estimator that captures the second‑order statistics of the data.
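One way to make this concrete is the classical linear minimum‑mean‑squared‑error (Wiener) estimator, which depends on the data only through its mean and covariance. The sketch below assumes that form, additive Gaussian noise of standard deviation `sigma`, and a one‑dimensional toy signal with correlated neighbors standing in for images; `fit_linear_denoiser` is a hypothetical helper name, and the paper's exact construction may differ.

```python
import numpy as np

def fit_linear_denoiser(train, sigma):
    """Return (mean, W) so that denoise(y) = mean + W @ (y - mean).

    Only the first- and second-order statistics of `train` are used.
    """
    mean = train.mean(axis=0)
    centered = train - mean
    cov = centered.T @ centered / len(train)  # (d, d) pixel covariance
    d = cov.shape[0]
    # W = cov @ inv(cov + sigma^2 I), computed with a linear solve.
    W = np.linalg.solve(cov + sigma ** 2 * np.eye(d), cov).T
    return mean, W

# Toy data (assumption): 1-D signals whose correlation decays as 0.9**distance.
rng = np.random.default_rng(0)
d, sigma = 32, 0.5
idx = np.arange(d)
true_cov = 0.9 ** np.abs(idx[:, None] - idx[None, :])
chol = np.linalg.cholesky(true_cov)
train = rng.normal(size=(5000, d)) @ chol.T
mean, W = fit_linear_denoiser(train, sigma)

# Evaluate on fresh signals drawn from the same distribution.
clean = rng.normal(size=(1000, d)) @ chol.T
noisy = clean + sigma * rng.normal(size=clean.shape)
denoised = mean + (noisy - mean) @ W.T
mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print("MSE before:", mse_noisy, "after:", mse_denoised)
```

Nothing in `W` is constrained to be local: it is a dense matrix, and any spatial structure it shows comes from the covariance of the data.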
Sensitivity Fields
To understand where a denoiser “looks” when reconstructing a central pixel, the study visualizes sensitivity fields: the gradient of that output pixel with respect to every input pixel.
Figure 1 shows the optimal denoiser merely copying nearest‑neighbor patches, while a deep U‑Net exhibits a compact, roughly circular sensitivity region, confirming the conventional view of strong locality.
Figure 2 demonstrates that the linear model, despite lacking any locality prior, learns a sensitivity field almost identical to the deep U‑Net, indicating that locality can emerge from data statistics alone.
On CIFAR‑10 the sensitivity field retains structure that mirrors object shapes, further supporting the data‑driven emergence of locality.
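The sensitivity field defined above can be estimated for any denoiser with finite differences. The sketch below is a toy illustration under stated assumptions: a one‑dimensional signal with correlation 0.9**distance stands in for images, the linear denoiser is taken to be the Wiener form `W = cov @ inv(cov + sigma^2 I)`, and `sensitivity_field` is a hypothetical helper name. For a linear denoiser the field of output pixel `i` is simply row `i` of `W`.

```python
import numpy as np

def sensitivity_field(denoise, x, pixel, eps=1e-3):
    """Central finite-difference estimate of d denoise(x)[pixel] / d x[j] for all j."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (denoise(x + step)[pixel] - denoise(x - step)[pixel]) / (2 * eps)
    return grad

# Toy second-order statistics (assumption): correlation decays as 0.9**distance.
d, sigma, center = 33, 0.5, 16
idx = np.arange(d)
cov = 0.9 ** np.abs(idx[:, None] - idx[None, :])

# Dense linear denoiser with no locality constraint.
W = np.linalg.solve(cov + sigma ** 2 * np.eye(d), cov).T
denoise = lambda y: W @ y

field = sensitivity_field(denoise, np.zeros(d), center)
near = np.abs(field[center - 4:center + 5]).sum() / np.abs(field).sum()
print(f"share of sensitivity within 4 pixels of the center: {near:.3f}")
```

In this toy setting most of the sensitivity sits close to the central pixel even though the model is a dense matrix, which mirrors the article's point that locality can come from the data covariance rather than from the architecture.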
From Statistics to Generation
The authors compare the analytical linear model with a fully trained state‑of‑the‑art DDPM. The linear model produces slightly blurrier images but preserves correct color, shape, and semantic structure.
This result suggests that a large portion of generative ability stems from learned second‑order pixel correlations rather than from complex architectural design.
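To show how a purely linear denoiser can drive generation, the sketch below plugs a Wiener‑form denoiser into a deterministic DDIM‑style sampler. Everything here is an assumption made for illustration: the variance‑exploding noise schedule, the update rule, the function name `sample_linear_diffusion`, and the toy covariance standing in for dataset statistics. It is not the paper's sampling procedure.

```python
import numpy as np

def sample_linear_diffusion(mean, cov, sigmas, n_samples, rng):
    """Generate samples using only the mean and covariance of the data.

    sigmas: decreasing noise levels; x_t is modeled as x_0 + sigma_t * noise.
    """
    # Eigendecomposition lets the denoiser be rebuilt cheaply at every noise level.
    evals, evecs = np.linalg.eigh(cov)
    evals = np.clip(evals, 0.0, None)
    # Start from pure noise at the largest noise level.
    x = mean + sigmas[0] * rng.normal(size=(n_samples, mean.size))
    for s_now, s_next in zip(sigmas[:-1], sigmas[1:]):
        # Linear denoiser at this noise level: shrink each eigen-direction.
        gain = evals / (evals + s_now ** 2)
        x0_hat = mean + ((x - mean) @ evecs * gain) @ evecs.T
        # Deterministic step: keep the predicted noise direction, lower its scale.
        x = x0_hat + (s_next / s_now) * (x - x0_hat)
    return x

# Toy second-order statistics (assumption): correlation decays as 0.9**distance.
rng = np.random.default_rng(0)
d = 8
idx = np.arange(d)
cov = 0.9 ** np.abs(idx[:, None] - idx[None, :])
mean = np.zeros(d)

sigmas = np.geomspace(80.0, 0.01, 500)
samples = sample_linear_diffusion(mean, cov, sigmas, 4000, rng)
sample_cov = np.cov(samples, rowvar=False)
print("max covariance error:", np.abs(sample_cov - cov).max())
```

The samples reproduce the second‑order statistics they were built from and nothing more, which is consistent with the article's description of the linear model's outputs as blurrier than those of a trained network.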
Implications for Engineering and Research
When abundant high‑quality data are available, many handcrafted inductive biases may be unnecessary.
Simple linear denoisers can be used as diagnostic tools to probe dataset statistics before committing to deeper architectures.
Data curation, bias mitigation, and statistical analysis become the primary factors determining model performance.
Conclusion
The paper demonstrates that locality in diffusion models is a natural consequence of second‑order pixel statistics. Extending the analysis to higher‑order statistics (e.g., pixel triplets) could further improve simple models, indicating a promising direction for future research.
“We provide evidence that locality in deep diffusion models is driven by the statistical properties of image datasets, not by the inductive bias of convolutional networks.”
Paper: https://arxiv.org/abs/2509.09672
Source: DeepHub IMBA
