Why Data, Not Architecture, Drives Locality in Diffusion Models

A recent MIT-Toyota study argues that the locality observed in image diffusion models emerges from the statistical structure of the training data rather than from architectural biases. A simple linear denoiser reproduces the same behavior, which challenges common assumptions about model design.

Data Party THU

What Diffusion Models Do

Text-to-image diffusion models generate realistic images by iteratively denoising random noise. The conventional view attributes their success to architectural inductive biases, chiefly the locality and shift equivariance of the convolutional U-Net.

Locality: Convolutional kernels process overlapping patches, assuming nearby pixels are more strongly correlated.

Shift equivariance: Translating an object in the input results in a corresponding translation in the output.

The “Perfect” Denoiser

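In the diffusion framework, the theoretically optimal denoiser returns the expected clean image given the noisy input. When the data distribution is the finite training set itself, that expectation is a weighted average of training images, and at low noise it collapses onto the nearest one. Such a denoiser behaves like a nearest-neighbor memory lookup: it reproduces training images but cannot generate novel content.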
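The lookup behavior can be made concrete with a small sketch. It assumes additive Gaussian noise of known standard deviation `sigma` and a uniform prior over the training set. These are textbook assumptions rather than details confirmed by the article, and the toy data and the name `optimal_denoiser` are illustrative only.

```python
import numpy as np

def optimal_denoiser(y, train, sigma):
    """Posterior mean E[x | y] when x is uniform over `train`
    and y = x + sigma * standard Gaussian noise (assumed noise model)."""
    # Squared distance from the noisy input to every training image.
    d2 = np.sum((train - y) ** 2, axis=1)
    # Log posterior weights, shifted by the max for numerical stability.
    logits = -d2 / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Weighted average of training images.
    return w @ train

rng = np.random.default_rng(0)
train = rng.normal(size=(5, 16))            # five toy "images" of 16 pixels
y = train[2] + 0.05 * rng.normal(size=16)   # lightly noised copy of image 2
x_hat = optimal_denoiser(y, train, sigma=0.05)
print(np.abs(x_hat - train[2]).max())       # distance to the stored image
```

At this low noise level the weights concentrate almost entirely on one training image, so the output is that image again rather than anything new.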

Linear Denoiser as a Near‑Parameter‑Free Model

The authors construct a linear denoiser that learns only the pairwise pixel correlations across the entire training set, without any built‑in locality assumptions. This model acts as a statistical estimator that captures the second‑order statistics of the data.
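A denoiser built only from second-order statistics can be sketched as the classical linear minimum-mean-squared-error (Wiener) estimator. Whether the paper uses exactly this form is an assumption here. The additive Gaussian noise model, the toy data, and the helper names `fit_linear_denoiser` and `denoise` are also illustrative.

```python
import numpy as np

def fit_linear_denoiser(train, sigma):
    """Return (W, mu) for the linear estimator x_hat = mu + W @ (y - mu),
    assuming y = x + sigma * standard Gaussian noise."""
    mu = train.mean(axis=0)
    centered = train - mu
    cov = centered.T @ centered / len(train)    # pairwise pixel covariances
    d = cov.shape[0]
    # W = cov @ inv(cov + sigma^2 I), computed with a solve instead of an inverse.
    W = np.linalg.solve(cov + sigma ** 2 * np.eye(d), cov).T
    return W, mu

def denoise(y, W, mu):
    """Apply the linear denoiser to one image or a batch of images (rows)."""
    return mu + (y - mu) @ W.T

rng = np.random.default_rng(0)
# Correlated toy "pixels": white noise passed through a random mixing matrix.
train = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))
W, mu = fit_linear_denoiser(train, sigma=0.5)

noisy = train + 0.5 * rng.normal(size=train.shape)
restored = denoise(noisy, W, mu)
print("noisy MSE:   ", np.mean((noisy - train) ** 2))
print("restored MSE:", np.mean((restored - train) ** 2))
```

Nothing in `W` is constrained to be local or convolutional: it is a dense matrix determined entirely by the data mean and covariance.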

Sensitivity Fields

To understand where a denoiser “looks” when reconstructing a central pixel, the study visualizes sensitivity fields: the gradient of one output pixel with respect to every input pixel.
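One way to estimate such a field for an arbitrary denoiser is by finite differences, as in the sketch below. The study may well use automatic differentiation instead. The 3x3 box blur is only a hypothetical stand-in for a real denoiser, and `sensitivity_field` is an illustrative name.

```python
import numpy as np

def sensitivity_field(denoiser, y, pixel, eps=1e-4):
    """Central-difference estimate of d denoiser(y)[pixel] / d y[j] for all j.
    `pixel` is a flat index into the output image."""
    flat = y.ravel().astype(float)
    grad = np.zeros_like(flat)
    for j in range(flat.size):
        step = np.zeros_like(flat)
        step[j] = eps
        hi = denoiser((flat + step).reshape(y.shape)).ravel()[pixel]
        lo = denoiser((flat - step).reshape(y.shape)).ravel()[pixel]
        grad[j] = (hi - lo) / (2 * eps)
    return grad.reshape(y.shape)

def box_blur(img):
    """Hypothetical stand-in denoiser: 3x3 mean filter with wrap-around edges."""
    out = np.zeros_like(img, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(img, (dy, dx), axis=(0, 1))
    return out / 9.0

y = np.random.default_rng(0).normal(size=(7, 7))
center = 3 * 7 + 3                      # flat index of pixel (3, 3)
field = sensitivity_field(box_blur, y, center)
print(np.round(field, 3))
```

For a linear denoiser the field does not depend on the input and is simply one row of the weight matrix. For a deep network it varies with the input and the noise level.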

Figure 1: Optimal denoiser vs. deep denoiser

Figure 1 shows the optimal denoiser merely copying nearest-neighbor patches, while a deep U-Net exhibits a compact, roughly circular sensitivity region, consistent with the conventional view that these models are strongly local.

Figure 2: Sensitivity of linear model

Figure 2 demonstrates that the linear model, despite lacking any locality prior, learns a sensitivity field almost identical to the deep U‑Net, indicating that locality can emerge from data statistics alone.
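A toy calculation illustrates how a local sensitivity field can fall out of the statistics alone. The sketch assumes a one-dimensional "image" whose pixel correlation decays geometrically with distance. It also uses the exact population covariance, not one estimated from samples, to keep the result deterministic. The values of `d`, `rho`, and `sigma` are hypothetical settings, not taken from the paper.

```python
import numpy as np

d, rho, sigma = 33, 0.9, 0.5            # hypothetical toy settings
idx = np.arange(d)
# Assumed population covariance: correlation decays with pixel distance.
cov = rho ** np.abs(idx[:, None] - idx[None, :])
# Dense linear MMSE (Wiener) denoiser for y = x + sigma * noise, zero-mean data.
W = cov @ np.linalg.inv(cov + sigma ** 2 * np.eye(d))

center = d // 2
row = W[center]                         # sensitivity of the center output pixel
near = np.abs(row[center - 3:center + 4]).sum()
print(f"share of sensitivity within 3 pixels: {near / np.abs(row).sum():.2f}")
```

`W` is an unconstrained dense matrix, yet most of the center pixel's sensitivity lands on its immediate neighbors because that is where the correlations are strongest.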

Figure 3: Sensitivity on CIFAR‑10

On CIFAR‑10 the sensitivity field retains structure that mirrors object shapes, further supporting the data‑driven emergence of locality.

From Statistics to Generation

The authors compare the analytical linear model with a fully trained state‑of‑the‑art DDPM. The linear model produces slightly blurrier images but preserves correct color, shape, and semantic structure.

Figure 4: Linear model vs. DDPM

This result suggests that a large portion of generative ability stems from learned second‑order pixel correlations rather than from complex architectural design.

Implications for Engineering and Research

When abundant high‑quality data are available, many handcrafted inductive biases may be unnecessary.

Simple linear denoisers can be used as diagnostic tools to probe dataset statistics before committing to deeper architectures.

Data curation, bias mitigation, and statistical analysis become the primary factors determining model performance.

Conclusion

The paper demonstrates that locality in diffusion models is a natural consequence of second‑order pixel statistics. Extending the analysis to higher‑order statistics (e.g., pixel triplets) could further improve simple models, indicating a promising direction for future research.

“We provide evidence that locality in deep diffusion models is driven by the statistical properties of image datasets, not by the inductive bias of convolutional networks.”

Paper: https://arxiv.org/abs/2509.09672

Source: DeepHub IMBA