ByteDance’s Diffusion Restoration Adapter Achieves State‑of‑the‑Art Real‑World Image Recovery
This paper introduces a lightweight Diffusion Restoration Adapter that plugs into pre‑trained diffusion priors such as StableDiffusion XL and StableDiffusion 3. It adds far fewer parameters than ControlNet and, combined with a novel sampling strategy, reports superior quantitative and visual results on real‑world image restoration benchmarks.
Abstract
Diffusion models provide strong generative priors for image restoration. Existing conditioning modules such as ControlNet duplicate large parts of the denoising network, inflating parameter counts. The proposed Restoration Adapter inserts a lightweight conditioning block directly into the pre‑trained diffusion backbone, enabling photo‑realistic restoration while keeping trainable parameters low. The method works with both U‑Net (StableDiffusion XL) and DiT (StableDiffusion 3) backbones and includes a simple sampling modification that balances fidelity and diversity.
Introduction
Image restoration seeks to reconstruct high‑quality images from degraded observations. Early approaches used CNNs or Transformers to learn a direct low‑to‑high mapping. Generative priors based on GANs (e.g., StyleGAN) improved realism but struggled with unaligned data. Large‑scale latent diffusion models (LDMs) combined with text conditioning have become the dominant priors because they model diverse data distributions effectively. Conditioning a pre‑trained diffusion model on a low‑quality (LQ) image requires an additional module. ControlNet achieves this, but at the cost of copying most of the original network, which leads to heavy memory and compute overhead. Since LQ images already contain rich semantic cues, a smaller, integrated module can replace ControlNet and still guide the diffusion prior efficiently.
Preliminaries
Diffusion Models
A diffusion model defines a forward stochastic process that gradually adds Gaussian noise to data and a reverse process that denoises step‑by‑step. The forward process is described by a stochastic differential equation (SDE); the reverse process can be implemented as an SDE or an ordinary differential equation (ODE) depending on the sampler.
Restoration Prior
A restoration prior is a pre‑trained generative model that generates a high‑quality image conditioned on an LQ input. This work adopts StableDiffusion XL (U‑Net backbone) and StableDiffusion 3 (MM‑DiT backbone), both trained on massive text‑image corpora. The output layer of each prior is zero‑initialized to guarantee training stability when fine‑tuning on a small high‑quality dataset.
Diffusion Restoration Adapter
Restoration Adapter
The adapter is placed after selected blocks of the denoising network. It receives three inputs: the LQ feature map, the timestep‑dependent denoising feature Xt, and the timestep embedding emb. In the U‑Net case, the first adapter encodes the LQ latent; subsequent adapters take the previous adapter’s output as the LQ feature. Processing steps:
LQ feature passes through a linear projection.
Embedding emb is linearly transformed and activated with SiLU to match the shape of Xt.
The projected LQ feature and transformed embedding are summed with Xt.
The sum flows through two residual adapter blocks.
A zero‑initialized linear layer produces the final residual, which is added back to Xt and forwarded to the next denoising block.
The DiT variant follows the same principle with minor architectural adjustments (e.g., different channel dimensions).
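The processing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the token‑by‑channel layout, and the internals of the residual adapter block (a two‑layer SiLU MLP with a skip connection) are hypothetical, since the text does not specify them.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

class Linear:
    """Plain affine layer; zero_init=True makes it output zeros at the start."""
    def __init__(self, d_in, d_out, rng, zero_init=False):
        scale = 0.0 if zero_init else 1.0 / np.sqrt(d_in)
        self.w = rng.standard_normal((d_in, d_out)) * scale
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return x @ self.w + self.b

class ResidualBlock:
    """Hypothetical residual adapter block (structure assumed, not from the paper)."""
    def __init__(self, dim, rng):
        self.fc1 = Linear(dim, dim, rng)
        self.fc2 = Linear(dim, dim, rng)

    def __call__(self, h):
        return h + self.fc2(silu(self.fc1(h)))

class RestorationAdapter:
    def __init__(self, lq_dim, emb_dim, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.lq_proj = Linear(lq_dim, dim, rng)    # step 1: project the LQ feature
        self.emb_proj = Linear(emb_dim, dim, rng)  # step 2: transform the embedding
        self.blocks = [ResidualBlock(dim, rng), ResidualBlock(dim, rng)]
        self.out = Linear(dim, dim, rng, zero_init=True)  # step 5: zero-initialized

    def __call__(self, lq_feat, x_t, emb):
        # steps 2-3: linear then SiLU on emb, summed with the projected LQ feature and Xt
        h = self.lq_proj(lq_feat) + silu(self.emb_proj(emb)) + x_t
        # step 4: two residual adapter blocks
        for block in self.blocks:
            h = block(h)
        # step 5: residual added back to Xt
        return x_t + self.out(h)

rng = np.random.default_rng(1)
x_t = rng.standard_normal((10, 16))   # 10 tokens, 16 channels (illustrative sizes)
lq_feat = rng.standard_normal((10, 4))
emb = rng.standard_normal(8)
adapter = RestorationAdapter(lq_dim=4, emb_dim=8, dim=16)
out = adapter(lq_feat, x_t, emb)
```

Because the output layer starts at zero, the adapter initially returns Xt unchanged, so the frozen prior behaves exactly as before when training begins.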
Diffusion Adapter
To fine‑tune the massive diffusion prior on a modest high‑quality dataset while keeping most weights frozen, low‑rank adaptation (LoRA) matrices are inserted into the self‑attention layers. These LoRA modules constitute the Diffusion Adapter and are trained jointly with the Restoration Adapter, providing efficient parameter updates without altering the bulk of the pre‑trained model.
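A minimal NumPy sketch of a LoRA‑augmented linear projection illustrates why this is cheap. The rank, the scaling factor alpha, and the class name are illustrative assumptions; the text does not state which values or which attention projections the authors use.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (A @ B)."""
    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                                # frozen pre-trained weight
        self.A = rng.standard_normal((d_in, rank)) * 0.01   # trainable down-projection
        self.B = np.zeros((rank, d_out))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight + self.scale * (x @ self.A) @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))     # stands in for one attention projection
layer = LoRALinear(W, rank=4)
x = rng.standard_normal((10, 64))
y = layer(x)
```

With these illustrative sizes the layer trains 4 × (64 + 64) = 512 values instead of the 4,096 in the frozen weight, and because B starts at zero the output initially matches the pre‑trained projection.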
Training Data and Objectives
Paired training data consist of high‑quality images and their synthetically degraded counterparts generated by a RealESRGAN‑style pipeline (blur, down‑sampling, compression, etc.). Because the priors are text‑image models, short captions are extracted for each image using a multimodal language model; when no caption is available, an empty string is used. The LQ image serves as the visual condition, while the degradation description (e.g., “low‑resolution”, “compressed”) is supplied as the textual prompt.
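To make the pairing concrete, the following NumPy sketch produces a degraded counterpart for a grayscale image. It is a deliberately simplified stand‑in rather than the RealESRGAN pipeline itself: the box blur, strided down‑sampling, Gaussian noise, and coarse quantization (a crude proxy for compression artifacts), along with every parameter value, are assumptions chosen for illustration.

```python
import numpy as np

def degrade(hq, scale=4, blur_size=3, noise_std=0.02, levels=32, seed=0):
    """hq: 2-D float array in [0, 1]. Returns a smaller, degraded copy.

    blur_size is assumed to be odd.
    """
    rng = np.random.default_rng(seed)
    height, width = hq.shape

    # 1) Box blur (stand-in for the Gaussian-style kernels of a real pipeline).
    pad = blur_size // 2
    padded = np.pad(hq, pad, mode="edge")
    blurred = np.zeros((height, width), dtype=float)
    for dy in range(blur_size):
        for dx in range(blur_size):
            blurred += padded[dy:dy + height, dx:dx + width]
    blurred /= blur_size ** 2

    # 2) Down-sample by keeping every `scale`-th pixel.
    lq = blurred[::scale, ::scale]

    # 3) Additive Gaussian noise.
    lq = lq + rng.standard_normal(lq.shape) * noise_std

    # 4) Coarse quantization as a rough proxy for compression.
    lq = np.round(np.clip(lq, 0.0, 1.0) * (levels - 1)) / (levels - 1)
    return lq

hq = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
lq = degrade(hq)   # the (hq, lq) pair would form one training example
```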
Training objectives follow the original diffusion losses: DDPM loss for StableDiffusion XL and conditional flow‑matching loss for StableDiffusion 3. All pre‑trained diffusion weights remain frozen; only the Restoration and Diffusion adapters are updated.
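The two objectives can be sketched as follows. This assumes the standard noise‑prediction form of the DDPM loss and the straight‑line (rectified‑flow) path commonly used for flow matching; the function names are hypothetical, and the conditioning on the LQ image and text prompt is omitted for brevity.

```python
import numpy as np

def ddpm_loss(x0, eps, alpha_bar_t, predict_eps):
    """Noise-prediction loss: the network should recover eps from the noisy x_t."""
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.mean((predict_eps(x_t) - eps) ** 2)

def flow_matching_loss(x0, eps, t, predict_velocity):
    """Flow-matching loss on a straight path from data (t=0) to noise (t=1)."""
    x_t = (1.0 - t) * x0 + t * eps
    target = eps - x0   # constant velocity along the straight path
    return np.mean((predict_velocity(x_t) - target) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))    # clean latents (illustrative shape)
eps = rng.standard_normal((4, 8))   # Gaussian noise

# A predictor that always outputs zeros stands in for an untrained network.
zero_net = lambda x_t: np.zeros_like(x_t)
loss_sdxl = ddpm_loss(x0, eps, alpha_bar_t=0.7, predict_eps=zero_net)
loss_sd3 = flow_matching_loss(x0, eps, t=0.3, predict_velocity=zero_net)
```

In the actual setup the predictor would be the frozen prior plus the two adapters, and gradients would flow only into the adapter parameters.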
Restoration Sampling Strategy
Standard sampling schedulers prioritize diversity and overall quality, which can reduce fidelity in restoration tasks. The proposed plug‑and‑play strategy modifies the denoising direction at each timestep, similar to classifier guidance and SUPIR. At timestep t, it computes the unnormalized direction from the current denoised latent z'_t toward the LQ latent c_{lq} and shifts z'_t by a time‑dependent factor:
z_t = z'_t + w·g(t, T)·(c_{lq} - z'_t)
where T is the total number of sampling steps, w is a weighting hyper‑parameter, and g is a piecewise‑linear mapping that assigns larger weights to early timesteps and smaller weights later. The mapping is defined as:
g(t, T; a) = (t/T - a) / (1 - a) if t/T > a, else 0
where a controls the breakpoint. Since sampling runs from t = T down to t = 0, g equals 1 at the first step, falls linearly to 0 at t = aT, and stays at 0 afterwards. This schedule forces the sampler to stay close to the LQ observation in early denoising stages (high fidelity) while allowing more freedom for quality and diversity in later stages.
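The shift can be sketched in a few lines of plain Python. The sketch assumes the mapping is the continuous ramp g = (t/T - a) / (1 - a) for t/T > a and 0 otherwise, treats latents as flat lists of floats, and uses hypothetical function names and parameter values.

```python
def g(t, T, a):
    """Piecewise-linear weight: 1 at t = T, falling to 0 at t = a * T, then 0."""
    ratio = t / T
    return (ratio - a) / (1.0 - a) if ratio > a else 0.0

def restoration_shift(z_prime, c_lq, t, T, w, a):
    """Move the denoised latent z'_t toward the LQ latent c_lq by w * g(t, T; a)."""
    k = w * g(t, T, a)
    return [z + k * (c - z) for z, c in zip(z_prime, c_lq)]

T, w, a = 10, 0.5, 0.5          # illustrative values, not taken from the paper
z_prime = [0.0, 2.0]            # current denoised latent
c_lq = [1.0, 0.0]               # LQ latent
early = restoration_shift(z_prime, c_lq, t=10, T=T, w=w, a=a)  # strong pull
late = restoration_shift(z_prime, c_lq, t=3, T=T, w=w, a=a)    # no pull
```

At t = T the latent moves half of the way toward c_{lq} (because w = 0.5 here), while for any t ≤ aT it is left untouched.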
Experiments
Quantitative Comparison
Table 2 (from the original paper) reports results on three cropping settings. For centrally‑cropped images, the StableDiffusion 3‑based method (SD‑3) achieves the highest scores on all perceptual metrics (e.g., ClipIQA 0.6868, MUSIQ 70.56). StableDiffusion XL (SD‑XL) ranks second on ClipIQA (0.6517) and MUSIQ (71.92). On 8× down‑sampled images, SD‑3 again leads (ClipIQA 0.6659, MUSIQ 72.2) with SD‑XL close behind on MUSIQ. In random‑crop experiments, SD‑3 tops ClipIQA (0.6868) and MUSIQ (70.56); SD‑XL remains competitive on ManIQA (0.5358) and LPIPS (0.334) and is second on MUSIQ (70.49). PSNR/SSIM are lower for diffusion‑based methods, but visual fidelity is superior.
Parameter counts are 157 M for SD‑XL and 80 M for SD‑3, whereas ControlNet roughly doubles the original network size. Methods that feed LQ conditions through ControlNet (DiffBIR, SeeSR, SUPIR) degrade when the prior grows (e.g., 2 B SD‑3), while the proposed adapters maintain performance with only 80 M trainable parameters.
Qualitative Comparison
Visual results on RealPhoto60 and DIV2K (1024×1024) show that the adapter‑based method preserves sharp edges and fine textures better than state‑of‑the‑art baselines SUPIR and SeeSR. Specific failures of the baselines include: SUPIR generates incorrect stone texture and blurs cat fur; SeeSR loses facial details and produces overall blur. The proposed method consistently renders detailed hat and clothing textures. Although PSNR/SSIM are modest, the perceptual quality demonstrates that these metrics do not fully capture restoration fidelity. Additional random‑crop experiments on DIV2K confirm that competing methods sometimes produce blurry or erroneous textures, whereas the adapter approach reliably yields high‑detail outputs.
Conclusion
The Diffusion Restoration Adapter framework combines two lightweight modules to exploit large pre‑trained diffusion priors for real‑world image restoration: Restoration Adapters inserted into the denoising backbone, and LoRA‑based Diffusion Adapters for efficient fine‑tuning. The design works with both U‑Net and DiT backbones, supports a simple fidelity‑aware sampling modification, and achieves competitive or superior perceptual metrics with far fewer trainable parameters than ControlNet‑based pipelines.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
