Artificial Intelligence 8 min read

XPSR: Cross‑modal Priors for Diffusion‑based Image Super‑Resolution

The paper introduces XPSR, a diffusion‑based image super‑resolution method that incorporates cross‑modal semantic priors from a multimodal large language model, achieving state‑of‑the‑art performance on both reference and no‑reference quality metrics across synthetic and real‑world datasets.

Kuaishou Tech

At ECCV 2024, Kuaishou's Audio‑Video Technology team and Tsinghua University presented XPSR, a diffusion‑based image super‑resolution method that leverages cross‑modal priors generated by a multimodal large language model (MLLM).

Image and video restoration are increasingly important. Earlier GAN‑based methods struggle to reproduce fine texture and to deliver good subjective quality, whereas diffusion models have shown impressive generative capabilities.

The XPSR framework consists of two stages: (1) a multimodal LLM produces semantic descriptions of the low‑resolution image; (2) the low‑resolution image and the semantic information are fed into a diffusion UNet, where a novel Semantic‑Fusion Attention (SFA) merges parallel cross‑attention streams to balance object and quality cues.
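The second stage can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single‑head attention, the per‑token softmax gate used for fusion, the residual connection, and all names (`semantic_fusion_attention`, `random_params`, the tensor sizes) are assumptions chosen only to show two parallel cross‑attention streams, one over object‑level prompt tokens and one over quality‑level prompt tokens, being merged into one feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, context, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens (x) attend to prompt tokens (context)."""
    q = x @ w_q                      # (n_img, d)
    k = context @ w_k                # (n_txt, d)
    v = context @ w_v                # (n_txt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def semantic_fusion_attention(x, c_h, c_l, params_h, params_l):
    """Two parallel cross-attention streams merged by a per-token gate.

    ASSUMPTION: the gate below is a stand-in for the paper's fusion step,
    whose exact form is not given in this article.
    """
    a_h = cross_attention(x, c_h, *params_h)   # object (high-level) stream
    a_l = cross_attention(x, c_l, *params_l)   # quality (low-level) stream
    d = x.shape[-1]
    # Each image token scores both streams, then softmax turns scores into weights.
    logits = np.stack([(x * a_h).sum(-1), (x * a_l).sum(-1)], axis=-1) / np.sqrt(d)
    gate = softmax(logits, axis=-1)            # (n_img, 2), rows sum to 1
    fused = gate[:, :1] * a_h + gate[:, 1:] * a_l
    return x + fused, gate                     # residual connection (assumed)

def random_params(rng, d_img, d_txt):
    """Hypothetical helper: random projection matrices (w_q, w_k, w_v)."""
    return (rng.normal(size=(d_img, d_img)) * 0.1,
            rng.normal(size=(d_txt, d_img)) * 0.1,
            rng.normal(size=(d_txt, d_img)) * 0.1)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))     # 16 image tokens, 32 channels
c_h = rng.normal(size=(7, 24))    # 7 object-description tokens
c_l = rng.normal(size=(5, 24))    # 5 quality-description tokens
out, gate = semantic_fusion_attention(
    x, c_h, c_l, random_params(rng, 32, 24), random_params(rng, 32, 24))
print(out.shape, gate.shape)
```

The gate makes the balance between object and quality cues explicit per image token; a learned fusion layer would serve the same purpose in a trained model.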

The paper details semantic description generation, semantic‑information fusion, the degradation‑free constraint, and the optimization objective, using notation such as x_lr (the low‑resolution input), z_hr^t (the noisy high‑resolution latent at step t), and c_h and c_l (the high‑level and low‑level semantic prompts).

During training, a degradation‑free constraint aligns LR and HR features at multiple scales, and classifier‑free guidance with negative prompts (e.g., “blurry, dotted, noise, unclear, low‑res, over‑smoothed”) improves visual fidelity.
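Both ideas in this paragraph can be sketched in a few lines of NumPy. The sketch rests on assumptions: the L1 distance in `multiscale_alignment_loss` is a guess at the distance function, and `guided_noise` uses the common formulation of classifier‑free guidance in which the negative‑prompt prediction replaces the unconditional one. All function and parameter names are hypothetical.

```python
import numpy as np

def multiscale_alignment_loss(lr_feats, hr_feats):
    """Average feature distance between LR-branch and HR-branch features.

    lr_feats / hr_feats: lists of arrays, one per scale, with matching shapes.
    ASSUMPTION: mean absolute error per scale, then a plain mean over scales.
    """
    if len(lr_feats) != len(hr_feats):
        raise ValueError("need the same number of scales for LR and HR features")
    per_scale = [np.abs(f_lr - f_hr).mean() for f_lr, f_hr in zip(lr_feats, hr_feats)]
    return float(np.mean(per_scale))

def guided_noise(eps_pos, eps_neg, scale):
    """Classifier-free guidance with a negative prompt.

    eps_pos: noise predicted with the positive (semantic) prompt.
    eps_neg: noise predicted with the negative prompt, e.g. "blurry, noise, low-res".
    scale:   guidance scale; 1.0 returns eps_pos, larger values push the
             prediction further away from the negative prompt.
    """
    return eps_neg + scale * (eps_pos - eps_neg)

# Toy usage with random stand-ins for network outputs.
rng = np.random.default_rng(0)
hr = [rng.normal(size=(8, 8, 4)), rng.normal(size=(4, 4, 8))]
lr = [f + 0.1 for f in hr]                  # LR features offset by a constant 0.1
loss = multiscale_alignment_loss(lr, hr)    # close to 0.1 for this toy input

eps_pos = rng.normal(size=(4, 4))
eps_neg = rng.normal(size=(4, 4))
eps = guided_noise(eps_pos, eps_neg, scale=5.0)
```

At sampling time only `guided_noise` is needed; the alignment loss is a training‑time term that would be added to the usual denoising loss.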

Extensive experiments on synthetic and real‑world datasets show that XPSR outperforms existing GAN‑based and diffusion‑based baselines on both reference metrics (PSNR, SSIM, LPIPS, DISTS, FID) and no‑reference metrics (MANIQA, CLIPIQA, MUSIQ), as illustrated in Tables 1 and 2 and the accompanying visual comparisons.
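For readers unfamiliar with the reference metrics, PSNR is the simplest of them and can be computed directly from its standard definition. The sketch below is a generic illustration, not the evaluation code behind the reported tables; the function name and the 8‑bit default for `max_value` are assumptions.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a ground-truth image and a restored one."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher PSNR means lower pixel‑wise error. Pixel‑wise error does not always track perceived quality, which is why perceptual and no‑reference metrics are reported alongside it.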

The authors conclude that XPSR achieves state‑of‑the‑art performance and will continue to support Kuaishou’s video enhancement pipeline, with future work aimed at broader applications.

diffusion model, video processing, AI research, ECCV 2024, Image Super-Resolution, cross‑modal priors