XPSR: Cross‑modal Priors for Diffusion‑based Image Super‑Resolution

The paper introduces XPSR, a diffusion‑based image super‑resolution method that incorporates cross‑modal semantic priors from a multimodal large language model. It reports state‑of‑the‑art performance on both reference and no‑reference quality metrics across synthetic and real‑world datasets.

Kuaishou Tech

At ECCV 2024, Kuaishou Audio‑Video Technology and Tsinghua University presented XPSR, a diffusion‑based image super‑resolution method that leverages cross‑modal priors generated by a large multimodal language model.

Video and image restoration are increasingly important. Earlier GAN‑based methods struggle to recover fine texture and often fall short on subjective quality, whereas diffusion models have shown strong generative capabilities.

The XPSR framework consists of two stages: (1) a multimodal LLM produces semantic descriptions of the low‑resolution image; (2) the low‑resolution image and the semantic information are fed into a diffusion UNet, where a novel Semantic‑Fusion Attention (SFA) merges parallel cross‑attention streams to balance object and quality cues.
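The fusion step in stage (2) can be pictured with a small, dependency‑free sketch. It runs two cross‑attention streams in parallel from the same image tokens, one over object‑level text tokens and one over quality‑level text tokens, and then blends them. The names `cross_attention`, `fuse_streams`, and the blending weight `alpha` are hypothetical, and the weighted sum is an assumed stand‑in for the fusion operator, which this summary does not specify; it is not the authors' implementation of SFA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def fuse_streams(image_tokens, object_tokens, quality_tokens, alpha=0.5):
    """Run two parallel cross-attention streams and blend them.

    ASSUMPTION: the blend is a plain weighted sum controlled by `alpha`.
    The actual SFA fusion operator is not described in this summary.
    """
    object_stream = cross_attention(image_tokens, object_tokens, object_tokens)
    quality_stream = cross_attention(image_tokens, quality_tokens, quality_tokens)
    return [
        [alpha * o + (1.0 - alpha) * q for o, q in zip(obj_row, qual_row)]
        for obj_row, qual_row in zip(object_stream, quality_stream)
    ]
```

With `alpha=1.0` only the object stream contributes, and with `alpha=0.0` only the quality stream does, which makes the "balance" between the two kinds of cue explicit in this toy version.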

The method is detailed in four parts: semantic description generation, semantic‑information fusion, the degradation‑free constraint, and the optimization objective. Its formulation uses notation such as x_{\textit{lr}}, z_{\textit{hr}}^t, c_h, and c_l.

During training, a degradation‑free constraint aligns LR and HR features at multiple scales. At sampling time, classifier‑free guidance with negative prompts (e.g., “blurry, dotted, noise, unclear, low‑res, over‑smoothed”) improves visual fidelity.
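Both ideas in this paragraph can be sketched in a few lines of plain Python. The first function sums a per‑scale distance between LR and HR feature lists; the choice of mean squared error is an assumption, since this summary does not state which distance the paper uses. The second function applies the standard classifier‑free guidance combination, with the negative‑prompt prediction taking the place of the unconditional one. The function names and the flat‑list feature format are hypothetical.

```python
def multiscale_alignment_loss(lr_features, hr_features):
    """Sum of per-scale mean squared errors between LR and HR features.

    ASSUMPTION: MSE is used as the distance; the paper's exact choice
    is not given in this summary. Each argument is a list with one flat
    feature list per scale.
    """
    total = 0.0
    for lr_scale, hr_scale in zip(lr_features, hr_features):
        squared = [(a - b) ** 2 for a, b in zip(lr_scale, hr_scale)]
        total += sum(squared) / len(squared)
    return total


def guided_noise(eps_positive, eps_negative, guidance_scale):
    """Classifier-free guidance with a negative prompt.

    Starts from the negative-prompt prediction and extrapolates toward
    the positive-prompt prediction:
        eps = eps_neg + s * (eps_pos - eps_neg)
    """
    return [
        neg + guidance_scale * (pos - neg)
        for pos, neg in zip(eps_positive, eps_negative)
    ]
```

A guidance scale of 1.0 returns the positive‑prompt prediction unchanged; larger values push the sample further away from whatever the negative prompt describes.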

Extensive experiments on synthetic and real‑world datasets show that XPSR outperforms existing GAN‑based and diffusion‑based baselines on both reference metrics (PSNR, SSIM, LPIPS, DISTS, FID) and no‑reference metrics (MANIQA, CLIPIQA, MUSIQ), as illustrated in Tables 1 and 2 and the accompanying visual comparisons.
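To make the idea of a reference metric concrete, here is a minimal sketch of PSNR, the simplest metric in the list above. It compares a restored image against its ground truth pixel by pixel, which is exactly what no‑reference metrics such as MUSIQ cannot do. Images are assumed to be flat lists of pixel values, and the function name and `max_value` parameter are illustrative only; this is not the evaluation code used in the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).

    `reference` and `restored` are flat lists of pixel values of equal
    length; `max_value` is the largest possible pixel value.
    """
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means lower pixel‑wise error, but it does not always track perceived quality, which is one reason to report perceptual and no‑reference metrics alongside it.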

The authors conclude that XPSR achieves state‑of‑the‑art performance and will continue to support Kuaishou’s video enhancement pipeline, with future work aimed at broader applications.
