iDetex: The Winning AI Model Transforming Image Quality Assessment
iDetex, the champion solution of the ICCV 2025 MIPI Detailed Image Quality Assessment Challenge, introduces a novel multimodal LLM-driven framework that precisely locates, describes, and grades image distortions, outperforming traditional IQA models and enabling practical deployments across video, live streaming, e‑commerce, and image‑processing pipelines.
Introduction
In the ICCV 2025 MIPI Detailed Image Quality Assessment Challenge, the IH‑VQA team from WeChat Test Center won the championship with their novel iDetex solution, setting a new industry standard for fine‑grained image quality evaluation and driving practical deployments in video, short‑video, live streaming, and e‑commerce services.
Task Background
Image Quality Assessment (IQA) seeks to build models that reflect human visual system perception. Traditional IQA models provide only a single overall score, lacking interpretability and fine‑grained analysis of distortion types, locations, and their impact on visual perception.
To advance IQA toward explainable intelligence, the Douyin Multimedia Quality Lab and the Basic Experience Algorithm team co‑organized a Detailed Image Quality Assessment track at the fourth ICCV MIPI Workshop, encouraging the use of multimodal large language models (MLLMs) for precise distortion localization, multi‑dimensional perception, and causal reasoning.
Dataset and Competition
The competition used the ViDA‑UGC dataset, which consists of two parts: metadata (11,058 images with overall quality grades, resolution, and detailed distortion annotations) and instruction‑fine‑tuning data (~534 K entries) covering three dimensions: Description, Perception, and Grounding.
Perception: 2,567 multiple-choice questions, evaluated by Perception Accuracy.
Grounding: two sub-tasks (distortion bounding-box detection and region-wise distortion identification), evaluated by mAP.
Description: answers follow a four-step format (brief description; distortion localization and impact analysis; overall quality analysis; final quality grade).
iDetex Architecture
The iDetex pipeline first extracts visual tokens with a visual encoder, then feeds these tokens together with a system prompt into a large language model. Guided by the prompt, the LLM performs a chain‑of‑thought reasoning process: (1) brief image description, (2) distortion localization and detailed analysis, (3) identification of key distortions affecting overall perception, and (4) generation of an overall quality rating. The detected distortions are visualized on the original image for user inspection.
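The four-step reasoning chain can be encoded directly in the system prompt. A minimal sketch follows; the prompt wording, function names, and chat-message format are illustrative assumptions, not the team's actual implementation:

```python
# Sketch of a four-step chain-of-thought system prompt for an IQA MLLM.
# The wording below is illustrative; the actual iDetex prompt is not public.
IQA_SYSTEM_PROMPT = (
    "You are an image quality assessment assistant. Answer in four steps:\n"
    "1. Briefly describe the image content.\n"
    "2. Locate each distortion (bounding box) and analyze its impact.\n"
    "3. Identify the key distortions that dominate overall perception.\n"
    "4. Give a final overall quality grade.\n"
)

def build_messages(image_placeholder: str) -> list[dict]:
    """Assemble a chat-style request: system prompt, then image tokens plus query."""
    return [
        {"role": "system", "content": IQA_SYSTEM_PROMPT},
        {"role": "user",
         "content": f"{image_placeholder}\nAssess this image's quality."},
    ]
```

The visual tokens produced by the encoder would be substituted for the image placeholder before the request reaches the LLM.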
Grounding Enhancement – Spatial Perturbation
To improve robustness in distortion localization, random cropping and horizontal flipping are applied. Bounding‑box coordinates undergo corresponding affine transformations to keep annotations valid, enriching spatial diversity and encouraging the model to focus on intrinsic distortion patterns rather than absolute positions.
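A minimal sketch of the corresponding box transform, assuming axis-aligned (x1, y1, x2, y2) boxes in pixel coordinates (the function name and drop policy are assumptions):

```python
def transform_box(box, crop, flip):
    """Map an (x1, y1, x2, y2) box through a crop followed by an optional
    horizontal flip; returns None if the box falls entirely outside the crop."""
    bx1, by1, bx2, by2 = box
    cx1, cy1, cx2, cy2 = crop          # crop region in original-image coords
    # Intersect with the crop, then shift to crop-local coordinates.
    x1 = max(bx1, cx1) - cx1
    y1 = max(by1, cy1) - cy1
    x2 = min(bx2, cx2) - cx1
    y2 = min(by2, cy2) - cy1
    if x2 <= x1 or y2 <= y1:
        return None                    # annotation no longer visible: drop it
    if flip:
        w = cx2 - cx1                  # crop width
        x1, x2 = w - x2, w - x1        # mirror around the vertical axis
    return (x1, y1, x2, y2)
```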
Perception Enhancement – Query Style Alignment
The perception task is multiple‑choice. By analyzing the style of test‑set questions and generating training questions with matching style using metadata, the model’s query distribution aligns with the test distribution, reducing confusion caused by wording differences and improving accuracy.
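In practice this can amount to re-rendering metadata annotations through templates that mimic the test set's wording. A hypothetical sketch, where the template text and metadata fields are assumptions (a real pipeline would also shuffle the answer positions; they are kept fixed here for clarity):

```python
import string

# Hypothetical template written to match the test set's question style.
MCQ_TEMPLATE = (
    "What is the main distortion in the {region} region of this image?\n{options}"
)

def make_mcq(region: str, correct: str, distractors: list[str]) -> tuple[str, str]:
    """Render one style-aligned multiple-choice question from metadata;
    returns (question_text, answer_letter). Choices are not shuffled here."""
    choices = [correct] + distractors
    lines = [f"{letter}. {c}" for letter, c in zip(string.ascii_uppercase, choices)]
    return MCQ_TEMPLATE.format(region=region, options="\n".join(lines)), "A"
```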
Description Enhancement – Fine‑Grained Scoring
The original description task combined distortion localization, key‑distortion identification, and overall quality assessment, causing task interference. We decoupled the overall quality evaluation by reusing the Perception prompt, while keeping the original prompts for distortion questions. Additionally, we refined the quality label granularity from a 5‑level scale (bad, poor, fair, good, excellent) to a 10‑level scale (a‑j) using a linear mapping, then mapped predictions back to the original 5‑level scale for compatibility.
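A sketch of the two mappings, under the assumption that a continuous mean opinion score (MOS) in [1, 5] is available for each training image; the exact bin edges are an assumption:

```python
FIVE_LEVELS = ["bad", "poor", "fair", "good", "excellent"]
TEN_LEVELS = list("abcdefghij")  # 'a' = worst, 'j' = best

def mos_to_ten(mos: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Linearly bin a continuous MOS into one of ten grades a-j."""
    idx = int((mos - lo) / (hi - lo) * 10)
    return TEN_LEVELS[min(max(idx, 0), 9)]

def ten_to_five(grade: str) -> str:
    """Collapse a predicted 10-level grade back to the original 5-level scale:
    each 5-level grade covers exactly two adjacent 10-level grades."""
    return FIVE_LEVELS[TEN_LEVELS.index(grade) // 2]
```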
Data Mixing & Global Augmentation
Rather than training separate models for each sub-task, we performed joint multi-task fine-tuning. Spatial-perturbation data replaced 15-45% of the original grounding data, query-style-aligned data fully replaced the original perception data, and fine-grained description data replaced the original description data. This mixed dataset, combined with a strong visual encoder (e.g., InternVL3) and higher-resolution inputs (up to 2048×2048), yielded superior performance across all metrics.
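The mixing step itself is straightforward. A sketch under the assumption that samples are interchangeable list entries, with a perturbed counterpart prepared for each original grounding sample (names and the exact sampling scheme are illustrative):

```python
import random

def mix_grounding(original: list, perturbed: list, frac: float, seed: int = 0) -> list:
    """Replace a random `frac` of grounding samples with their spatially
    perturbed counterparts (the write-up's range is roughly 0.15-0.45)."""
    assert len(original) == len(perturbed)
    rng = random.Random(seed)                       # reproducible selection
    n = round(len(original) * frac)
    replaced = set(rng.sample(range(len(original)), n))
    return [perturbed[i] if i in replaced else original[i]
            for i in range(len(original))]
```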
Business Deployment
Compared with traditional IQA models that output a single score and brief description, iDetex provides diagnostic reports: precise distortion types, localized bounding boxes, impact analysis, and a comprehensive quality rating. This multi‑dimensional insight transforms scoring into actionable guidance.
Applications include:
Content creation (images, short video, live streaming): automatic feedback such as "face region blurred" or "dark segment noisy" helps creators improve cover images and video quality.
Quality loss analysis: pinpoint where quality degradation occurs in the pipeline (capture, transcoding, transmission) and quantify issues like edge-sharpness loss.
E-commerce: real-time inspection of product images to detect issues (blurred edges, low brightness) and guide merchants to correct them before upload.
Results and Awards
The IH‑VQA team achieved first place, leading in Perception Accuracy (+4%), Region mAP (+4%), Distortion mAP (+6%), and Image Quality Accuracy (+2%). Their solution has been accepted as a paper at the ICCV 2025 Workshop.
Acknowledgements
Team members: Sun Jianhui, Shao Tao, Yue Xinli, Xie Yuhui, Zhao Zhaoran.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.