How Bilibili’s Multimodal Team Won 2nd Place at ICCV MIPI with a Novel SFT+GRPO Strategy

This article details how Bilibili’s multimedia lab leveraged a multimodal training pipeline combining data‑compressed SFT and the GRPO reinforcement‑learning algorithm to achieve a 13.5% metric boost and secure second place in the ICCV MIPI Detailed Image Quality Assessment competition.


Introduction

Over the summer, Bilibili’s multimedia lab led a team in the Detailed Image Quality Assessment track of the ICCV MIPI (Mobile Intelligent Photography and Imaging) Workshop, introducing an innovative multimodal training strategy that raised the overall metric by 13.5% and earned second place.

Background

Since the launch of BILIVQA 2.0 in autumn 2023, the lab has been improving the synergy between video quality assessment (VQA) and video image processing, aiming for a full‑chain system of pre‑analysis, automated processing, and result verification. Real‑world usage revealed diverse distortion types and locations, making a single MOS (mean opinion score) insufficient for guiding processing. Consequently, the lab began exploring multimodal large language models (MLLMs) for fine‑grained semantic and low‑level quality analysis.

State‑of‑the‑art open‑source multimodal models such as Qwen‑VL and InternVL offer flexible image‑size handling and temporal position encoding for video, but see little quality‑assessment data during pre‑training. Existing quality datasets provide only MOS labels without detailed distortion annotations, and manual labeling is costly. An unsupervised or weakly supervised method that could generate chain‑of‑thought (CoT) explanations while fitting MOS would therefore be highly valuable, which led the team to investigate the GRPO algorithm.

GRPO Algorithm Overview

GRPO (Group Relative Policy Optimization) is a simplified variant of PPO proposed by DeepSeek for reinforcement learning under constrained resources. For each prompt it samples a group of G responses at high temperature, scores them with a rule‑based reward function rather than a learned reward model, and computes each response’s advantage relative to the rest of its group, eliminating the critic that PPO requires. Although rule‑based rewards can be sparse, the approach remains effective for many semantic tasks.
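
Concretely, the group‑relative advantage is just a per‑group normalization of rewards. A minimal sketch in Python, with made‑up reward values for illustration:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each response's reward against
    the mean and standard deviation of its own group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# G = 4 responses sampled at high temperature for one prompt, each scored
# by a rule-based reward function (e.g., answer match plus format check).
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
# -> [ 0.90, -1.51, -0.30,  0.90] (approximately)
```

Because the baseline is the group’s own mean, no learned value function (critic) is needed to estimate it.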

Applying GRPO to MLLMs, the team aimed to have the model generate CoT explanations for video‑quality analysis. Initial experiments with small models (under 7B parameters) showed reasonable CoT emergence but limited gains on quantitative metrics such as model‑selection accuracy.

Competition Introduction

The ICCV MIPI workshop offered three challenges; the Detailed Image Quality Assessment track required an MLLM to understand image content, predict MOS, identify distortion types, and locate distortion regions. With only one month to prepare, the team built on its proven SFT+GRPO pipeline and added a “data compression + hard‑sample mining” approach.

Data‑Compressed SFT

The provided training set contained over 10,000 MOS‑annotated images, each enriched with GPT‑generated semantic, distortion type, and location descriptions, expanding to 550,000 prompts across three categories (description, localization, perception). To reduce training time, the team merged multiple Q‑A pairs per image into a single prompt, shrinking the dataset to ~110,000 entries and cutting training from a week to two days without significant performance loss.
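
As a sketch of what the merging might look like, assuming each raw record holds an image path plus one Q‑A pair (the field names and prompt layout here are illustrative, not the competition’s actual schema):

```python
from collections import defaultdict

def compress_prompts(records):
    """Merge all Q-A pairs that share one image into a single
    multi-question training sample."""
    by_image = defaultdict(list)
    for rec in records:  # rec = {"image": ..., "question": ..., "answer": ...}
        by_image[rec["image"]].append((rec["question"], rec["answer"]))

    samples = []
    for image, pairs in by_image.items():
        prompt = "\n".join(f"Q{i}: {q}" for i, (q, _) in enumerate(pairs, 1))
        answer = "\n".join(f"A{i}: {a}" for i, (_, a) in enumerate(pairs, 1))
        samples.append({"image": image, "prompt": prompt, "answer": answer})
    return samples
```

Answering several questions about one image in a single pass is what shrinks 550,000 prompts to roughly 110,000 samples and, with them, the training time.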

SFT served to quickly adapt the pretrained model to the domain and mitigate over‑fitting before GRPO fine‑tuning.

Hard‑Sample Mining with GRPO

GRPO relies on output diversity; as the policy converges, variance drops and “entropy collapse” can occur. To counter this, the team performed a forward pass with the SFT model and selected moderately difficult samples, those it answered neither all correctly nor all incorrectly, to feed into GRPO. For perception tasks, they aggregated partially correct prompts and scaled the reward by the number of correct sub‑answers, keeping the reward signal dense enough to stave off entropy collapse.
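
A rough sketch of both steps, under stated assumptions: `policy.generate` is a hypothetical wrapper around sampled decoding, `is_correct` is a toy checker, and the accuracy window defining “moderately difficult” is illustrative rather than the team’s actual threshold.

```python
def is_correct(output: str, answer: str) -> bool:
    """Toy correctness check; a real one would parse the task output."""
    return output.strip() == answer.strip()

def mine_hard_samples(dataset, policy, n_rollouts=8, low=0.125, high=0.875):
    """Keep prompts the SFT model gets partly right: sample several
    generations per prompt and retain those answered neither all
    correctly nor all incorrectly."""
    hard = []
    for sample in dataset:
        outs = [policy.generate(sample["prompt"], temperature=1.0)
                for _ in range(n_rollouts)]
        acc = sum(is_correct(o, sample["answer"]) for o in outs) / n_rollouts
        if low <= acc <= high:
            hard.append(sample)
    return hard

def graded_perception_reward(preds, golds):
    """Partial-credit reward for an aggregated perception prompt: scales
    with how many sub-questions were answered correctly, keeping the
    signal dense even when no rollout is fully correct."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```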

During GRPO training, the vision encoder of Qwen2.5‑VL 7B was frozen while the language model parameters were updated.
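
A minimal sketch of that setup with Hugging Face transformers; the checkpoint name and the `visual` attribute are assumptions based on the public Qwen2.5‑VL release, so verify them against your transformers version:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision tower; only the language-model weights get gradients.
for param in model.visual.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e9:.2f}B")
```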

Competition Results

The combined SFT+GRPO pipeline topped the development leaderboard and secured second place on the final leaderboard. Compared to the baseline Qwen2‑VL 7B SFT model, data‑compressed SFT improved the score by 0.22 points, and hard‑sample GRPO added another 0.13 points, reaching a total of 2.86.

Visualizations showed the model’s ability to describe image content, identify three distortion types (edge aliasing, under‑exposure, low sharpness), and highlight the most severe issue.

Future Outlook

The team plans to build a full‑chain intelligent video processing system comprising content & distortion analysis, automated video image processing, and effect evaluation, all driven by MLLM. While the current pipeline uses rule‑based model selection, future work will let a trained MLLM handle model selection and feedback loops, further enhancing efficiency in video quality workflows.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

reinforcement learning · SFT · video quality assessment · GRPO · multimodal LLM · MIPI competition
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.