Top 8 Tencent Youtu Papers Accepted at ICCV 2025: Innovations in AI and Vision

The 20th International Conference on Computer Vision (ICCV 2025) accepted 8 papers from Tencent Youtu Lab, covering stylized face recognition, AI‑generated image detection, heterogeneous knowledge distillation, multi‑conditional diffusion, multimodal LLM distillation, palmprint recognition, low‑light vision, and oracle bone script decipherment. Each pushes the frontier of computer vision and AI research.


1. Stylized‑Face: A Million‑Level Stylized Face Dataset for Face Recognition

Stylized‑Face provides a large‑scale dataset for stylized face recognition, containing 4.6 million images of 62,000 identities across diverse artistic styles such as anime, painting, and cyberpunk. A semi‑automatic cleaning pipeline removes duplicates, verifies labels, and filters low‑quality samples. Three benchmark subsets evaluate (i) intra‑distribution performance, (ii) cross‑method generalization, and (iii) cross‑style robustness. Models trained on this dataset achieve a 15.9% gain in true‑accept rate (TAR) at a false‑accept rate (FAR) of 1e‑4 and a 13.3% gain in cross‑method TAR at FAR = 1e‑3 over prior baselines.
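For context on the headline metric, here is a minimal sketch of how TAR at a fixed FAR is computed from pair‑similarity scores; the function name and the synthetic scores are illustrative, not from the paper.

```python
import numpy as np

def tar_at_far(genuine_scores: np.ndarray, impostor_scores: np.ndarray,
               far: float) -> float:
    """True-accept rate at a fixed false-accept rate.

    genuine_scores: similarities for same-identity pairs.
    impostor_scores: similarities for different-identity pairs.
    """
    # Pick the threshold so that only a `far` fraction of impostor
    # pairs would be (wrongly) accepted.
    threshold = np.quantile(impostor_scores, 1.0 - far)
    # TAR is the fraction of genuine pairs scoring above that threshold.
    return float(np.mean(genuine_scores > threshold))

# Toy example with synthetic similarity scores.
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 10_000)
impostor = rng.normal(0.2, 0.1, 1_000_000)
print(f"TAR@FAR=1e-4: {tar_at_far(genuine, impostor, 1e-4):.4f}")
```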

[Figure: Stylized‑Face dataset illustration]

2. AIGI‑Holmes: Explainable and Generalizable AI‑Generated Image Detection via Multimodal Large Language Models

The work addresses two gaps in AI‑generated image (AIGI) detection: the lack of human‑verifiable explanations and limited out‑of‑distribution (OOD) generalization to next‑generation multimodal generative models. A comprehensive dataset, Holmes‑Set, is built with two components: Holmes‑SFTSet for instruction fine‑tuning and Holmes‑DPOSet for human‑aligned preference data. A multi‑expert review mechanism leverages structured responses from multimodal large language models (MLLMs) to improve data quality. The training pipeline, Holmes Pipeline, consists of visual‑expert pre‑training, supervised fine‑tuning (SFT), and direct preference optimization (DPO). At inference, a collaborative decoding strategy fuses visual‑expert logits with the MLLM's semantic reasoning, boosting OOD detection performance. Extensive experiments on three benchmarks demonstrate superior detection accuracy and more interpretable explanations.
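A minimal sketch of what such a collaborative decoding step could look like, assuming the visual expert emits binary real/fake logits that bias the MLLM's decision tokens; the token IDs and fusion weight `alpha` are placeholders rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def collaborative_decode_step(mllm_logits: torch.Tensor,
                              expert_logits: torch.Tensor,
                              real_id: int, fake_id: int,
                              alpha: float = 1.0) -> torch.Tensor:
    """One decoding step that biases the MLLM's real/fake token logits
    with a visual expert's binary prediction.

    mllm_logits:   (vocab_size,) next-token logits from the MLLM.
    expert_logits: (2,) logits from the visual expert, [real, fake].
    alpha:         fusion weight (a free hyperparameter in this sketch).
    """
    fused = mllm_logits.clone()
    expert_logprobs = F.log_softmax(expert_logits, dim=-1)
    # Shift the two decision tokens toward the expert's belief.
    fused[real_id] += alpha * expert_logprobs[0]
    fused[fake_id] += alpha * expert_logprobs[1]
    return fused
```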

[Figure: AIGI‑Holmes architecture diagram]

3. Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

Most knowledge‑distillation (KD) methods assume homogeneous teacher‑student architectures, which limits performance when the two networks differ. Fuse Before Transfer fuses teacher knowledge before transfer by directly integrating the convolution, attention, and MLP modules of teacher and student, thereby aligning their inductive biases. To handle spatial heterogeneity, the method replaces pixel‑wise MSE with a spatially smoothed InfoNCE loss for feature alignment. Evaluations on CIFAR‑100 and ImageNet‑1K with CNN, ViT, and MLP teachers and various students show gains of up to 11.47% on CIFAR‑100 and 3.67% on ImageNet‑1K.
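To make the loss swap concrete, the sketch below aligns teacher and student feature maps with an InfoNCE objective; the average pooling standing in for spatial smoothing and the temperature value are assumptions of this sketch, not the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def infonce_align(student_feat: torch.Tensor, teacher_feat: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE) alignment of student and teacher feature maps.

    Both inputs are (B, C, H, W) with matching channel counts; average
    pooling stands in here for the paper's spatial smoothing.
    """
    # Smooth and flatten spatial dimensions, then L2-normalize.
    s = F.normalize(F.adaptive_avg_pool2d(student_feat, 4).flatten(1), dim=1)
    t = F.normalize(F.adaptive_avg_pool2d(teacher_feat, 4).flatten(1), dim=1)
    # Each student sample's positive is the teacher feature of the same image;
    # all other samples in the batch act as negatives.
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```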

[Figure: Fusion architecture illustration]

4. UniCombine: Unified Multi‑Conditional Generation with Diffusion Transformer

UniCombine tackles the challenge of jointly conditioning diffusion‑based image synthesis on multiple signals (e.g., text prompts, spatial maps, subject images). Built on the Diffusion Transformer (DiT), it introduces an MMDiT attention mechanism that attends to all condition tokens simultaneously, plus LoRA‑based trainable adapters for parameter‑efficient fine‑tuning. Two variants are offered: a training‑free version with frozen adapters and a training‑based version that learns them. The authors also release SubjectSpatial200K, a 200K‑sample dataset covering subject‑driven and spatial‑alignment conditions. Experiments across diverse multi‑conditional generation tasks achieve state‑of‑the‑art performance.
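As a rough illustration of joint conditioning, this sketch concatenates image tokens with every condition stream in a single attention call; it is a simplified stand‑in for the attention mechanism described above, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiConditionAttention(nn.Module):
    """Joint attention over image tokens and several condition streams.

    A minimal sketch of attending to all condition tokens at once;
    UniCombine's actual conditional MMDiT block differs in detail.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens, condition_streams):
        # image_tokens:      (B, N_img, dim)
        # condition_streams: list of (B, N_i, dim) tensors (text, spatial
        # map, subject image, ...), concatenated into one joint sequence.
        joint = torch.cat([image_tokens, *condition_streams], dim=1)
        out, _ = self.attn(joint, joint, joint)
        # Return only the updated image tokens.
        return out[:, :image_tokens.size(1)]
```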

[Figure: UniCombine generation results]

5. LLaVA‑KD: A Framework for Distilling Multimodal Large Language Models

LLaVA‑KD transfers knowledge from large‑scale multimodal LLMs (l‑MLLMs) to smaller, resource‑efficient counterparts (s‑MLLMs) through two distillation objectives:

Multimodal Distillation (MDist): aligns visual‑language embeddings between teacher and student.

Relation Distillation (RDist): transfers the teacher’s ability to capture relationships among visual tokens.

Training then proceeds through a three‑stage pipeline:

1. Distillation pre‑training to align visual‑language embeddings.

2. Supervised fine‑tuning for multimodal understanding.

3. Distillation fine‑tuning to refine the student while keeping its architecture unchanged.

Extensive experiments on multiple benchmarks show consistent performance gains over baseline s‑MLLMs. Paper link: https://arxiv.org/abs/2410.16236
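A hedged sketch of how the two distillation objectives above might be written: MDist as a temperature‑scaled KL divergence over token logits, RDist as matching pairwise visual‑token similarities. The temperature and the MSE relation loss are assumptions of this sketch, not the paper's exact formulation.

```python
import torch.nn.functional as F

def mdist_loss(student_logits, teacher_logits, tau: float = 2.0):
    """MDist sketch: temperature-scaled KL between token distributions."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

def rdist_loss(student_vis, teacher_vis):
    """RDist sketch: match pairwise similarities among visual tokens.

    Inputs are (B, N, D) visual-token features from student and teacher.
    """
    s = F.normalize(student_vis, dim=-1)
    t = F.normalize(teacher_vis, dim=-1)
    rel_s = s @ s.transpose(1, 2)  # (B, N, N) student token relations
    rel_t = t @ t.transpose(1, 2)  # (B, N, N) teacher token relations
    return F.mse_loss(rel_s, rel_t)
```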

[Figure: LLaVA‑KD pipeline diagram]

6. Unified Adversarial Augmentation for Improving Palmprint Recognition

This work introduces a unified adversarial augmentation framework that improves the robustness of palmprint recognizers to geometric deformation and texture degradation.

Adversarial training incorporates recognizer feedback to generate challenging samples.

A spatial‑transform module applies geometric perturbations, while a novel identity‑preserving texture module introduces realistic texture variations without breaking identity.

A dynamic sampling strategy selects the most informative adversarial examples, improving training efficiency.

Experiments on both constrained and challenging palmprint datasets demonstrate superior recognition accuracy compared with conventional augmentation techniques.
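The sketch below shows one adversarial update to geometric warp parameters driven by recognizer feedback; `stn` and `recognizer` are hypothetical stand‑ins for the paper's modules, and the signed‑gradient step is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def adversarial_geometric_step(recognizer, stn, images, labels, theta,
                               step_size: float = 0.01):
    """One adversarial update to geometric warp parameters.

    `stn` (a differentiable spatial-transform module) and `recognizer`
    (returning class logits) are placeholders for the paper's components.
    """
    theta = theta.detach().requires_grad_(True)
    warped = stn(images, theta)              # geometrically perturbed palms
    loss = F.cross_entropy(recognizer(warped), labels)
    # Ascend the recognizer's loss so the warp becomes harder, not easier.
    (grad,) = torch.autograd.grad(loss, theta)
    return (theta + step_size * grad.sign()).detach()
```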

[Figure: Palmprint adversarial augmentation results]

7. From Enhancement to Understanding: Semantically Consistent Unsupervised Fine‑tuning for Low‑Light Vision

The paper proposes a unified framework that simultaneously enhances low‑light images and improves downstream task performance.

A pretrained diffusion model provides zero‑shot image enhancement.

Semantic‑consistent unsupervised fine‑tuning introduces image prompts that explicitly encode perceived illumination.

A cyclic attention adapter maximizes the semantic potential of the enhanced images.

Two consistency losses are employed: (i) image‑description consistency to preserve high‑level semantics, and (ii) reflection consistency to maintain spatial coherence.

The method achieves state‑of‑the‑art results on low‑light classification, detection, and segmentation benchmarks.
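One possible reading of the two consistency terms, using embedding cosine distance for image‑description consistency and a Retinex‑style ratio for reflection consistency; both are interpretations of the paper's description, not its exact losses.

```python
import torch
import torch.nn.functional as F

def description_consistency(embed_low: torch.Tensor,
                            embed_enhanced: torch.Tensor) -> torch.Tensor:
    """Keep high-level semantics stable across enhancement: cosine distance
    between embeddings of the low-light and enhanced images (a simplification
    of the paper's image-description consistency)."""
    return 1.0 - F.cosine_similarity(embed_low, embed_enhanced, dim=-1).mean()

def reflection_consistency(enhanced: torch.Tensor, low_light: torch.Tensor,
                           eps: float = 1e-6) -> torch.Tensor:
    """Retinex-style check: dividing each image by its own illumination
    should yield the same reflectance. Illumination is approximated here
    by the per-pixel channel maximum."""
    illum_e = enhanced.max(dim=1, keepdim=True).values.clamp_min(eps)
    illum_l = low_light.max(dim=1, keepdim=True).values.clamp_min(eps)
    return F.l1_loss(enhanced / illum_e, low_light / illum_l)
```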

[Figure: Low‑light enhancement and understanding results]

8. OracleFusion: Assisting Oracle Bone Script Decipherment with Structurally Constrained Semantic Typography

OracleFusion presents a two‑stage semantic typography framework for oracle bone script analysis.

Stage 1 employs a multimodal LLM enhanced with spatial‑aware reasoning (SAR) to locate key glyph components.

Stage 2 fuses structural vector constraints (SOVF) to generate semantically rich vector fonts while preserving glyph geometry.

Qualitative and quantitative evaluations demonstrate superior readability, aesthetic quality, and the ability to provide expert‑level insights for unseen characters. Paper link: https://arxiv.org/abs/2506.21101
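A schematic sketch of the two‑stage hand‑off described above; `mllm.analyze` and `vector_generator.render` are hypothetical interfaces used only to show how Stage 1's analysis feeds Stage 2's constrained generation.

```python
from dataclasses import dataclass, field

@dataclass
class GlyphAnalysis:
    key_components: list[str]                         # parts located in Stage 1
    layout_hints: dict = field(default_factory=dict)  # spatial relations among parts

def oraclefusion_pipeline(glyph_image, mllm, vector_generator):
    # Stage 1: spatially aware reasoning locates the key glyph components.
    analysis: GlyphAnalysis = mllm.analyze(glyph_image)
    # Stage 2: generate a vector font under structural constraints so the
    # output stays faithful to the original glyph geometry.
    return vector_generator.render(
        glyph_image,
        components=analysis.key_components,
        constraints=analysis.layout_hints,
    )
```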

[Figure: OracleFusion glyph generation]
