
21 Ant Research Papers Shaping CVPR 2025: AI Image & Video Generation Breakthroughs

The Interactive Intelligence Lab of Ant Technology Research Institute presented 21 accepted CVPR 2025 papers covering visual generation, editing, 3D vision, digital humans and multimodal AI, highlighting tools such as MagicQuill, Lumos, Aurora, FLARE, LeviTor, MangaNinja, AniDoc, Mimir, AvatarArtist, DiffListener, MotionStone, TensorialGaussianAvatars, DualTalk, CompreCap and Uni-AD.

AntTech

CVPR 2025 was held in Nashville from June 11–15, 2025. The conference received 13,008 submissions (a 13% increase over the previous year) and accepted 2,878 papers, for an acceptance rate of 22.1%.

The Ant Technology Research Institute’s Interactive Intelligence Lab had 21 papers accepted, spanning visual generation, visual editing, 3D vision, digital humans and related research directions.

MagicQuill

MagicQuill is an interactive AI image‑editing tool that integrates an editing processor, a painting assistant and a creative collector. Users edit images with three intuitive magic brushes—add, delete and color—while a multimodal large language model dynamically predicts user intent and offers editing suggestions.

<code>Paper: https://arxiv.org/abs/2411.09703
Code: https://github.com/ant-research/MagicQuill
HuggingFace Demo: https://huggingface.co/spaces/AI4Editing/MagicQuill
ModelScope Demo: https://modelscope.cn/studios/ant-research/MagicQuill_demo</code>

Lumos

Lumos is a purely visual training framework for image‑to‑image (I2I) generation that demonstrates the feasibility and scalability of learning I2I models in a self‑supervised manner from in‑the‑wild images. The I2I model serves as a strong visual prior for text‑to‑image (T2I) tasks, achieving comparable or better performance with only one‑tenth of the text‑image pairs used for fine‑tuning.

Lumos also shows advantages on text‑independent visual generation tasks such as image‑to‑3D and image‑to‑video conversion.

<code>Paper: https://arxiv.org/abs/2412.07767
Code: https://github.com/ant-research/lumos</code>

Aurora

Aurora is a GAN‑based text‑to‑image generator that employs a mixture‑of‑experts (MoE) architecture with a sparse router to adaptively select the most suitable expert for each feature point, addressing the scaling challenges of GANs.

It uses a two‑stage training strategy: a base model is learned at 64×64 resolution, followed by an up‑sampler that generates 512×512 images, narrowing the performance gap with industrial diffusion models while maintaining fast inference.

<code>Paper: https://arxiv.org/abs/2309.03904
Code: https://github.com/ant-research/Aurora</code>
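The sparse-routing idea can be illustrated with a minimal top‑1 mixture‑of‑experts sketch: a router scores each feature point, and only the highest-scoring expert processes it. All weights, shapes, and the linear experts below are illustrative assumptions, not Aurora's actual architecture.

```python
import numpy as np

def sparse_moe(features, router_w, expert_ws):
    """Top-1 sparse MoE routing over feature points -- a minimal sketch
    of the idea behind a sparse router; the linear experts and all
    shapes here are illustrative assumptions."""
    logits = features @ router_w                   # (N, num_experts) scores
    choice = logits.argmax(axis=-1)                # best expert per point
    out = np.empty_like(features)
    for e, w in enumerate(expert_ws):
        sel = choice == e
        if sel.any():                              # only the chosen expert runs
            out[sel] = features[sel] @ w
    return out, choice

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))                    # 6 feature points, dim 4
router = rng.normal(size=(4, 3))                   # scores for 3 experts
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
routed, chosen = sparse_moe(feats, router, experts)
```

Because each point activates only one expert, compute stays roughly constant as experts are added, which is the property that lets this style of architecture scale.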

FLARE

FLARE is a feed‑forward model that estimates high‑quality camera poses and 3D geometry from 2–8 uncalibrated sparse‑view images. It adopts a cascaded learning paradigm in which pose estimation guides subsequent geometry and appearance learning, achieving state‑of‑the‑art results with inference in under 0.5 seconds.

<code>Paper: https://arxiv.org/abs/2502.12138
Code: https://github.com/ant-research/FLARE</code>

LeviTor

LeviTor introduces depth‑aware 3D trajectory control for image‑to‑video synthesis. By abstracting object masks into clustered points enriched with depth and instance information, it enables precise control of object motion in three‑dimensional space, outperforming existing 2D drag‑based methods.

<code>Paper: https://arxiv.org/abs/2412.15214
Code: https://github.com/ant-research/LeviTor</code>
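The mask‑to‑points abstraction can be sketched as follows: cluster the mask's pixels into a handful of control points and attach each cluster's mean depth, yielding (x, y, depth) anchors a user could drag. The plain k‑means clustering and the choice of k are assumptions for illustration, not LeviTor's exact procedure.

```python
import numpy as np

def mask_to_control_points(mask, depth, k=3, iters=10, seed=0):
    """Abstract a binary object mask into at most k control points,
    each carrying (x, y, mean depth) -- a sketch of depth-aware
    trajectory anchors. Plain k-means and k itself are assumptions."""
    ys, xs = np.nonzero(mask)                      # pixel coords of the object
    pts = np.stack([xs, ys], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):                         # plain k-means iterations
        d = ((pts[:, None] - centers[None]) ** 2).sum(-1)
        label = d.argmin(axis=1)
        for j in range(k):
            if (label == j).any():
                centers[j] = pts[label == j].mean(axis=0)
    out = []
    for j in range(k):
        sel = label == j
        if sel.any():                              # skip empty clusters
            z = depth[ys[sel], xs[sel]].mean()     # cluster's mean depth
            out.append((centers[j, 0], centers[j, 1], z))
    return out

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                              # a 4x4 square object
depth = np.tile(np.arange(8.0), (8, 1))            # depth grows left to right
cps = mask_to_control_points(mask, depth)
```

Moving such points over time, with their depth values, is what lets a method express 3D motion (e.g. an object approaching the camera) rather than only 2D drags.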

MangaNinja

MangaNinja focuses on reference‑guided line‑art coloring. It employs a patch‑shuffle module to align reference color images with target line drawings and a point‑driven control scheme for fine‑grained color matching, achieving superior coloring accuracy on a self‑collected benchmark.

<code>Paper: https://arxiv.org/abs/2501.08332
Code: https://github.com/ali-vilab/MangaNinjia</code>

AniDoc

AniDoc is a video line‑art coloring tool built on video diffusion models. It automatically converts sketch sequences into colored animation using reference character designs, and can synthesize intermediate frames from a single character image and start/end sketches, dramatically reducing manual labor.

<code>Paper: https://arxiv.org/pdf/2412.14173
Code: https://github.com/ant-research/AniDoc</code>

Mimir

Mimir addresses the mismatch between large language model (LLM) outputs and existing text‑to‑video (T2V) pipelines by introducing a Token Fuser that harmonizes LLM embeddings with the video encoder's text features. This enables high‑quality video generation with strong text understanding, particularly for short captions and motion handling.

<code>Paper: https://arxiv.org/abs/2412.03085
Project page: https://lucaria-academy.github.io/Mimir</code>
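At its simplest, harmonizing two token spaces means mapping the LLM's embeddings into the video model's conditioning dimension and combining the sequences. The single linear projection and plain concatenation below are assumptions sketching the data flow, not Mimir's actual fuser design.

```python
import numpy as np

def token_fuser(llm_tokens, enc_tokens, proj):
    """Fuse LLM token embeddings into a T2V text-conditioning sequence
    -- a minimal sketch; the single linear projection and plain
    concatenation are assumptions, not the paper's exact design."""
    mapped = llm_tokens @ proj                     # LLM dim -> encoder dim
    return np.concatenate([enc_tokens, mapped], axis=0)  # one fused sequence

rng = np.random.default_rng(0)
llm = rng.normal(size=(5, 8))                      # 5 LLM tokens, dim 8
enc = rng.normal(size=(3, 4))                      # 3 encoder tokens, dim 4
proj = rng.normal(size=(8, 4))                     # learned projection (assumed)
fused = token_fuser(llm, enc, proj)
```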

AvatarArtist

AvatarArtist generates animatable 3D avatars from a single portrait image, supporting arbitrary styles. It fuses image diffusion priors with a 4D GAN, using parametric triplanes to represent 4D data and a DiT‑based model to predict these triplanes, while a neural renderer preserves identity across styles.

<code>Paper: https://arxiv.org/abs/240x.xxxxx
Code: https://github.com/ant-research/AvatarArtist</code>

DiffListener

DiffListener generates realistic listener facial feedback (e.g., nods, frowns) from speaker audio and motion cues using high‑resolution diffusion rendering (512×512). It combines a mixed motion modeling module with implicit motion enhancement and pose‑specific controls.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/DiffListener</code>

MotionStone

MotionStone introduces a decoupled motion estimator that separately measures object‑level and camera‑level motion intensity, enabling robust image‑to‑video (I2V) generation. The estimator is trained via contrastive learning on randomly paired videos, and the resulting model achieves state‑of‑the‑art performance on I2V synthesis.

<code>Paper: https://arxiv.org/abs/2412.05848</code>

TensorialGaussianAvatars

TensorialGaussianAvatars encodes 3D Gaussian texture attributes into a compact tensor format, storing neutral‑face appearance in static triplanes and dynamic expression details in lightweight 1‑D feature lines, achieving real‑time rendering with low storage while preserving facial dynamics.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/TensorialGaussianAvatars</code>
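The tensorial split can be sketched as a lookup: a Gaussian's appearance feature comes from static triplanes (neutral face) plus per‑axis 1‑D feature lines indexed by expression. The nearest‑neighbor sampling, additive fusion, and all resolutions below are assumptions for illustration.

```python
import numpy as np

def sample_appearance(p, triplanes, lines, expr_idx):
    """Appearance feature for a 3D Gaussian at normalized position p in
    [0,1]^3: static triplane samples for the neutral face plus lightweight
    1-D expression feature lines. Nearest-neighbor sampling and additive
    fusion are assumptions sketching the tensorial layout."""
    R = triplanes[0].shape[0]                      # plane resolution
    i, j, k = np.clip((p * (R - 1)).astype(int), 0, R - 1)
    # Static neutral-face appearance: xy, xz and yz plane lookups.
    f = triplanes[0][i, j] + triplanes[1][i, k] + triplanes[2][j, k]
    # Dynamic expression detail: one lightweight 1-D line per axis.
    L = lines.shape[2]
    idx = np.clip((p * (L - 1)).astype(int), 0, L - 1)
    return f + sum(lines[expr_idx, a, idx[a]] for a in range(3))

rng = np.random.default_rng(0)
planes = [rng.normal(size=(16, 16, 4)) for _ in range(3)]  # 3 planes, C=4
lines = rng.normal(size=(2, 3, 32, 4))             # 2 expressions, 3 axes, L=32
feat = sample_appearance(np.array([0.2, 0.5, 0.8]), planes, lines, 1)
```

Storing dynamics in 1‑D lines rather than full per‑expression planes is what keeps storage low: the dynamic part grows linearly in resolution instead of quadratically.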

DualTalk

DualTalk tackles 3D facial motion generation for dialogue scenes, jointly modeling speaking and listening behaviors to ensure smooth role transitions. It introduces a 50‑hour multi‑turn dialogue dataset with 1,000 identities and releases code and data for the community.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/DualTalk</code>

CompreCap

CompreCap is a detailed caption benchmark that evaluates large vision‑language models (LVLMs) using directed scene graphs. It measures object‑level coverage, attribute accuracy, and relational scores, showing strong correlation with human judgments.

<code>Paper: https://arxiv.org/abs/2412.08614
Code: https://github.com/LuFan31/CompreCap</code>
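The object‑level coverage idea reduces to a simple check: what fraction of the objects annotated in the scene graph does a caption actually mention? The substring matching and toy scene graph below are a simplified sketch; CompreCap's real evaluation uses richer, model‑based matching over directed graphs with attributes and relations.

```python
def coverage_score(caption, scene_graph):
    """Object-level coverage: the fraction of annotated scene-graph
    objects mentioned in the caption. A simplified sketch -- real
    benchmarks use model-based matching, not substring search."""
    text = caption.lower()
    hit = [obj for obj in scene_graph if obj in text]
    return len(hit) / len(scene_graph)

# Hypothetical annotation: objects mapped to their attributes.
graph = {"dog": ["brown"], "frisbee": ["red"], "grass": []}
score = coverage_score("A brown dog catches a red frisbee.", graph)
# "grass" is never mentioned, so coverage is 2/3.
```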

Uni‑AD

Uni‑AD is a unified framework for audio description generation, leveraging a pretrained multimodal backbone with interleaved video‑text sequences, a lightweight video‑text feature mapper, and a role‑optimization module to produce fluent, context‑aware narrations for visually impaired audiences.

<code>Paper: https://arxiv.org/abs/2403.12922</code>
Tags: computer vision, video generation, generative AI, image editing, multimodal models, CVPR 2025
Written by AntTech. Technology is the core driver of Ant's future creation.