
21 Ant Research Papers Shaping CVPR 2025: AI Image & Video Generation Breakthroughs

The Interactive Intelligence Lab of Ant Technology Research Institute presented 21 accepted CVPR 2025 papers covering visual generation, editing, 3D vision, digital humans and multimodal AI, highlighting tools such as MagicQuill, Lumos, Aurora, FLARE, LeviTor, MangaNinja, AniDoc, Mimir, AvatarArtist, DiffListener, MotionStone, TensorialGaussianAvatars, DualTalk, CompreCap and Uni-AD.

AntTech

CVPR 2025 was held in Nashville from June 11–15, 2025. The conference received 13,008 submissions (a 13% increase over the previous year) and accepted 2,878 papers, for an acceptance rate of 22.1%.

The Ant Technology Research Institute’s Interactive Intelligence Lab had 21 papers accepted, spanning visual generation, visual editing, 3D vision, digital humans and related research directions.

MagicQuill

MagicQuill is an interactive AI image‑editing tool that integrates an editing processor, a painting assistant and a creative collector. Users edit images with three intuitive magic brushes—add, delete and color—while a multimodal large language model dynamically predicts user intent and offers editing suggestions.

<code>Paper: https://arxiv.org/abs/2411.09703
Code: https://github.com/ant-research/MagicQuill
HuggingFace Demo: https://huggingface.co/spaces/AI4Editing/MagicQuill
ModelScope Demo: https://modelscope.cn/studios/ant-research/MagicQuill_demo</code>

Lumos

Lumos is a purely visual training framework for image‑to‑image (I2I) generation that demonstrates the feasibility and scalability of learning I2I models in a self‑supervised manner from in‑the‑wild images. The I2I model serves as a strong visual prior for text‑to‑image (T2I) tasks, achieving comparable or better performance with only one‑tenth of the text‑image pairs used for fine‑tuning.

Lumos also shows advantages on text‑independent visual generation tasks such as image‑to‑3D and image‑to‑video conversion.

<code>Paper: https://arxiv.org/abs/2412.07767
Code: https://github.com/ant-research/lumos</code>

Aurora

Aurora is a GAN‑based text‑to‑image generator that employs a mixture‑of‑experts (MoE) architecture with a sparse router to adaptively select the most suitable expert for each feature point, addressing the scaling challenges of GANs.

It uses a two‑stage training strategy: a base model is learned at 64×64 resolution, followed by an up‑sampler that generates 512×512 images, narrowing the performance gap with industrial diffusion models while maintaining fast inference.

<code>Paper: https://arxiv.org/abs/2309.03904
Code: https://github.com/ant-research/Aurora</code>
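The sparse-routing idea can be illustrated with a minimal top‑1 mixture‑of‑experts sketch: a router scores each feature point, and only the highest-scoring expert processes it. All weights, shapes, and the linear experts below are illustrative assumptions, not Aurora's actual architecture.

```python
import numpy as np

def sparse_moe(features, router_w, expert_ws):
    """Top-1 sparse MoE routing over feature points -- a minimal sketch
    of the idea behind a sparse router; the linear experts and all
    shapes here are illustrative assumptions."""
    logits = features @ router_w                   # (N, num_experts) scores
    choice = logits.argmax(axis=-1)                # best expert per point
    out = np.empty_like(features)
    for e, w in enumerate(expert_ws):
        sel = choice == e
        if sel.any():                              # only the chosen expert runs
            out[sel] = features[sel] @ w
    return out, choice

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))                    # 6 feature points, dim 4
router = rng.normal(size=(4, 3))                   # scores for 3 experts
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
routed, chosen = sparse_moe(feats, router, experts)
```

Because each point activates only one expert, compute stays roughly constant as experts are added, which is the property that lets this style of architecture scale.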

FLARE

FLARE is a feed‑forward model that estimates high‑quality camera poses and 3D geometry from 2–8 uncalibrated sparse‑view images. It adopts a cascaded learning paradigm in which pose estimation guides subsequent geometry and appearance learning, achieving state‑of‑the‑art results with inference in under 0.5 seconds.

<code>Paper: https://arxiv.org/abs/2502.12138
Code: https://github.com/ant-research/FLARE</code>

LeviTor

LeviTor introduces depth‑aware 3D trajectory control for image‑to‑video synthesis. By abstracting object masks into clustered points enriched with depth and instance information, it enables precise control of object motion in three‑dimensional space, outperforming existing 2D drag‑based methods.

<code>Paper: https://arxiv.org/abs/2412.15214
Code: https://github.com/ant-research/LeviTor</code>
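The mask‑to‑points abstraction can be sketched as follows: cluster the mask's pixels into a handful of control points and attach each cluster's mean depth, yielding (x, y, depth) anchors a user could drag. The plain k‑means clustering and the choice of k are assumptions for illustration, not LeviTor's exact procedure.

```python
import numpy as np

def mask_to_control_points(mask, depth, k=3, iters=10, seed=0):
    """Abstract a binary object mask into at most k control points,
    each carrying (x, y, mean depth) -- a sketch of depth-aware
    trajectory anchors. Plain k-means and k itself are assumptions."""
    ys, xs = np.nonzero(mask)                      # pixel coords of the object
    pts = np.stack([xs, ys], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):                         # plain k-means iterations
        d = ((pts[:, None] - centers[None]) ** 2).sum(-1)
        label = d.argmin(axis=1)
        for j in range(k):
            if (label == j).any():
                centers[j] = pts[label == j].mean(axis=0)
    out = []
    for j in range(k):
        sel = label == j
        if sel.any():                              # skip empty clusters
            z = depth[ys[sel], xs[sel]].mean()     # cluster's mean depth
            out.append((centers[j, 0], centers[j, 1], z))
    return out

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                              # a 4x4 square object
depth = np.tile(np.arange(8.0), (8, 1))            # depth grows left to right
cps = mask_to_control_points(mask, depth)
```

Moving such points over time, with their depth values, is what lets a method express 3D motion (e.g. an object approaching the camera) rather than only 2D drags.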

MangaNinja

MangaNinja focuses on reference‑guided line‑art coloring. It employs a patch‑shuffle module to align reference color images with target line drawings and a point‑driven control scheme for fine‑grained color matching, achieving superior coloring accuracy on a self‑collected benchmark.

<code>Paper: https://arxiv.org/abs/2501.08332
Code: https://github.com/ali-vilab/MangaNinjia</code>

AniDoc

AniDoc is a video line‑art coloring tool built on video diffusion models. It automatically converts sketch sequences into colored animation using reference character designs, and can synthesize intermediate frames from a single character image and start/end sketches, dramatically reducing manual labor.

<code>Paper: https://arxiv.org/pdf/2412.14173
Code: https://github.com/ant-research/AniDoc</code>

Mimir

Mimir addresses the mismatch between large language model (LLM) outputs and existing text‑to‑video (T2V) pipelines by introducing a Token Fuser that harmonizes LLM embeddings with the video encoder's text features. This enables high‑quality video generation with strong text understanding, particularly for short captions and motion handling.

<code>Paper: https://arxiv.org/abs/2412.03085
Project page: https://lucaria-academy.github.io/Mimir</code>
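At its simplest, harmonizing two token spaces means mapping the LLM's embeddings into the video model's conditioning dimension and combining the sequences. The single linear projection and plain concatenation below are assumptions sketching the data flow, not Mimir's actual fuser design.

```python
import numpy as np

def token_fuser(llm_tokens, enc_tokens, proj):
    """Fuse LLM token embeddings into a T2V text-conditioning sequence
    -- a minimal sketch; the single linear projection and plain
    concatenation are assumptions, not the paper's exact design."""
    mapped = llm_tokens @ proj                     # LLM dim -> encoder dim
    return np.concatenate([enc_tokens, mapped], axis=0)  # one fused sequence

rng = np.random.default_rng(0)
llm = rng.normal(size=(5, 8))                      # 5 LLM tokens, dim 8
enc = rng.normal(size=(3, 4))                      # 3 encoder tokens, dim 4
proj = rng.normal(size=(8, 4))                     # learned projection (assumed)
fused = token_fuser(llm, enc, proj)
```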

AvatarArtist

AvatarArtist generates animatable 3D avatars from a single portrait image, supporting arbitrary styles. It fuses image diffusion priors with a 4D GAN, using parametric triplanes to represent 4D data and a DiT‑based model to predict these triplanes, while a neural renderer preserves identity across styles.

<code>Paper: https://arxiv.org/abs/240x.xxxxx
Code: https://github.com/ant-research/AvatarArtist</code>

DiffListener

DiffListener generates realistic listener facial feedback (e.g., nods, frowns) from speaker audio and motion cues using high‑resolution diffusion rendering (512×512). It combines a mixed motion modeling module with implicit motion enhancement and pose‑specific controls.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/DiffListener</code>

MotionStone

MotionStone introduces a decoupled motion estimator that separately measures object‑level and camera‑level motion intensity, enabling robust image‑to‑video (I2V) generation. The estimator is trained via contrastive learning on randomly paired videos, and the resulting model achieves state‑of‑the‑art performance on I2V synthesis.

<code>Paper: https://arxiv.org/abs/2412.05848</code>

TensorialGaussianAvatars

TensorialGaussianAvatars encodes 3D Gaussian texture attributes into a compact tensor format, storing neutral‑face appearance in static triplanes and dynamic expression details in lightweight 1‑D feature lines, achieving real‑time rendering with low storage while preserving facial dynamics.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/TensorialGaussianAvatars</code>
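The tensorial split can be sketched as a lookup: a Gaussian's appearance feature comes from static triplanes (neutral face) plus per‑axis 1‑D feature lines indexed by expression. The nearest‑neighbor sampling, additive fusion, and all resolutions below are assumptions for illustration.

```python
import numpy as np

def sample_appearance(p, triplanes, lines, expr_idx):
    """Appearance feature for a 3D Gaussian at normalized position p in
    [0,1]^3: static triplane samples for the neutral face plus lightweight
    1-D expression feature lines. Nearest-neighbor sampling and additive
    fusion are assumptions sketching the tensorial layout."""
    R = triplanes[0].shape[0]                      # plane resolution
    i, j, k = np.clip((p * (R - 1)).astype(int), 0, R - 1)
    # Static neutral-face appearance: xy, xz and yz plane lookups.
    f = triplanes[0][i, j] + triplanes[1][i, k] + triplanes[2][j, k]
    # Dynamic expression detail: one lightweight 1-D line per axis.
    L = lines.shape[2]
    idx = np.clip((p * (L - 1)).astype(int), 0, L - 1)
    return f + sum(lines[expr_idx, a, idx[a]] for a in range(3))

rng = np.random.default_rng(0)
planes = [rng.normal(size=(16, 16, 4)) for _ in range(3)]  # 3 planes, C=4
lines = rng.normal(size=(2, 3, 32, 4))             # 2 expressions, 3 axes, L=32
feat = sample_appearance(np.array([0.2, 0.5, 0.8]), planes, lines, 1)
```

Storing dynamics in 1‑D lines rather than full per‑expression planes is what keeps storage low: the dynamic part grows linearly in resolution instead of quadratically.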

DualTalk

DualTalk tackles 3D facial motion generation for dialogue scenes, jointly modeling speaking and listening behaviors to ensure smooth role transitions. It introduces a 50‑hour multi‑turn dialogue dataset with 1,000 identities and releases code and data for the community.

<code>Paper: https://arxiv.org/abs/2412.xxxxy
Code: https://github.com/ant-research/DualTalk</code>

CompreCap

CompreCap is a detailed caption benchmark that evaluates large vision‑language models (LVLMs) using directed scene graphs. It measures object‑level coverage, attribute accuracy, and relational scores, showing strong correlation with human judgments.

<code>Paper: https://arxiv.org/abs/2412.08614
Code: https://github.com/LuFan31/CompreCap</code>
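The object‑level coverage idea reduces to a simple check: what fraction of the objects annotated in the scene graph does a caption actually mention? The substring matching and toy scene graph below are a simplified sketch; CompreCap's real evaluation uses richer, model‑based matching over directed graphs with attributes and relations.

```python
def coverage_score(caption, scene_graph):
    """Object-level coverage: the fraction of annotated scene-graph
    objects mentioned in the caption. A simplified sketch -- real
    benchmarks use model-based matching, not substring search."""
    text = caption.lower()
    hit = [obj for obj in scene_graph if obj in text]
    return len(hit) / len(scene_graph)

# Hypothetical annotation: objects mapped to their attributes.
graph = {"dog": ["brown"], "frisbee": ["red"], "grass": []}
score = coverage_score("A brown dog catches a red frisbee.", graph)
# "grass" is never mentioned, so coverage is 2/3.
```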

Uni‑AD

Uni‑AD is a unified framework for audio description generation, leveraging a pretrained multimodal backbone with interleaved video‑text sequences, a lightweight video‑text feature mapper, and a role‑optimization module to produce fluent, context‑aware narrations for visually impaired audiences.

<code>Paper: https://arxiv.org/abs/2403.12922</code>
Tags: computer vision, video generation, generative AI, image editing, multimodal models, CVPR 2025
Written by AntTech. Technology is the core driver of Ant's future creation.