
High-Fidelity Image-to-Video Generation for E‑commerce Product Motion with AtomoVideo and Noise Rectification

This article presents Alibaba's research on using diffusion‑based AIGC techniques, including a training‑free Noise Rectification module and the AtomoVideo model, to automatically convert static product images into high‑quality, detail‑preserving video motions for e‑commerce advertising.

DataFunTalk

1. Overview

In today’s e‑commerce landscape, video content is becoming a crucial channel for product marketing thanks to its vivid visual experience and rapid information delivery, yet producing high‑quality videos remains costly and labor‑intensive. Recent advances in AI‑generated content (AIGC) and generative diffusion models make it possible to batch‑produce high‑quality video creatives, and the release of OpenAI’s Sora has highlighted the potential of intelligent video creation. Alimama’s Intelligent Creation and AI Application team has been researching video AIGC for nearly a year, culminating in tools such as the Size Cube and diffusion‑based product video motion generation.

2. Core Technology

Product video motion generation must preserve the product’s exact appearance while adding realistic motion, a fidelity requirement that is critical for merchants. To address this, we built on existing text‑to‑video (T2V) models and introduced a training‑free Noise Rectification module and an upgraded image‑to‑video (I2V) model called AtomoVideo.

2.1 Noise Rectification: Training‑Free Noise Corrector

While text‑to‑image generation has progressed rapidly, video generation lags behind because of its far higher data and compute demands. A common tuning‑free workaround adds a controlled amount of noise to the input image to simulate the training‑time noising process, thereby obtaining a noise prior that retains image information. Directly applying this “image‑padding” T2V method to product motion generation, however, loses fine product details. Our Noise Rectification module corrects this: it compares the model‑predicted noise with the true noise that was actually added and adjusts the prediction toward it, eliminating the first‑frame reconstruction error and preserving temporal consistency across frames, all without any additional training.
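The core idea can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the blend weight, the flattened "latent" vectors, and the function names are all assumptions made for clarity; the paper defines the exact rectification schedule.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    # Standard DDPM forward noising: x_t = sqrt(a)*x0 + sqrt(1-a)*eps
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def rectify_noise(eps_pred, eps_true, weight):
    # Blend the predicted noise toward the ground-truth noise that was
    # actually added to the input image; weight=1 would reproduce the
    # conditioning frame exactly, weight=0 keeps the raw prediction.
    return [weight * t + (1.0 - weight) * p for p, t in zip(eps_pred, eps_true)]

random.seed(0)
n = 16  # a flattened toy "latent"
x0 = [random.gauss(0, 1) for _ in range(n)]
eps_true = [random.gauss(0, 1) for _ in range(n)]
x_t = add_noise(x0, eps_true, alpha_bar_t=0.5)

eps_pred = [e + 0.1 * random.gauss(0, 1) for e in eps_true]  # imperfect model prediction
eps_rect = rectify_noise(eps_pred, eps_true, weight=0.8)

def err(eps):
    return sum(abs(a - b) for a, b in zip(eps, eps_true)) / n

# Rectified noise is strictly closer to the true added noise than the raw prediction.
assert err(eps_rect) < err(eps_pred)
```

Because the rectification only re-weights noise vectors at sampling time, it drops into an existing denoising loop without touching model weights.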

More technical details can be found in our paper:

Title: Tuning‑Free Noise Rectification for High Fidelity Image‑to‑Video Generation

Link: https://arxiv.org/abs/2403.02827

Project homepage: https://noise-rectification.github.io/

The following figure shows the Noise Rectification workflow and the before‑after comparison, demonstrating significantly improved fidelity to the original product image.

2.2 AtomoVideo: High‑Fidelity I2V Model Upgrade

To further improve temporal consistency and image fidelity, we developed AtomoVideo, an I2V model specifically tuned for e‑commerce product motion. The model incorporates several enhancements:

High‑quality dataset construction: We collected millions of text‑video pairs, filtered for visual appeal, textual relevance, and product‑centric content, and collaborated with designers to create a premium dataset of high‑resolution videos.

Multi‑granularity image injection: Both low‑level and high‑level image semantics are injected into the diffusion model, preserving fine product details while ensuring temporal coherence.

Progressive motion intensity training: A multi‑stage training regime gradually increases motion intensity, allowing the model to generate larger motions without sacrificing stability.
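The multi‑granularity injection above can be sketched at the shape level. The two injection paths, channel‑wise concatenation for low‑level detail and an extended cross‑attention context for high‑level semantics, follow common I2V practice; the channel counts, token counts, and function names here are illustrative assumptions, not AtomoVideo's actual configuration.

```python
C_LAT = 4                  # assumed VAE latent channels
C_IN = C_LAT + C_LAT + 1   # noisy latent + image latent + conditioning-frame mask

def unet_input(noisy_latent, image_latent, frame_mask):
    # Low-level injection: concatenate the clean image latent (and a mask
    # marking the conditioning frame) onto the noisy latent, channel-wise.
    # Channels are represented as dummy scalars here.
    return noisy_latent + image_latent + frame_mask

def cross_attention_context(text_emb, image_emb):
    # High-level injection: append image semantic tokens to the text tokens
    # so every frame can attend to the product's global appearance.
    return text_emb + image_emb

noisy = [0.0] * C_LAT
img = [1.0] * C_LAT
mask = [1.0]               # 1 marks the conditioning (first) frame
x_in = unet_input(noisy, img, mask)
assert len(x_in) == C_IN

ctx = cross_attention_context(["text_tok"] * 77, ["img_tok"] * 16)
assert len(ctx) == 77 + 16
```

The concatenation path carries pixel‑accurate detail into every denoising step, while the attention path keeps later frames semantically anchored to the product even as motion grows.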

After large‑scale training, the upgraded model can generate 4‑second, 720p videos. Details are described in our technical report:

Title: AtomoVideo: High Fidelity Image‑to‑Video Generation

Link: http://arxiv.org/abs/2403.01800

Project homepage: https://atomo-video.github.io

We also froze the original T2I backbone parameters and trained only the newly added temporal‑modeling and input layers, which enables seamless integration with community ControlNet models for localized control and better product‑detail preservation.
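The training‑scope rule amounts to a name‑based parameter filter. A minimal sketch, assuming an illustrative naming convention (the prefixes and parameter names below are invented, not AtomoVideo's actual module names):

```python
# Pretrained T2I (spatial) weights stay frozen; only newly added
# temporal and input layers receive gradient updates.
NEW_MODULE_PREFIXES = ("temporal_", "input_proj")  # assumed convention

def is_trainable(param_name):
    return param_name.startswith(NEW_MODULE_PREFIXES)

params = [
    "unet.down_block.0.attn.q",    # frozen T2I spatial weight
    "unet.mid_block.resnet.conv",  # frozen T2I spatial weight
    "temporal_attn.0.qkv",         # new temporal layer -> trained
    "input_proj.conv_in",          # widened input layer -> trained
]

trainable = [p for p in params if is_trainable(p)]
frozen = [p for p in params if not is_trainable(p)]
assert trainable == ["temporal_attn.0.qkv", "input_proj.conv_in"]
assert len(frozen) == 2
```

Because the spatial backbone is byte‑for‑byte the community T2I model, adapters trained against it, such as ControlNet, plug in without retraining.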

2.3 Motion Scene Template Adaptation

Since many product images lack inherent motion cues, we co‑created a set of video motion scene templates (e.g., "mountain clouds", "underwater world") with designers, providing specialized dynamic descriptions for each scenario. During major promotional events, dedicated festive templates are also deployed.
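One plausible way to realize such templates is a lookup from template name to a canned dynamic description that is appended to the product caption. The template names below come from this article; the prompt strings and the `build_prompt` helper are hypothetical illustrations, not the production implementation.

```python
# Hypothetical scene-template table: name -> motion description.
TEMPLATES = {
    "mountain clouds": "clouds drifting slowly over distant mountains, soft light",
    "underwater world": "gentle underwater currents, rising bubbles, shifting light rays",
}

def build_prompt(product_caption, template_name):
    # Combine the static product description with the template's
    # specialized dynamic description for the I2V model.
    return f"{product_caption}, {TEMPLATES[template_name]}"

p = build_prompt("a bottle of perfume on a stone pedestal", "underwater world")
assert "rising bubbles" in p
```

Festive templates for major promotions would simply be extra entries in the same table.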

| Scene Template Name | AIGC Image | Motion Video |
| --- | --- | --- |
| Pink Fireworks | (image not included) | (video not included) |
| Pink World | (image not included) | (video not included) |

3. Business Applications

The product video motion generation service is now live on Alimama’s Wanxiang Lab and advertising platform, allowing advertisers to generate motion videos on demand. Below are several examples of input images and the corresponding generated videos.

Additional rows omitted for brevity

4. Conclusion and Outlook

In the past six months, AIGC video generation has made remarkable strides with models such as Gen‑2, Pika 1.0, and Sora, ushering in a creative revolution in media production. Our work demonstrates how diffusion‑based video generation, combined with controllable‑generation techniques, can bring static e‑commerce images to life and has reached practical deployment in advertising.

Nevertheless, challenges remain in video stability, controllability, and length. Future research will focus on improving these aspects and exploring broader business scenarios, inspired by emerging technologies like Diffusion Transformers and scaling‑up methods.

About Alimama

We are the Alimama Intelligent Creation and AI Application team, dedicated to AI‑driven generation of images, videos, and copy across our business lines. We welcome collaboration and talent with CV/NLP backgrounds; interested candidates can send their resumes to [email protected].

Tags: e-commerce, diffusion model, AIGC, image-to-video, AtomoVideo, Noise Rectification
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.