
Understanding Motion Decomposition in AI-Based Image Animation: From Sparse to Dense Optical Flow

The article details how AI-based image animation and face-swapping pipelines decompose video motion via Taylor expansion into zero-order rigid and first-order affine components. Key ingredients are unsupervised U-Net keypoint extraction, sparse-to-dense optical flow conversion, and dense motion networks that learn masks for region-wise rigidity and non-rigid deformation.

Baidu Geek Talk

This article provides a comprehensive technical analysis of AI-based image animation and face swapping technologies, focusing on motion decomposition using Taylor expansion. The author discusses how video motion information can be decomposed into zero-order and first-order motion components, which provides valuable optical flow information for image animation—a key method for face swapping technology.

The article begins by introducing the concept of image animation, where a static image is animated to follow the movements in a driving video. The core challenge lies in three aspects: how to characterize motion information, how to extract motion from the driving video, and how to apply the extracted motion to deform the static image.

The author explains that motion information is typically represented using dense optical flow maps, which describe pixel-level movement between consecutive frames. However, in practice, obtaining dense optical flow maps directly is challenging, so the approach uses sparse optical flow maps (keypoint correspondences) as a foundation.
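The gap between the two representations can be made concrete with a minimal NumPy sketch (illustrative sizes and random keypoints, not the article's actual pipeline): a dense flow map stores one displacement per pixel, while sparse flow stores one displacement per keypoint, which must then be densified.

```python
import numpy as np

H, W, K = 64, 64, 10  # frame size and number of keypoints (illustrative)

# Sparse flow: K keypoint correspondences between source and driving frame.
rng = np.random.default_rng(0)
kp_source = rng.uniform(0, W, size=(K, 2)).astype(np.float32)
kp_driving = kp_source + rng.normal(0, 2.0, size=(K, 2)).astype(np.float32)
sparse_flow = kp_driving - kp_source  # (K, 2): one displacement per keypoint

# Naive densification under a local-rigidity assumption:
# each pixel inherits the displacement of its nearest keypoint.
ys, xs = np.mgrid[0:H, 0:W]
pixels = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
d2 = ((pixels[:, None, :] - kp_source[None, :, :]) ** 2).sum(-1)  # (H*W, K)
nearest = d2.argmin(axis=1)
dense_flow = sparse_flow[nearest].reshape(H, W, 2)  # (H, W, 2): per-pixel flow
```

The nearest-keypoint rule here is only a placeholder; the article's approach replaces it with learned masks, as described below.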

A significant portion of the article focuses on unsupervised keypoint extraction using a U-Net architecture. The method extracts confidence maps for keypoints and fits Gaussian distributions to obtain keypoint center positions and variances. This approach enables generalization beyond human faces to other object categories.
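The Gaussian-fitting step can be sketched as a soft-argmax over the confidence map: the distribution's mean gives the keypoint position and its covariance the spatial spread. This is a minimal sketch of that idea, not the paper's exact implementation:

```python
import numpy as np

def heatmap_to_gaussian(heatmap):
    """Fit a Gaussian to a keypoint confidence map: the mean is the
    keypoint position, the 2x2 covariance its spatial extent."""
    h, w = heatmap.shape
    p = heatmap / heatmap.sum()                  # normalize to a distribution
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (h, w, 2)
    mean = (p[..., None] * grid).sum(axis=(0, 1))           # expected position
    diff = grid - mean
    cov = np.einsum('hw,hwi,hwj->ij', p, diff, diff)        # weighted outer products
    return mean, cov

# A synthetic single-peak map: the recovered mean sits at the peak.
hm = np.zeros((32, 32))
hm[10, 20] = 1.0
mean, cov = heatmap_to_gaussian(hm)   # mean is (20, 10) in (x, y) order
```

Because every step is differentiable, the keypoint positions can be trained end to end without keypoint labels, which is what makes the extraction unsupervised.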

The article then delves into motion decomposition: zero-order decomposition assumes local rigidity around keypoints (all pixels in a region move with the same displacement as the keypoint), while first-order decomposition assumes local affine transformation (more flexible, accounting for rotation, scaling, and shear). The first-order approach uses Taylor expansion to approximate the motion mapping around keypoints, incorporating Jacobian matrices to model more complex deformations.
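The first-order approximation can be written down directly: around keypoint k, the backward warp is approximated as T(z) ≈ p_src + J (z − p_drv), where J composes the source and driving Jacobians. A minimal sketch (with hand-chosen keypoints and Jacobians, purely for illustration):

```python
import numpy as np

def first_order_flow(z, kp_src, kp_drv, jac_src, jac_drv):
    """First-order Taylor approximation of the warp around one keypoint:
    T(z) = p_src + J_src @ inv(J_drv) @ (z - p_drv).
    Zero-order is the special case where both Jacobians are identity."""
    J = jac_src @ np.linalg.inv(jac_drv)   # composed local affine transform
    return kp_src + (z - kp_drv) @ J.T

# Example: the driving keypoint has moved and its local patch rotated 90 deg.
kp_src = np.array([10.0, 10.0])
kp_drv = np.array([12.0, 14.0])
R90 = np.array([[0.0, -1.0], [1.0, 0.0]])  # local rotation in driving frame
z = kp_drv + np.array([1.0, 0.0])          # pixel one step right of keypoint

warped = first_order_flow(z, kp_src, kp_drv, np.eye(2), R90)
# Zero-order would map z to (11, 10); the Jacobian term rotates the
# offset around the keypoint and maps it to (10, 9) instead.
```

The extra Jacobian term is exactly what lets the first-order model capture rotation, scaling, and shear around each keypoint rather than pure translation.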

The Dense Motion Network is described as learning masks to identify regions where the rigidity/affine assumptions hold, and separately predicting non-rigid deformations. The article references the First Order Motion Model for Image Animation (NeurIPS 2019) as the primary source for these concepts.
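The mask-based combination can be sketched as a softmax-weighted mixture of per-keypoint candidate flows. In the sketch below the masks are given rather than predicted by a network, so it shows only the combination step, not the learning:

```python
import numpy as np

def combine_flows(flows, masks):
    """Combine K per-keypoint candidate flows into one dense flow field
    using soft masks (in the real model, a dense-motion network predicts
    the masks; here they are inputs)."""
    # flows: (K, H, W, 2) candidate flows, one per keypoint/region
    # masks: (K, H, W) unnormalized scores, softmax-normalized over K
    masks = np.exp(masks) / np.exp(masks).sum(axis=0, keepdims=True)
    return (masks[..., None] * flows).sum(axis=0)  # (H, W, 2)

K, H, W = 3, 8, 8
rng = np.random.default_rng(0)
flows = rng.normal(size=(K, H, W, 2))
masks = rng.normal(size=(K, H, W))
dense = combine_flows(flows, masks)   # one flow vector per pixel
```

Where a mask saturates toward one keypoint, that region moves rigidly (or affinely) with it; residual non-rigid deformation is handled by the separately predicted component the article mentions.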

Tags: neural networks, face swapping, image animation, optical flow, affine transformation, keypoint detection, motion decomposition, Taylor expansion