Animate Any Image with First Order Motion Model: A Step‑by‑Step Guide
This tutorial explains the First Order Motion Model for animating static images. It covers the algorithm's keypoint‑based motion estimation, the datasets used for training, environment setup with Python, OpenCV and ffmpeg, and complete code snippets for generating animated videos with audio.
Introduction
Animating a static picture can be achieved using the First Order Motion Model, a deep‑learning technique originally presented at NeurIPS 2019. The model can make any image move, such as making a character from "Game of Thrones" speak like a politician or making a horse run.
Algorithm Principles
The First Order Motion Model uses a set of self‑learned keypoints and local affine transformations to build a complex motion model. It consists of two main modules:
Motion estimation module – separates appearance and motion information via self‑supervised learning and creates feature representations.
Image generation module – models occlusions during motion and combines extracted appearance with the motion features to synthesize the final image.
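To make the "keypoints plus local affine transformations" idea concrete, here is a minimal pure‑Python sketch of the first‑order approximation around a single keypoint: a point z in the driving frame is mapped to the source frame as kp_source + J_k (z - kp_driving), where J_k = J_source · inverse(J_driving). The function names and the tuple‑based 2x2 matrices are illustrative assumptions, not code from the model's repository, which works on batched tensors.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular Jacobian")
    return ((d / det, -b / det), (-c / det, a / det))


def matmul2(m, n):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def matvec2(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def local_motion(z, kp_source, kp_driving, jac_source, jac_driving):
    """Map a driving-frame point z to source-frame coordinates near one keypoint.

    Hypothetical helper: applies kp_source + J_k (z - kp_driving)
    with J_k = jac_source @ inverse(jac_driving).
    """
    j_k = matmul2(jac_source, inv2(jac_driving))
    offset = (z[0] - kp_driving[0], z[1] - kp_driving[1])
    moved = matvec2(j_k, offset)
    return (kp_source[0] + moved[0], kp_source[1] + moved[1])
```

With identity Jacobians the mapping reduces to a pure translation by the keypoint displacement; the Jacobians add local rotation, scaling and shear around each keypoint.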
The model was trained and tested on four datasets: VoxCeleb, UvA‑Nemo, the BAIR robot‑pushing dataset, and a custom‑collected dataset. VoxCeleb contains around 100,000 audio clips from 1,251 celebrities, balanced by gender and covering diverse accents, professions, and ages.
Environment Setup
Install the required third‑party libraries using the provided requirements.txt file:
python -m pip install -r requirements.txt
Then configure ffmpeg (download it from the official ffmpeg site) and add it to your system PATH.
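Before running anything, it can help to confirm that ffmpeg is actually reachable from Python. The following is a small sketch using only the standard library; the helper names find_ffmpeg and ffmpeg_version are made up for this example.

```python
import shutil
import subprocess


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def ffmpeg_version(executable="ffmpeg"):
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is missing."""
    path = find_ffmpeg(executable)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "ffmpeg not found on PATH")
```

If the script prints "ffmpeg not found on PATH", revisit the PATH configuration before continuing, since the audio steps later in this guide shell out to ffmpeg.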
Implementation
The project Real Time Image Animation uses the First Order Motion Model to animate a static image based on a driving video. Below are the essential code snippets.
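The main script calls normalize_kp with use_relative_movement=True. As far as that option's name and usage suggest, the idea is to transfer the driving keypoints' displacement since the first frame onto the source keypoints, rather than copying absolute positions. The sketch below illustrates that idea in pure Python on plain (x, y) tuples; it is a simplified illustration with a hypothetical function name and a hand‑supplied scale, not the repository's implementation.

```python
def transfer_relative_motion(kp_source, kp_driving, kp_driving_initial, scale=1.0):
    """Shift each source keypoint by the driving keypoint's movement since frame 0.

    All arguments are equal-length lists of (x, y) tuples; `scale` stands in
    for the movement-scale adaptation and is an assumption of this sketch.
    """
    return [
        (sx + scale * (dx - ix), sy + scale * (dy - iy))
        for (sx, sy), (dx, dy), (ix, iy)
        in zip(kp_source, kp_driving, kp_driving_initial)
    ]
```

On the first driving frame the displacement is zero, so the source keypoints come back unchanged. This is why the driving video works best when its first frame shows a pose similar to the source image.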
Utility Functions
import os
import subprocess


def video2mp3(file_name):
    """Extract the audio track of a video into an MP3 file and return its path."""
    outfile_name = os.path.splitext(file_name)[0] + '.mp3'
    # Pass the arguments as a list so paths with spaces need no shell quoting.
    subprocess.call(['ffmpeg', '-i', file_name, '-f', 'mp3', outfile_name])
    return outfile_name


def video_add_mp3(file_name, mp3_file):
    """Add an MP3 audio track to a video file, writing <name>-f.mp4."""
    outfile_name = os.path.splitext(file_name)[0] + '-f.mp4'
    subprocess.call(['ffmpeg', '-i', file_name, '-i', mp3_file,
                     '-strict', '-2', '-f', 'mp4', outfile_name])
Main Script
# Assumes the utility functions above are defined in the same file.
import argparse

import cv2
import imageio
import numpy as np
import torch
from skimage import img_as_ubyte
from skimage.transform import resize

from animate import normalize_kp
from demo import load_checkpoints

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input_image", required=True, help="Path to image to animate")
ap.add_argument("-c", "--checkpoint", required=True, help="Path to checkpoint")
ap.add_argument("-v", "--input_video", required=False, help="Path to video input")
args = vars(ap.parse_args())

source_path = args['input_image']
checkpoint_path = args['checkpoint']
video_path = args.get('input_video')


def to_tensor(image):
    """Convert an HxWx3 float image into a 1x3xHxW CUDA tensor."""
    return torch.tensor(image[np.newaxis].astype(np.float32)).permute(0, 3, 1, 2).cuda()


# Load the source image as RGB and scale it to the model's 256x256 input.
source_image = imageio.imread(source_path)
source_image = resize(source_image, (256, 256))[..., :3]

generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml',
                                          checkpoint_path=checkpoint_path)

os.makedirs('output', exist_ok=True)

# Read from the driving video, or from the default webcam if none is given.
cap = cv2.VideoCapture(video_path if video_path else 0)
fps = cap.get(cv2.CAP_PROP_FPS)
# The writer's frame size must match the 256x256 frames written below;
# OpenCV drops frames whose size differs from the one declared here.
size = (256, 256)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')  # codec tag suited to the .mp4 container
out1 = cv2.VideoWriter('output/test.mp4', fourcc, fps, size, True)

source = to_tensor(source_image)
kp_driving_initial = None

with torch.no_grad():  # inference only, so no gradients are needed
    kp_source = kp_detector(source)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.flip(frame, 1)
        # OpenCV decodes frames as BGR; convert to RGB to match the source image.
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        driving_frame = to_tensor(resize(frame, (256, 256))[..., :3])
        kp_driving = kp_detector(driving_frame)
        if kp_driving_initial is None:
            # Keypoints of the first frame are the reference for relative motion.
            kp_driving_initial = kp_driving
        kp_norm = normalize_kp(kp_source=kp_source, kp_driving=kp_driving,
                               kp_driving_initial=kp_driving_initial,
                               use_relative_movement=True, use_relative_jacobian=True,
                               adapt_movement_scale=True)
        out = generator(source, kp_source=kp_source, kp_driving=kp_norm)
        pred = np.transpose(out['prediction'].data.cpu().numpy(), [0, 2, 3, 1])[0]
        pred_bgr = cv2.cvtColor(pred, cv2.COLOR_RGB2BGR)
        out1.write(img_as_ubyte(pred_bgr))

cap.release()
out1.release()

if video_path:
    # Copy the driving video's audio onto the result (written to output/test-f.mp4).
    mp3_path = video2mp3(video_path)
    video_add_mp3('output/test.mp4', mp3_path)
Running the Demo
Download the pretrained weights, video and image assets (a packaged zip is provided) and execute:
python image_animation.py -i path_to_input_file -c path_to_checkpoint -v path_to_video_file
For a quick test you can run:
python image_animation.py -i Inputs/trump2.png -c checkpoints/vox-cpk.pth.tar -v 1.mp4
The resulting animated video is saved in the output directory as test.mp4. When a driving video is supplied, a second copy with the driving video's audio is written as test-f.mp4.
Conclusion
The First Order Motion Model enables fast, GPU‑accelerated animation of static images, and with the provided scripts you can combine the generated video with audio using ffmpeg.