How AI Powers Fancy Video Generation for Real‑World POI Scenes

This article details the AI techniques behind Gaode's "Street Ranking" project, explaining the Fancy video concept, the dual training and production pipelines, and the use of SFT, reinforcement learning, MoE‑LoRA, distribution‑matching distillation, and quality filtering to achieve 25× faster generation with high aesthetic fidelity.

Amap Tech

Fancy Video Definition and Demo

In Gaode's Street Ranking, static images cannot fully convey a POI's characteristics. Fancy Video was therefore defined to extract high‑dynamic elements from a single image, render physically plausible motion, and apply aesthetic camera work, improving both information transmission and emotional impact.

Core Concept: What Is Fancy?

Fancy Video is more than animating a picture; it is a scene‑level visual enhancement that must obey real‑world physics while adopting a photographer’s aesthetic perspective within a short duration (≈5 s). The generation formula is:

Fancy = High‑Potential Composition + (Physical Realism × Camera Aesthetics × Atmosphere Enhancement)

High‑Potential Composition: A multimodal model selects images with clear composition, rich colors, and dynamic potential (e.g., liquids, smoke, light).

Physical Realism: Motion follows basic physics, such as gravity for fluids and diffusion for smoke, avoiding AI‑hallucinated artifacts.

Camera Aesthetics: Photographer‑style framing and camera movement (e.g., subtle push‑ins and pans) executed within the clip's short duration.

Atmosphere Enhancement: Simulated depth‑of‑field and lighting amplify the focal subject.
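
The formula can be read as a gating relation: without the multiplicative factors, only the compositional potential survives. A toy scoring sketch (the [0, 1] sub‑scores and this exact combination are illustrative assumptions, not the project's actual metric):

```python
def fancy_score(composition: float, physics: float,
                aesthetics: float, atmosphere: float) -> float:
    """Toy Fancy score: composition contributes additively, while
    physical realism, camera aesthetics, and atmosphere gate each
    other multiplicatively. Inputs assumed normalized to [0, 1]."""
    return composition + physics * aesthetics * atmosphere

# A frame with strong composition but zero physical realism keeps
# only its compositional potential.
print(fancy_score(0.8, 0.0, 0.9, 0.7))  # 0.8
print(fancy_score(0.8, 0.9, 0.9, 0.7))
```

The multiplicative gating means a single failed factor (e.g., a physics violation) collapses the dynamic contribution entirely, mirroring the admission standards described below.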

Three Vertical Scene Definitions

The project targets three high‑frequency POI categories, each with specific visual goals and dynamic expressions.

Food: Restore texture and temperature; showcase steam, sizzling oil, and micro‑focus on food details.

Scenery: Convey immersion by animating clouds, waterfalls, and natural light rhythms.

Hotels: Convey tranquility and luxury through subtle camera moves, gentle lighting, and spatial depth.

Admission Standards: From Baseline to Fancy

Two‑tier quality checks ensure consistency and appeal.

Baseline: Stable frames, no obvious artifacts, consistent subject.

Fancy: In addition to baseline, the video must contain dynamic elements that boost attractiveness, such as natural steam or bubbling broth for a hot‑pot scene.
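
Under these standards, admission reduces to a two‑tier check. A minimal sketch, assuming a simple dict schema for per‑video check results (the field names are illustrative):

```python
def admit(video: dict) -> str:
    """Two-tier admission: Baseline gates stability and consistency;
    Fancy additionally requires attractive dynamic elements."""
    baseline = (video["frames_stable"]
                and not video["has_artifacts"]
                and video["subject_consistent"])
    if not baseline:
        return "reject"
    return "fancy" if video["dynamic_elements"] else "baseline"

hotpot = {"frames_stable": True, "has_artifacts": False,
          "subject_consistent": True,
          "dynamic_elements": ["steam", "bubbling broth"]}
print(admit(hotpot))  # fancy
```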

Model SFT (Supervised Fine‑Tuning)

Data collection follows scene‑specific logic, emphasizing high‑dynamic moments (e.g., steam, oil shimmer). The Qwen2.5‑VL multimodal model annotates both static attributes and dynamic behaviors (e.g., "clockwise rotation", "steam rising").
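
As a sketch of what such an annotation might feed into training (the record schema and caption template below are assumptions, not the project's actual format), static attributes and dynamic behaviors can be folded into a single motion‑aware caption:

```python
# Illustrative annotation record pairing static attributes with
# dynamic behaviors; the schema is assumed, not the project's own.
annotation = {
    "static": {"subject": "hot pot", "composition": "centered",
               "colors": ["red", "amber"]},
    "dynamic": ["steam rising", "broth bubbling", "clockwise rotation"],
}

# Combine both annotation types into one training caption.
caption = (f"A {annotation['static']['composition']} shot of a "
           f"{annotation['static']['subject']}, "
           + ", ".join(annotation["dynamic"]))
print(caption)
```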

Two‑stage fine‑tuning is applied:

Stage 1 – Full‑Parameter Fine‑Tuning (Domain Alignment)

Goal: Adapt a generic video model to real‑world POI distributions.

Method: Unlock all parameters and fit on tens of thousands of high‑quality, realistic videos.

Benefit: Aligns the model with physical camera optics and reduces AI‑style artifacts.

Stage 2 – MoE‑LoRA Hybrid Expert Fine‑Tuning (Aesthetic Capture)

Mixture‑of‑Experts (MoE) LoRA adapters are created for each vertical (food, scenery, hotel). A router network detects the input image’s semantics and activates the corresponding expert, enabling zero‑switch inference without manual LoRA selection.

Specificity: Experts operate independently, avoiding over‑fitting across disparate scenes.

Zero‑Switch Inference: The system automatically blends expert weights at millisecond latency.

Performance: Maintains low training cost while delivering superior aesthetic quality.
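
A minimal sketch of the routing idea: a softmax router turns per‑expert semantic scores into blending weights, so no manual LoRA switch is needed. The logits would come from an image encoder in practice, and the 1‑D "deltas" are stand‑ins for LoRA weight updates:

```python
import math

EXPERTS = ["food", "scenery", "hotel"]

def route(logits):
    """Softmax router: convert per-expert semantic logits into
    normalized blending weights (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def blend(expert_deltas, weights):
    """Weighted sum of per-expert LoRA deltas (toy 1-D vectors)."""
    return [sum(w * d[i] for w, d in zip(weights, expert_deltas))
            for i in range(len(expert_deltas[0]))]

w = route([3.2, 0.1, -1.0])  # an image scored mostly as "food"
print(dict(zip(EXPERTS, [round(x, 3) for x in w])))
```

Because the router's output is a convex combination, an ambiguous image (say, a restaurant interior with plated food) can draw on more than one expert instead of forcing a hard choice.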

Model RL (Reinforcement Learning)

To align generation with the Fancy standards, the Group Policy Gradient (GPG) algorithm replaces PPO/GRPO, offering a minimalist gradient‑only approach that eliminates the critic network and KL constraints.

Group Sampling: For a given prompt (e.g., "boiling hot‑pot"), multiple videos are generated in parallel.

Direct Gradient Optimization: GPG estimates gradients from relative advantages within the group.

Efficiency: Reduces memory usage and training instability.

The RL loop incorporates a self‑developed multimodal reward model trained on the GenVID expert dataset, which evaluates videos on physics, visual cleanliness, lighting coherence, narrative focus, and affective resonance.
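
The group‑relative idea can be sketched in a few lines: score each sampled video with a weighted sum over the reward model's dimensions, then subtract the group mean, so no critic network is required. The dimension weights and scores below are illustrative, not the GenVID model's actual values:

```python
def weighted_reward(scores, weights):
    """Aggregate per-dimension reward-model scores into a scalar."""
    return sum(scores[k] * weights[k] for k in weights)

def group_advantages(rewards):
    """GPG-style relative advantage: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Illustrative dimension weights mirroring the evaluation axes.
DIMS = {"physics": 0.3, "cleanliness": 0.2, "lighting": 0.2,
        "narrative": 0.15, "affect": 0.15}
group = [  # two sampled videos for the same prompt
    {"physics": 0.9, "cleanliness": 0.8, "lighting": 0.7,
     "narrative": 0.6, "affect": 0.8},
    {"physics": 0.4, "cleanliness": 0.9, "lighting": 0.6,
     "narrative": 0.5, "affect": 0.4},
]
rewards = [weighted_reward(s, DIMS) for s in group]
print(group_advantages(rewards))
```

Advantages always sum to zero within a group, so only the relative ordering of samples drives the gradient, which is what makes the critic and KL terms dispensable.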

Model Distillation (DMD – Distribution Matching Distillation)

To meet massive production demands, the 50‑step generation pipeline is compressed to 4 steps using DMD with GAN loss. Fancy videos serve as the real‑distribution target, while a student model learns to preserve high‑dynamic and atmospheric cues.

Speed & Quality: Production time drops from ~100 min to ~4 min per video, and inference cost drops 12.5×.

Backward Simulation: Guarantees stability when reducing steps.
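
The reported figures are mutually consistent, as a quick sanity check shows: the 12.5× inference‑cost cut matches the 50‑to‑4 step reduction exactly, while wall‑clock time falls 25×:

```python
teacher_steps, student_steps = 50, 4
step_speedup = teacher_steps / student_steps  # per-sample denoising cost
wallclock_speedup = 100 / 4                   # ~100 min -> ~4 min per video
print(step_speedup, wallclock_speedup)  # 12.5 25.0
```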

Inference Optimizations

Test‑time scaling and Classifier‑Free Guidance (CFG) are applied to handle long‑tail, semantically complex inputs. Semantic distance is measured; high‑difficulty prompts trigger expanded sampling and adaptive reward weighting, improving alignment without large computational overhead.
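
A minimal sketch of the trigger logic, assuming a semantic‑distance score normalized to [0, 1] and illustrative thresholds and budgets (none of these numbers come from the project):

```python
def sampling_budget(distance: float, base: int = 4,
                    cap: int = 16, threshold: float = 0.6) -> int:
    """Expand candidate sampling only for semantically hard prompts.
    The distance metric, threshold, and budgets are illustrative."""
    if distance <= threshold:
        return base
    # Grow the budget with how far the prompt sits outside the
    # well-covered semantic region, up to a hard cap.
    extra = round((distance - threshold) / (1.0 - threshold) * (cap - base))
    return min(cap, base + extra)

print(sampling_budget(0.3), sampling_budget(0.9))  # 4 13
```

Keeping the default budget small is what keeps the computational overhead low: only the long‑tail prompts pay for the extra sampling.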

Quality Filter (Automatic Rough Screening)

A multimodal large‑language model (based on Qwen‑VL) is fine‑tuned to classify videos as Fancy or Not Fancy and to provide dimension‑wise feedback (e.g., "excessive watermark", "physics violation"). The system achieves >98% precision at 95% recall, reducing manual review by over 20×.
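
The rough‑screening stage can be sketched as a classify‑and‑route loop, with a toy rule‑based judge standing in for the fine‑tuned Qwen‑VL model (the input fields and reason strings are illustrative):

```python
def screen(videos, classify):
    """Keep videos judged Fancy; route the rest, with their
    dimension-wise feedback, to a rejection bucket for diagnostics."""
    kept, rejected = [], []
    for v in videos:
        verdict, reasons = classify(v)
        (kept if verdict == "fancy" else rejected).append((v, reasons))
    return kept, rejected

def toy_judge(v):
    """Stand-in for the multimodal judge model."""
    if v.get("watermark", 0) > 0.5:
        return "not_fancy", ["excessive watermark"]
    if not v.get("physics_ok", True):
        return "not_fancy", ["physics violation"]
    return "fancy", []

kept, rej = screen([{"id": 1}, {"id": 2, "watermark": 0.9}], toy_judge)
print(len(kept), len(rej))  # 1 1
```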

End‑to‑End Automated Deployment

The pipeline integrates data ingestion, model inference, quality filtering, and automated publishing. Unique keys map various ranking lists (daily, yearly, SKU, SPU) to a unified database, enabling scalable upload and withdrawal without manual intervention.
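
The unique‑key idea can be sketched as a simple composite key; the field layout below is an assumption, not the project's schema:

```python
def ranking_key(list_type: str, period: str, poi_id: str) -> str:
    """Compose a unique key mapping a ranking entry (daily, yearly,
    SKU, SPU lists) to one row in the unified database."""
    return f"{list_type}:{period}:{poi_id}"

# The same POI on two different lists maps to distinct rows, so
# automated upload and withdrawal stay independent per list.
print(ranking_key("daily", "2024-06-01", "poi_123"))
print(ranking_key("yearly", "2024", "poi_123"))
```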

Summary

The Gaode Street Ranking project demonstrates a full stack of generative‑AI techniques—full‑parameter fine‑tuning, MoE‑LoRA expert routing, minimalist RL with GPG, distribution‑matching distillation, and multimodal quality filtering—to transform static POI images into high‑fidelity Fancy videos. The combined innovations yield a 25× boost in production efficiency and a 12.5× reduction in inference cost while preserving strict physical realism and aesthetic standards, proving that AI can reliably serve real‑world visual commerce at scale.

[Figure: Data Processing Pipeline]
[Figure: Model Training and Structure Example]
[Figure: Flow‑GRPO Method Diagram]
[Figure: DMD Distillation Example]
[Figure: S²‑Guidance Example]
Tags: model fine-tuning, multimodal, reinforcement learning, distillation, AI video generation
Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
