How AniME Automates Long‑Form Animation with a Director‑Driven Multi‑Agent AI Framework

AniME introduces a director‑driven multi‑agent system that combines Model Context Protocol (MCP) based tool selection with the open‑source AniSora V3 model to automatically generate consistent, high‑quality long‑form animation from story scripts, handling everything from storyboard creation to video editing and quality evaluation.

Introduction

Traditional animation pipelines involve many labor‑intensive stages—script writing, storyboarding, character and scene design, animation, dubbing, and final editing—requiring large teams and long production cycles. Recent generative AI models such as AniSora have shown promise but still struggle with consistency and fine‑grained control across long videos.

AniSora V3: An Open‑Source All‑in‑One Model

AniSora V3, the latest open‑source animation video generation model, generates a 5‑second 360p clip in 30 seconds on a single RTX 4090 GPU, or in 8 seconds on eight A800 GPUs. It improves dynamics, visual quality, and prompt compliance, and adds multimodal interaction capabilities tailored to animation workflows.

360° Character Portrait Generation

The model can generate a full 360° character view from a single front portrait, enabling consistent multi‑view character assets.

Arbitrary Frame Guidance

Building on V1, V3 enhances prompt compliance for arbitrary frame guidance, allowing video generation from any chosen start, end, or intermediate frame based on the story.
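
As a rough illustration, a frame‑guided request could pin chosen frames as conditions at arbitrary positions; this is a minimal sketch, and the field names are assumptions rather than the actual AniSora API:

{
  "tool": "frame_guided_video_generation",
  "prompt": "The hero turns and walks toward the gate at dusk",
  "guidance_frames": [
    {"position": "start", "image": "assets/shot_07_first.png"},
    {"position": "frame_48", "image": "assets/shot_07_mid.png"},
    {"position": "end", "image": "assets/shot_07_last.png"}
  ]
}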

Ultra Low‑Resolution Super‑Resolution

The model supports upscaling from 90p to 720p/1080p, producing richer detail at reduced inference cost.

AniME Architecture

AniME decomposes the story‑to‑video task into hierarchical stages managed by a Director Agent and several Specialized Agents. Each agent has defined input/output types and a Model Context Protocol (MCP) toolbox. Agents communicate via structured JSON messages.
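
Beyond the payload itself, such structured messages would plausibly carry routing and dependency metadata so the Director can schedule them; the following envelope is a hypothetical sketch, with every field name assumed for illustration:

{
  "sender": "director",
  "recipient": "character_designer",
  "task_id": "task_0012",
  "depends_on": ["task_0008"],
  "payload": {
    "character": "Ye",
    "views": ["front", "side", "back"]
  }
}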

Director Agent and Multi‑Agent Collaboration Process

The Director Agent acts as the central controller, splitting a long story into scenes and shots, determining visual and acoustic styles, generating an initial task list, and building a workflow graph that encodes dependencies. It also maintains a global Asset Memory Bank that stores approved assets (characters, scenes, styles), ensuring cross‑shot consistency.
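
A workflow‑graph node and an Asset Memory Bank entry might be serialized along these lines; the schema below is a hypothetical sketch of the data structures described above, not taken from the paper:

{
  "node": "scene_02_shot_03_animate",
  "agent": "animator",
  "depends_on": ["scene_02_shot_03_keyframes", "char_YX_design"],
  "status": "pending"
}

{
  "asset_id": "char_YX",
  "type": "character",
  "approved": true,
  "views": ["assets/char_YX_front.png", "assets/char_YX_side.png"]
}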

Specialized Agents and MCP Mechanism

Seven specialized agents are defined:

Script & Storyboard Agent: Parses narrative text, performs camera planning, and generates keyframes using appropriate image‑generation tools.

Character Designer: Generates multi‑view character images from text prompts, ensuring identity consistency.

Scene Designer: Produces layered background assets with layout‑guided or depth‑guided generation.

Animator: Synthesizes motion sequences from keyframes, poses, and camera trajectories, using frame‑guided video diffusion models.

Audio Production Agent: Generates speech (TTS) and music, then mixes them.

Video Editor Agent: Assembles all assets into a final video with automated editing and FFmpeg encoding.

Quality Evaluator Agent: Scores each stage using multimodal metrics (text‑to‑video similarity, identity verification, audio‑visual alignment) and feeds results back to the Director for possible re‑generation.
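
A Quality Evaluator report sent back to the Director might look like the following; the metric names mirror those listed above, while the schema and values are illustrative assumptions:

{
  "shot_id": "scene_YX01_shot_01",
  "scores": {
    "text_video_similarity": 0.82,
    "identity_verification": 0.91,
    "audio_visual_alignment": 0.77
  },
  "verdict": "regenerate",
  "failed_metric": "audio_visual_alignment"
}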

The MCP mechanism lets each specialized agent autonomously select the most suitable model/toolbox based on the task context, outputting structured JSON such as:

{
  "shot_id": "scene_YX01_shot_01",
  "tool": "reference_image_generation",
  "prompt": "Ye holding a blue-and-white porcelain cup, tilting head to drink",
  "reference_images": ["assets/char_YX_front.png"]
}

Similar JSON examples illustrate layout‑guided generation and bounding‑box specifications for subsequent frames.
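
For instance, a layout‑guided message could specify normalized bounding boxes per element; this sketch follows the format of the example above, but the tool name and box fields are assumptions rather than quotations from the paper:

{
  "shot_id": "scene_YX01_shot_02",
  "tool": "layout_guided_generation",
  "prompt": "Ye sets the cup on the table as the camera pulls back",
  "layout": [
    {"element": "char_YX", "bbox": [0.30, 0.20, 0.55, 0.85]},
    {"element": "porcelain_cup", "bbox": [0.52, 0.60, 0.60, 0.72]}
  ]
}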

Results

AniME has been used internally to generate end‑to‑end anime content from novel excerpts. The system produces coherent long‑form videos, as demonstrated in the linked Bilibili video and visualized in the architecture and workflow diagrams.

Conclusion

The paper presents AniME, a director‑driven multi‑agent framework that leverages an MCP‑based model selection mechanism to achieve fully automated, consistent, and high‑quality long‑form animation generation from textual stories.

References

1. Anthropic. 2024. Introducing the Model Context Protocol. http://www.anthropic.com/news/model-context-protocol

2. Chenpeng Du et al. 2025. VALL‑T: Decoder‑only generative transducer for robust and decoding‑controllable text‑to‑speech. In ICASSP.

3. Yudong Jiang et al. 2024. AniSora: Exploring the frontiers of animation video generation in the Sora era. arXiv:2412.10255.

4. Yunxin Li et al. 2024. Anim‑director: A large multimodal model powered agent for controllable animation video generation. In SIGGRAPH Asia.

5. Navonil Majumder et al. 2024. Tango 2: Aligning diffusion‑based text‑to‑audio generations through direct preference optimization. In ACM MM.

6. Haoyuan Shi et al. 2025. AniMaker: Automated Multi‑Agent Animated Storytelling with MCTS‑Driven Clip Generation. arXiv:2506.10540.

7. Weijia Wu et al. 2025. Automated movie generation via multi‑agent COT planning. arXiv:2503.07314.

8. Haotian Xia et al. 2025. StoryWriter: A Multi‑Agent Framework for Long Story Generation. arXiv:2506.16445.

9. Ling Yang et al. 2024. Mastering Text‑to‑Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs. In ICML.
