How OmniLottie Turns Text and Images into High‑Quality Vector Animations

OmniLottie, a collaborative framework from Fudan University, the University of Hong Kong, and the University of Queensland, pairs a specialized tokenizer with a large multimodal model to compress Lottie files and generate vector animations from text, images, or video, achieving state-of-the-art results on custom benchmarks and extensive evaluations.


OmniLottie is an open‑source ecosystem that combines AI large‑model services, algorithms, and compute to generate lightweight vector animations directly from natural language or images. The framework was jointly developed by researchers from Fudan University, the Multimodal Lab at the University of Hong Kong, and the University of Queensland.

Traditional Pain Points

Typical digital content creation relies on two types of animation: bitmap video, which is large and loses quality when scaled, and vector animation, which stores geometry and motion as mathematical formulas, offering scalability and small file size. While the Lottie format is popular for its portability, its JSON representation is verbose, containing a lot of structural metadata that does not contribute to visual or motion information. This verbosity makes it difficult for generative models to produce valid Lottie code efficiently, leading to wasted compute on bracket matching and code alignment rather than on visual dynamics.
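To see where the verbosity comes from, consider a hypothetical, heavily trimmed Lottie‑style fragment. The key names ("ty", "ks", "k", and so on) follow real Lottie conventions, but the fragment itself is illustrative; most of its characters encode structure rather than motion.

```python
import json

# A hypothetical, heavily trimmed Lottie-style fragment. Most keys
# ("ty", "ind", "ip", "op", ...) are structural metadata; only the
# keyframe values at the leaves carry visual or motion information.
fragment = {
    "v": "5.7.4", "fr": 30, "ip": 0, "op": 48, "w": 512, "h": 512,
    "layers": [{
        "ty": 4, "ind": 1, "ip": 0, "op": 48, "st": 0,
        "ks": {                                      # transform block
            "o": {"a": 0, "k": 100},                 # static opacity
            "r": {"a": 1, "k": [                     # animated rotation
                {"t": 0, "s": [0]}, {"t": 48, "s": [360]},
            ]},
            "p": {"a": 0, "k": [256, 256, 0]},       # static position
        },
    }],
}

text = json.dumps(fragment, separators=(",", ":"))
payload = sum(ch.isdigit() for ch in text)
print(f"{len(text)} characters total; only {payload} are numeric payload")
```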

Breakthrough Method

The team redesigned the underlying data representation and built a dedicated tokenizer for Lottie files. The tokenizer strips unnecessary structural metadata and extracts only the attributes tightly coupled with animation, converting continuous numeric parameters into discrete symbols. This results in a compact linear code that preserves flexibility while dramatically reducing length.
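The article does not publish the tokenizer itself, so the following Python sketch only illustrates the two core ideas under assumed details (the token format, the 256‑bin resolution, and the attribute set are all placeholders): drop structural metadata, keep the animation‑coupled attributes, and quantize continuous values into discrete symbols.

```python
N_BINS = 256                      # assumed quantization resolution
KEPT = ("p", "r", "s", "o")       # position, rotation, scale, opacity

def quantize(value: float, lo: float, hi: float) -> int:
    """Map a continuous value onto one of N_BINS discrete symbols."""
    ratio = (min(max(value, lo), hi) - lo) / (hi - lo)
    return round(ratio * (N_BINS - 1))

def tokenize_transform(ks: dict) -> list[str]:
    """Flatten a Lottie transform block into a linear token stream,
    dropping structural metadata such as layer indices and in/out
    points. A single 0-512 value range is used for brevity; a real
    tokenizer would quantize each attribute over its own range."""
    tokens = []
    for attr in KEPT:
        if attr not in ks:
            continue
        tokens.append(f"<{attr}>")
        node = ks[attr]
        # Static properties become a single keyframe at t=0.
        keyframes = node["k"] if node.get("a") else [{"t": 0, "s": node["k"]}]
        for kf in keyframes:
            tokens.append(f"<t{quantize(kf['t'], 0, 16)}>")  # shared 0-16 time range
            values = kf["s"] if isinstance(kf["s"], list) else [kf["s"]]
            tokens += [f"<v{quantize(v, 0, 512)}>" for v in values]
    return tokens

ks = {"o": {"a": 0, "k": 100},
      "r": {"a": 1, "k": [{"t": 0, "s": [0]}, {"t": 16, "s": [360]}]}}
print(tokenize_transform(ks))
# ['<r>', '<t0>', '<v0>', '<t255>', '<v179>', '<o>', '<t0>', '<v50>']
```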

On the model side, a pre‑trained multimodal large model (Qwen2.5‑VL) serves as the core. A custom vocabulary enables the model to ingest text, images, or video and sequentially predict the compact token stream. The tokenizer then reconstructs the token stream into a standard Lottie animation that can be played on any device.
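The inverse direction, reconstructing playable JSON from the predicted token stream, might look like the sketch below. It assumes the same illustrative token scheme as above and is not the authors' decoder; for simplicity it rebuilds every property as an animated one with explicit keyframes.

```python
import json, re

def detokenize(tokens: list[str]) -> dict:
    """Rebuild a Lottie transform block from the compact token stream,
    restoring the structural JSON the tokenizer stripped away."""
    def dequant(sym: int, lo: float, hi: float) -> float:
        return lo + sym / 255 * (hi - lo)

    ks, keyframes = {}, None
    for tok in tokens:
        if m := re.fullmatch(r"<([a-z])>", tok):       # attribute marker
            keyframes = []
            ks[m.group(1)] = {"a": 1, "k": keyframes}
        elif m := re.fullmatch(r"<t(\d+)>", tok):      # timestamp symbol
            keyframes.append({"t": dequant(int(m.group(1)), 0, 16), "s": []})
        elif m := re.fullmatch(r"<v(\d+)>", tok):      # value symbol
            keyframes[-1]["s"].append(dequant(int(m.group(1)), 0, 512))
    return ks

stream = ["<r>", "<t0>", "<v0>", "<t255>", "<v179>"]
print(json.dumps(detokenize(stream), indent=1))
```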

Dataset Construction

To train the system, the researchers built a multimodal vector‑animation dataset called MMLottie‑2M, containing roughly two million curated Lottie files. They harvested raw files from several major platforms, removed non‑animation assets (e.g., unrelated images, audio, or proprietary code), and wrote scripts to clean up the files' hierarchical structure. For richer motion patterns, they leveraged the OmniSVG library of static images and extracted key‑frame transformations (rotation, scale, position, opacity) from one million real files, clustering similar trajectories into reusable motion templates. All assets were normalized to 512×512 resolution and a 0‑16 timestamp range, as sketched below.
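A minimal sketch of that normalization step, with assumed field names and handling only the position property for brevity (rotation, scale, and opacity would be remapped analogously):

```python
def normalize(lottie: dict, size: float = 512.0, t_max: float = 16.0) -> dict:
    """Rescale one animation onto a 512x512 canvas and remap its
    timeline onto the shared 0-16 timestamp range."""
    sx, sy = size / lottie["w"], size / lottie["h"]
    ts = t_max / (lottie["op"] - lottie["ip"])          # timeline scale
    for layer in lottie.get("layers", []):
        p = layer["ks"]["p"]                            # position property
        if not p["a"]:                                  # static position
            p["k"] = [p["k"][0] * sx, p["k"][1] * sy, *p["k"][2:]]
        else:                                           # animated: rescale every keyframe
            for kf in p["k"]:
                kf["t"] = (kf["t"] - lottie["ip"]) * ts
                kf["s"] = [kf["s"][0] * sx, kf["s"][1] * sy, *kf["s"][2:]]
    lottie.update(w=size, h=size, ip=0, op=t_max)
    return lottie

anim = {"w": 1024, "h": 1024, "ip": 0, "op": 60, "layers": [
    {"ks": {"p": {"a": 1, "k": [{"t": 0,  "s": [100, 200, 0]},
                                {"t": 60, "s": [900, 800, 0]}]}}}]}
print(normalize(anim)["layers"][0]["ks"]["p"]["k"])
# [{'t': 0.0, 's': [50.0, 100.0, 0]}, {'t': 16.0, 's': [450.0, 400.0, 0]}]
```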

Precise textual descriptions were generated for each animation using a coarse‑to‑fine strategy: first converting the animation to video and prompting a vision model to produce a high‑level summary of subject, color, and style, then adding frame‑level prompts that highlight shape and motion keywords.
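The article describes the strategy but not the prompts, so this sketch uses illustrative prompt text and a stand‑in `vlm` callable rather than any specific captioning model:

```python
COARSE_PROMPT = ("Summarize this animation's subject, dominant colors, "
                 "and overall style in one sentence.")
FINE_PROMPT = ("For this frame, list the visible shapes and the motion "
               "keywords (e.g., rotate, scale, slide) that describe it.")

def describe(frames: list, vlm) -> str:
    """Coarse pass over the whole clip, then fine-grained passes over
    sampled frames; concatenate into one precise description."""
    coarse = vlm(frames, COARSE_PROMPT)                    # pass 1: whole clip
    fine = [vlm([f], FINE_PROMPT) for f in frames[::8]]    # pass 2: every 8th frame
    return " ".join([coarse, *fine])

# Stub model so the sketch runs without a real VLM:
stub = lambda frames, prompt: f"[caption of {len(frames)} frame(s)]"
print(describe(list(range(16)), stub))
```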

Benchmarks and Results

The authors introduced MMLottie‑Bench, a benchmark comprising 450 high‑quality real animations collected from professional designers, plus synthetic instruction data generated by GPT‑4o, Gemini‑3.1‑Pro‑Vision, and Seedance 1.0. Models evaluated include OmniLottie, DeepSeekV3, Qwen2.5‑VL, GPT‑5, Recraft, AniClipart, and Livesketch. The evaluation metrics are FVD, CLIP similarity, and a dual‑score rubric (object matching and motion matching) rated by Claude‑3.5‑Sonnet on a 0‑10 scale.
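Of these metrics, CLIP similarity is the most straightforward to reproduce. One common recipe, not necessarily the paper's exact setup, embeds the prompt and a rendered animation frame with an off‑the‑shelf CLIP and takes their cosine similarity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, frame: Image.Image) -> float:
    """Cosine similarity between the text prompt and a rendered frame."""
    inputs = processor(text=[prompt], images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img, txt).item()

frame = Image.new("RGB", (512, 512), "white")  # placeholder rendered frame
print(clip_score("a red ball bouncing", frame))
```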

OmniLottie achieved an 88.3% success rate on text‑to‑animation tasks, outperforming DeepSeekV3 (9.3%) and GPT‑5 (12.7%). On image‑to‑animation, OmniLottie maintained a 93.3% success rate with smooth motion and preserved visual fidelity, while other tools suffered from low success or poor quality. For video‑to‑animation, OmniLottie reconstructed original scenes with high scores, whereas Gemini and Qwen series failed to produce usable code.

Additional ablation studies showed that mixing 30% static‑image‑derived data with the original dataset yielded the best performance; excessive synthetic data degraded motion matching. A comparison of raw large‑model decoding (Q), fine‑tuned code input (J), and the specialized tokenizer (T) demonstrated that the tokenizer raised success from 13.4% (J) to 97.3% (T) and significantly sped up inference.
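The 30% figure refers to the share of static‑image‑derived samples in the final training mix. A trivial sketch of that blending, with hypothetical sample names and helper:

```python
import random

def mix_dataset(real: list, synthetic: list,
                synth_share: float = 0.3, seed: int = 0) -> list:
    """Blend real Lottie samples with static-image-derived ones so the
    synthetic samples make up `synth_share` of the final mix."""
    n_synth = int(len(real) * synth_share / (1 - synth_share))
    mixed = real + synthetic[:n_synth]
    random.Random(seed).shuffle(mixed)
    return mixed

real = [f"real_{i}" for i in range(7)]
synthetic = [f"synth_{i}" for i in range(10)]
print(mix_dataset(real, synthetic))   # 7 real + 3 synthetic -> 30% synthetic
```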

Limitations and Future Work

Current sequential decoding occasionally generates invalid segments, especially for long animations with dozens of layers, due to token‑length constraints. The authors plan to incorporate reward‑based scoring and tighter integration with professional design software to make the technology more practical in production pipelines.

Overall, OmniLottie acts like a razor: by trimming away redundant code, it lets large models understand and generate dynamic vector graphics efficiently.

Tags: Lottie, AI, dataset, multimodal model, Vector Animation
Written by SuanNi

SuanNi is a community for AI developers that aggregates large-model development services, models, and compute power.