
Lance: A Lightweight 3B Multimodal AI Model that Handles Vision, Video, Generation, and Editing

Lance, an open‑source 3‑billion‑parameter multimodal model from ByteDance, unifies image and video understanding, generation, and editing in a single architecture, achieves top scores on VBench (85.11), MVBench (62.0), GenEval (0.90) and GEdit‑Bench (7.30), and demonstrates emergent cross‑task generalization.

Data Party THU

Jun 21, 2026

Problem

Existing multimodal models tend to specialize in either visual understanding or generation, and large‑scale models are costly to train and deploy. What is needed is a compact unified model that can understand, generate and edit both images and videos.

Architecture

Lance uses a dual‑stream mixture‑of‑experts (MoE) design. The understanding stream processes text tokens and semantic visual tokens for image/video comprehension, QA and reasoning. The generation stream processes VAE latent tokens for image/video generation and editing. Both streams share a common multimodal context while keeping their internal representations decoupled, enabling X2T, X2I and X2V tasks in a single forward pass.
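The routing idea can be illustrated with a toy sketch. This is not Lance's implementation: the `Token` class, the scalar "hidden states", the two expert functions and the mean used as shared context are all hypothetical stand-ins, chosen only to show tokens of different kinds being sent to separate stream parameters while reading the same context.

```python
from dataclasses import dataclass

# Token kinds named in the text; the string labels themselves are illustrative.
UNDERSTANDING_KINDS = {"text", "vit"}  # text tokens and semantic visual tokens
GENERATION_KINDS = {"vae"}             # VAE latent tokens


@dataclass
class Token:
    kind: str     # "text", "vit", or "vae"
    value: float  # scalar stand-in for a hidden vector


def understanding_expert(x: float, context: float) -> float:
    # Hypothetical stand-in for the understanding stream's own weights.
    return 2.0 * x + context


def generation_expert(x: float, context: float) -> float:
    # Hypothetical stand-in for the generation stream's own weights.
    return -1.0 * x + context


def dual_stream_layer(tokens: list[Token]) -> list[float]:
    # Shared multimodal context: every token sees a summary of the whole
    # sequence. A real model would use attention; the mean is a placeholder.
    context = sum(t.value for t in tokens) / len(tokens)
    outputs = []
    for t in tokens:
        if t.kind in UNDERSTANDING_KINDS:
            outputs.append(understanding_expert(t.value, context))
        elif t.kind in GENERATION_KINDS:
            outputs.append(generation_expert(t.value, context))
        else:
            raise ValueError(f"unknown token kind: {t.kind}")
    return outputs


sequence = [Token("text", 1.0), Token("vit", 2.0), Token("vae", 3.0)]
result = dual_stream_layer(sequence)
print(result)  # [4.0, 6.0, -1.0]
```

The point of the sketch is the separation of concerns: the context is computed once over all tokens, while each stream applies its own parameters, so the two representations stay decoupled.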

Modality‑Aware Rotary Positional Encoding (MaPE)

MaPE injects modality‑specific group identifiers into rotary positional encoding. It distinguishes three token groups: semantic ViT tokens (understanding), clean VAE tokens (generation conditions) and noisy VAE tokens (generation targets). This avoids the confusion between token groups that arises when ordinary positional encodings assign them overlapping positions.
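The article does not say how the group identifiers are injected. The sketch below assumes one simple possibility, shifting each group's positions by a fixed per-group offset before computing the usual rotary angles. The group ids, `GROUP_STRIDE` and both function names are hypothetical and are not taken from the Lance code.

```python
import math

# Hypothetical ids for the three token groups named in the text.
GROUP_IDS = {"vit_semantic": 0, "vae_clean": 1, "vae_noisy": 2}

# Assumed injection scheme: each group shifts its positions by a fixed
# offset. The stride value is arbitrary and only keeps the groups apart.
GROUP_STRIDE = 1000


def rotary_angles(position: int, group: str, dim: int,
                  base: float = 10000.0) -> list[float]:
    """Return one rotation angle per feature pair for a single token."""
    shifted = position + GROUP_IDS[group] * GROUP_STRIDE
    return [shifted / (base ** (2 * i / dim)) for i in range(dim // 2)]


def apply_rotary(vec: list[float], position: int, group: str) -> list[float]:
    """Rotate consecutive feature pairs of vec (even length) by those angles."""
    angles = rotary_angles(position, group, len(vec))
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


q = [1.0, 0.0, 1.0, 0.0]
same_pos_vit = apply_rotary(q, 5, "vit_semantic")
same_pos_noisy = apply_rotary(q, 5, "vae_noisy")
```

Under this assumption, two tokens at the same raw position but in different groups receive different rotations, so attention can tell a semantic ViT token from a noisy VAE token even when their indices coincide. Group 0 reduces to ordinary rotary encoding.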

Training strategy

Training proceeds in four stages: pre‑training, continual multi‑task training, supervised fine‑tuning and reinforcement learning. One notable finding is that adding more editing and subject‑driven generation data during the continual stage improves base generation quality, which suggests that multi‑task data can enhance rather than dilute generative ability.

Benchmark results

VBench: 85.11 (leading among unified models)

MVBench (video understanding): 62.0, an 11.3% relative gain over Show‑o2 7B

GenEval (image generation): 0.90 (tied for best overall)

GEdit‑Bench (editing): 7.30 Avg/G_O (highest among comparable models)
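As a reading aid for the MVBench line, the short calculation below shows what a relative gain means in absolute points. It uses only the two numbers quoted above; the implied baseline is derived arithmetic under the assumption that the 11.3% is measured against the Show‑o2 7B score, not a figure reported in this article.

```python
lance_mvbench = 62.0   # MVBench score quoted in the article
relative_gain = 0.113  # 11.3% relative gain quoted in the article

# relative_gain = (lance - baseline) / baseline
# => baseline = lance / (1 + relative_gain)
implied_baseline = lance_mvbench / (1 + relative_gain)
absolute_gain = lance_mvbench - implied_baseline

print(round(implied_baseline, 1), round(absolute_gain, 1))  # 55.7 6.3
```

In other words, an 11.3% relative gain at this score level corresponds to roughly six absolute points.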

Demonstrations

Video generation: Complex textual prompts produce videos with coherent motion, consistent temporal dynamics and clear visual details.

Video editing: A three‑round edit changes the hairstyle, adds a floral headband and replaces the background with a fairytale castle while preserving subject identity and motion continuity.

Image generation: Handles prompts requiring counting, attribute binding, spatial layout and style control, producing detailed images that follow the instructions.

Image editing: Supports subject addition/removal, local replacement, style transfer, motion adjustment and free‑form edits, all driven by natural‑language commands while maintaining subject identity and scene structure.

Understanding: Performs OCR, knowledge QA, multi‑image reasoning and fine‑grained video description, enabling video QA and temporal understanding.

References

Paper: https://arxiv.org/abs/2605.18678

Homepage: https://lance-project.github.io

GitHub: https://github.com/bytedance/Lance

Hugging Face: https://huggingface.co/bytedance-research/Lance

Reprinted with authorization from the AI media outlet QbitAI (WeChat official account ID: qbitai). Please contact the original source before republishing.

