Lance: A Lightweight 3B Multimodal AI Model that Handles Vision, Video, Generation, and Editing
Lance is an open‑source, 3‑billion‑parameter multimodal model from ByteDance that unifies image and video understanding, generation, and editing in a single architecture. It scores 85.11 on VBench, 62.0 on MVBench, 0.90 on GenEval and 7.30 on GEdit‑Bench, placing it at or near the top among comparable models, and it shows emergent cross‑task generalization.
Problem
Existing multimodal models tend to specialize in either visual understanding or generation, and large‑scale models are costly to train and deploy. This motivates a unified model of modest size that can understand, generate and edit both images and videos.
Architecture
Lance uses a dual‑stream mixture‑of‑experts (MoE) design. The understanding stream processes text tokens and semantic visual tokens for image/video comprehension, QA and reasoning. The generation stream processes VAE latent tokens for image/video generation and editing. Both streams share a common multimodal context while keeping their internal representations decoupled, so any‑to‑text (X2T), any‑to‑image (X2I) and any‑to‑video (X2V) tasks can all run in a single forward pass.
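The routing idea described above can be illustrated with a toy sketch. It is not Lance's implementation: the token kinds, the `STREAM_OF` table, the mean‑based `shared_context` and the two lambda "experts" are all hypothetical stand‑ins, chosen only to show tokens sharing one context while each stream applies its own parameters.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str     # "text", "vit", "vae_clean" or "vae_noisy" (assumed labels)
    value: float  # scalar stand-in for a hidden-state vector

# Assumed routing table: which stream handles which token kind.
STREAM_OF = {
    "text": "understanding",
    "vit": "understanding",
    "vae_clean": "generation",
    "vae_noisy": "generation",
}

def shared_context(tokens):
    """Toy stand-in for shared attention: the mean over the whole sequence."""
    return sum(t.value for t in tokens) / len(tokens)

# Toy experts: each stream has its own transform (stand-ins for separate weights).
EXPERTS = {
    "understanding": lambda h, ctx: h + ctx,
    "generation": lambda h, ctx: 2.0 * h + ctx,
}

def dual_stream_layer(tokens):
    """Every token reads the same context, then goes through its stream's expert."""
    ctx = shared_context(tokens)
    return [Token(t.kind, EXPERTS[STREAM_OF[t.kind]](t.value, ctx)) for t in tokens]

sequence = [Token("text", 1.0), Token("vit", 2.0), Token("vae_noisy", 3.0)]
outputs = dual_stream_layer(sequence)
```

In this sketch the context is computed once over all tokens, which mirrors the shared multimodal context, while the per‑stream transforms stay separate, which mirrors the decoupled internal representations.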
Modality‑Aware Rotary Positional Encoding (MaPE)
MaPE injects modality‑specific group identifiers into rotary positional encoding. It distinguishes three token groups: semantic ViT tokens (understanding), clean VAE tokens (generation conditions) and noisy VAE tokens (generation targets). With ordinary positional encodings, tokens from different groups can look alike to the model; the group identifiers remove that ambiguity.
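The article does not say how the group identifier enters the rotary encoding, so the following is only one possible reading, written as a minimal sketch. It assumes the identifier is applied as a per‑group offset on the position index before a standard rotary rotation; `GROUP_IDS`, `group_stride` and the function names are hypothetical and not taken from Lance.

```python
import math

# Hypothetical ids for the three groups named in the text.
GROUP_IDS = {"vit_semantic": 0, "vae_clean": 1, "vae_noisy": 2}

def rope_angles(position, dim, base=10000.0):
    """Standard rotary angles: one angle per pair of channels."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def rotate(vec, angles):
    """Rotate channel pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, a in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(a) - y * math.sin(a))
        out.append(x * math.sin(a) + y * math.cos(a))
    return out

def modality_aware_rope(vec, position, group, group_stride=4096):
    """Assumed scheme: shift the position by a per-group offset, then apply RoPE."""
    shifted = position + GROUP_IDS[group] * group_stride
    return rotate(vec, rope_angles(shifted, len(vec)))

query = [1.0, 0.0, 1.0, 0.0]
as_vit = modality_aware_rope(query, 5, "vit_semantic")
as_noisy = modality_aware_rope(query, 5, "vae_noisy")
```

Under this assumption, two tokens at the same spatial position but in different groups receive different rotations, while the vector norm is unchanged, as with plain rotary encoding.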
Training strategy
Training proceeds in four stages: pre‑training, continual multi‑task training, supervised fine‑tuning and reinforcement learning. One finding is that adding more editing and subject‑driven generation data during the continual stage improves base generation quality, which suggests that multi‑task data can enhance generative ability rather than dilute it.
Benchmark results
VBench (video generation): 85.11, leading among unified models
MVBench (video understanding): 62.0, an 11.3% relative gain over Show‑o2 7B
GenEval (image generation): 0.90, tied for best overall
GEdit‑Bench (image editing): 7.30 Avg/G_O, highest among comparable models
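To make the "relative gain" figure concrete, here is a small arithmetic sketch. The baseline it backs out (about 55.7) is implied by the two numbers in the list, not a score reported in this article, and the function names are illustrative only.

```python
def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score

def implied_baseline(new_score, gain):
    """Invert relative_gain: the baseline that yields `gain` for `new_score`."""
    return new_score / (1.0 + gain)

lance_mvbench = 62.0
reported_gain = 0.113  # 11.3% relative gain over Show-o2 7B, as stated above

# Derived value, not a reported one: roughly 55.7.
baseline = implied_baseline(lance_mvbench, reported_gain)
```

A relative gain divides the difference by the baseline, so 11.3% here corresponds to a gap of about 6.3 points on MVBench rather than 11.3 points.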
Demonstrations
Video generation: Complex textual prompts produce videos with coherent motion, consistent temporal dynamics and clear visual details.
Video editing: Three‑round editing changes hair style, adds a floral headband and replaces the background with a fairytale castle while preserving subject identity and motion continuity.
Image generation: Handles prompts requiring counting, attribute binding, spatial layout and style control, producing detailed images that follow the instructions.
Image editing: Supports subject addition/removal, local replacement, style transfer, motion adjustment and free‑form edits, all driven by natural‑language commands while maintaining subject identity and scene structure.
Understanding: Performs OCR, knowledge QA, multi‑image reasoning and fine‑grained video description, enabling video QA and temporal understanding.
References
Paper: https://arxiv.org/abs/2605.18678
Homepage: https://lance-project.github.io
GitHub: https://github.com/bytedance/Lance
Hugging Face: https://huggingface.co/bytedance-research/Lance
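This article is reprinted with permission from the AI media outlet QbitAI (WeChat official account ID: qbitai). To republish it, please contact the original source.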
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.