All‑In‑One Image & Video: ByteDance’s Deployable Native Multimodal Model Lance
Lance, ByteDance’s newly open‑sourced 3‑billion‑parameter multimodal model, runs on a single 40 GB GPU, tops HuggingFace trend charts, and achieves leading scores on DPG Bench, GenEval, and video generation benchmarks while surpassing several state‑of‑the‑art single‑modal models.
ByteDance has open-sourced a native multimodal model called Lance. With only 3B active parameters, it runs on a single 40 GB GPU, which makes it a locally deployable "six-sided warrior", that is, an all-rounder, for image and video tasks.
The model quickly rose to the top of the HuggingFace trending list, and within a day the community had released many quantized versions that run in under 24 GB of VRAM.
Inputs for all three task families, text-to-text (X2T), text-to-image (X2I) and text-to-video (X2V), are encoded into a single multimodal context sequence enhanced with MaPE (Modality-aware Positional Encoding). A dual-expert backbone processes this shared context with generalized 3-D causal attention and produces task-specific hidden states. Two heads then decode those states: an LM head for next-token prediction and a flow head for velocity prediction in the visual latent space.
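The two-head decoding described above can be illustrated with a toy sketch. This is not Lance's code. The sizes, the function names (`decode`, `flow_matching_target`), the per-position modality tags and the linear-path flow-matching convention (velocity = data minus noise) are all assumptions made for illustration. It only shows the general idea: one shared sequence of hidden states, with text positions routed to an LM head and visual positions routed to a flow head.

```python
import random

# Hypothetical sizes chosen only for illustration; not Lance's real dimensions.
HIDDEN, VOCAB, LATENT = 8, 16, 4


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    """Multiply a vector of length `rows` by a rows x cols matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def decode(hidden_states, modalities, lm_head, flow_head):
    """Route each position of the shared sequence to the head for its modality."""
    outputs = []
    for h, m in zip(hidden_states, modalities):
        if m == "text":
            outputs.append(("logits", linear(h, lm_head)))      # next-token scores
        else:
            outputs.append(("velocity", linear(h, flow_head)))  # latent-space velocity
    return outputs


def flow_matching_target(noise, data, t):
    """Assumed linear-path convention: interpolate noise -> data, velocity = data - noise."""
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


rng = random.Random(0)
lm_head = make_matrix(HIDDEN, VOCAB, rng)
flow_head = make_matrix(HIDDEN, LATENT, rng)

# A shared context: two text positions followed by three visual positions.
modalities = ["text", "text", "visual", "visual", "visual"]
hidden_states = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in modalities]

outputs = decode(hidden_states, modalities, lm_head, flow_head)
kinds = [kind for kind, _ in outputs]
sizes = [len(values) for _, values in outputs]

# Training target the flow head would regress toward under the assumed convention.
x_t, v_target = flow_matching_target([0.0, 1.0], [2.0, 3.0], t=0.5)
```

In this sketch the text positions yield vocabulary-sized logit vectors and the visual positions yield latent-sized velocity vectors. That is one plausible reading of how a single backbone can serve both understanding and generation tasks.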
On the DPG Bench and GenEval image‑generation benchmarks, Lance, despite its modest size, ranks first across multiple multimodal metrics and even outperforms leading open‑source single‑modal models such as Flux and Qwen‑Image. In video‑generation benchmarks it surpasses open‑source baselines and rivals closed‑source solutions.
Additional evaluations show that Lance’s image‑editing capabilities exceed GPT‑Image‑1 and Qwen‑Image‑Edit, while its video‑understanding performance beats many specialized models. The demo suite includes video generation, multi‑round consistent video editing, physics‑guided video generation, Q&A, text‑to‑image, free‑form image editing, and image understanding.
References and resources include the HuggingFace model page, the project website, the GitHub repository, and the accompanying arXiv paper (https://arxiv.org/pdf/2605.18678).
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
