All‑In‑One Image & Video: ByteDance’s Deployable Native Multimodal Model Lance

Lance, ByteDance’s newly open‑sourced 3‑billion‑parameter multimodal model, runs on a single 40 GB GPU, tops HuggingFace trend charts, and achieves leading scores on DPG Bench, GenEval, and video generation benchmarks while surpassing several state‑of‑the‑art single‑modal models.

SuanNi

May 22, 2026

ByteDance has open‑sourced a native multimodal model called Lance. With only 3 B activated parameters, it can run on a single 40 GB GPU, making it a genuinely local all‑rounder (a "hexagon warrior", strong on every axis) for image and video tasks.

The model quickly rose to the top of the HuggingFace trending list, and within a day the community had released many quantized versions that run in under 24 GB of VRAM.
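As rough context for these VRAM figures, the sketch below estimates weight storage at different precisions. It is back‑of‑the‑envelope arithmetic, not a measurement: the helper name `weight_memory_gb` is hypothetical, the 3 B figure is the activated parameter count quoted above (the article does not state the total count, which may be larger), and activations, caches, and any visual encoder or decoder are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper for illustration only. It counts parameter storage
    and nothing else: no activations, caches, or framework overhead.
    """
    return n_params * bits_per_param / 8 / 1e9


# Assumption: 3e9 is the activated parameter count quoted in the article.
ACTIVATED_PARAMS = 3e9

estimates = {
    bits: weight_memory_gb(ACTIVATED_PARAMS, bits)
    for bits in (16, 8, 4)  # e.g. bf16, int8, 4-bit quantization
}

for bits, gb in estimates.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions the weights are a small part of the budget, which suggests most of the memory in practice goes to activations and long visual token sequences rather than to parameters.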

Inputs for all three task families, X2T (anything‑to‑text), X2I (anything‑to‑image), and X2V (anything‑to‑video), are encoded into a single multimodal context sequence enhanced with MaPE (Modality‑aware Positional Encoding). A dual‑expert backbone processes this shared context with generalized 3‑D causal attention and produces task‑specific hidden states. An LM head decodes those states for next‑token prediction, and a flow head decodes them for velocity prediction in the visual latent space.
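To make "velocity prediction" concrete, here is a minimal sketch of what a flow head is typically trained to output and how samples are drawn from it. The article does not say which flow formulation Lance uses, so the straight‑line path between noise and data (rectified‑flow style) is an assumption, and all function names are hypothetical. A stand‑in oracle replaces the trained network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target under a linear path: d/dt of interpolate is x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy latent: three numbers standing in for a visual latent vector.
rng = random.Random(0)
data = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in data]


def oracle(x, t):
    """Stand-in for a trained flow head: returns the exact target velocity."""
    return velocity_target(noise, data)


sample = euler_sample(oracle, noise, steps=8)
```

With a perfect velocity field, integration carries the noise vector onto the data point. A real flow head only approximates that field from the backbone's hidden states, so the step count and solver choice affect sample quality.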

On the DPG Bench and GenEval image‑generation benchmarks, Lance, despite its modest size, takes first place among multimodal models on several metrics and even outperforms leading open‑source image‑only models such as Flux and Qwen‑Image. On video‑generation benchmarks it surpasses open‑source baselines and rivals closed‑source systems.

Additional evaluations show that Lance’s image‑editing capabilities exceed those of GPT‑Image‑1 and Qwen‑Image‑Edit, while its video‑understanding performance beats that of many specialized models. The demo suite covers video generation, multi‑round consistent video editing, physics‑guided video generation, Q&A, text‑to‑image, free‑form image editing, and image understanding.

References and resources include the HuggingFace model page, the project website, the GitHub repository, and the accompanying arXiv paper (https://arxiv.org/pdf/2605.18678).

