Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech
Introducing Ming-Flash-Omni-Preview: a 103-billion-parameter open-source multimodal model built on a sparse Mixture-of-Experts (MoE) architecture. It delivers state-of-the-art results in controllable image generation, streaming video understanding, and context-aware speech recognition, surpassing prior models on the GenEval and GEdit benchmarks.
Model Overview
Ming‑Flash‑Omni‑Preview is the first open‑source multimodal model with parameters reaching the hundred‑billion scale (103B total, 9B active). It is based on the Ling 2.0 sparse MoE architecture and improves upon the earlier Ming‑lite‑omni‑1.5 in both understanding and generation across all modalities.
Key Capabilities
Controllable Image Generation: Introduces a generative-segmentation-as-editing paradigm that treats image segmentation as a semantic-preserving edit task, achieving fine-grained spatial control and a 0.90 score on the GenEval benchmark, outperforming all non-RL methods.
Streaming Video Understanding: Provides fine-grained, real-time comprehension of video content, recognizing objects and interactions and delivering contextual explanations.
Context-Aware Speech Recognition (ContextASR) and Dialect Identification: Covers 15 Chinese dialects with state-of-the-art accuracy on all ContextASR sub-tasks; a toy sketch of context conditioning follows this list.
Voice Cloning: Upgrades the speech tokenizer to a continuous version, achieving a 0.99 WER on seed-tts-zh and robust bilingual synthesis.
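To make the context-conditioning idea concrete, here is a hypothetical sketch: the task, domain, and entity "hotwords" are packed into a textual prefix that biases the decoder toward the supplied spellings. The tag names and the generate() call are illustrative assumptions, not the model's actual interface; the real schema is defined in the Ming repo.

```python
# Hypothetical context prefix for context-aware ASR decoding.
def build_context_prefix(domain: str, entities: list[str]) -> str:
    # Pack task/domain hints and hotword entities into one decoding prompt
    # so rare proper nouns are biased toward the supplied spellings.
    return (f"<task>ASR</task><domain>{domain}</domain>"
            f"<entities>{', '.join(entities)}</entities>")

prefix = build_context_prefix("finance news", ["inclusionAI", "Ling-flash-2.0"])
# e.g. model.generate(audio_features, decoder_prefix=prefix)  # assumed call
```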
Architecture and Training Optimizations
Sparse MoE multimodal training: Extends the Ling-flash-2.0 sparse MoE to all modalities, using modal-level routing to achieve "large capacity, small activation" for each modality and incorporating VideoRoPE in the attention layer for long-video spatio-temporal modeling; a toy router sketch follows this list.
Stable sparse training: Employs a mixed-expert balancing scheme (auxiliary load-balancing loss + router bias updates) to ensure uniform expert activation and convergence under sparsity, as illustrated in the first sketch below.
Context-aware ASR training paradigm: Conditions decoding on task/domain information, improving proper-noun transcription, and adds high-quality dialect data covering 15 Chinese dialects.
Generative segmentation-editing co-training: Reframes image segmentation as a semantic-preserving edit task (e.g., "paint a banana purple"), unifying understanding and generation objectives and providing precise supervision for fine-grained spatio-temporal control; see the second sketch below.
Efficient full-modal training architecture: Implements sequence packing to handle heterogeneous data and flexible encoder sharding (DP/PP/TP) to balance load and eliminate pipeline bubbles, doubling training throughput compared to the baseline; the third sketch below shows a simplified packing routine.
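The first sketch below illustrates modal-level routing with an auxiliary load-balancing loss and router bias updates. It is a toy PyTorch module under stated assumptions (sizes, top-2 routing, Switch-style balancing loss), not the released Ling/Ming code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalRoutedMoE(nn.Module):
    """Toy modal-level MoE: each modality owns a router over a shared
    expert pool ("large capacity, small activation"). Illustrative only."""

    def __init__(self, d_model=512, n_experts=16, top_k=2,
                 modalities=("text", "image", "audio")):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        # One router per modality ("modal-level routing").
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, n_experts)
                                      for m in modalities})
        # Per-expert bias, nudged outside the gradient to keep load uniform.
        self.register_buffer("router_bias", torch.zeros(n_experts))
        self.top_k = top_k

    def forward(self, x, modality):          # x: [tokens, d_model]
        logits = self.routers[modality](x) + self.router_bias
        probs = F.softmax(logits, dim=-1)    # [tokens, n_experts]
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = idx[:, k] == e
                if hit.any():
                    out[hit] += weights[hit, k, None] * expert(x[hit])

        # Switch-style auxiliary balancing loss: fraction of tokens routed
        # to each expert times that expert's mean routing probability.
        load = F.one_hot(idx, len(self.experts)).float().sum(dim=(0, 1))
        load = load / load.sum()
        aux_loss = len(self.experts) * (load * probs.mean(dim=0)).sum()
        return out, aux_loss, load

    @torch.no_grad()
    def update_router_bias(self, load, lr=1e-2):
        # Push biases toward under-used experts (router bias update).
        self.router_bias += lr * (1.0 / len(self.experts) - load)
```

In training, aux_loss would be added to the main loss with a small coefficient and update_router_bias(load) called after each step, so both mechanisms pull expert activation toward uniform.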
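The second sketch shows how segmentation-as-editing can turn a mask into pixel-exact editing supervision: only pixels inside the mask may change, everything else must be preserved. The data layout and instruction text are assumptions for illustration; the actual pipeline lives in the Ming repo.

```python
import numpy as np

def make_segmentation_edit_sample(image, mask, color=(160, 32, 240),
                                  instruction="Paint the banana purple."):
    """Turn a segmentation mask into a semantic-preserving edit pair:
    recolor the masked object, keep all other pixels identical."""
    target = image.copy()
    target[mask.astype(bool)] = color  # edit applies only inside the mask
    return {"source": image, "instruction": instruction, "target": target}

# Toy usage: a 4x4 RGB image whose central 2x2 region is the "banana".
img = np.zeros((4, 4, 3), dtype=np.uint8)
msk = np.zeros((4, 4), dtype=np.uint8)
msk[1:3, 1:3] = 1
sample = make_segmentation_edit_sample(img, msk)
# Supervision is precise: outside the mask, target must equal source.
assert (sample["target"][msk == 0] == sample["source"][msk == 0]).all()
```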
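The third sketch is a simplified stand-in for sequence packing: variable-length samples are packed first-fit into fixed-size buffers, with segment ids so an attention mask can keep packed samples from attending to each other. The real training-side implementation is more elaborate.

```python
def pack_sequences(seqs, max_len, pad_id=0):
    """Greedy first-fit packing of variable-length token sequences into
    fixed max_len buffers; segment id -1 marks padding."""
    buffers = []  # each buffer: (tokens, segment_ids)
    for seq in sorted(seqs, key=len, reverse=True):
        for toks, segs in buffers:
            if len(toks) + len(seq) <= max_len:   # first buffer with room
                seg_id = segs[-1] + 1             # new segment in this buffer
                toks.extend(seq)
                segs.extend([seg_id] * len(seq))
                break
        else:                                     # no room anywhere: new buffer
            buffers.append((list(seq), [0] * len(seq)))
    return [(toks + [pad_id] * (max_len - len(toks)),
             segs + [-1] * (max_len - len(segs)))
            for toks, segs in buffers]

# e.g. pack_sequences([[1, 2, 3], [4, 5], [6]], max_len=4)
# -> [([1, 2, 3, 6], [0, 0, 0, 1]), ([4, 5, 0, 0], [0, 0, -1, -1])]
```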
Performance
On the GenEval benchmark, Ming-Flash-Omni-Preview scores 0.90, surpassing all leading non-RL methods. On the GEdit benchmark, its precise-editing score rises from 6.9 to 7.9 over the previous release, confirming the effectiveness of the generative-segmentation-editing approach for fine-grained control.
Open Source and Getting Started
The model and code are fully open‑source. GitHub repository: https://github.com/inclusionAI/Ming
HuggingFace hub: https://huggingface.co/inclusionAI/Ming-flash-omni-Preview
ModelScope: https://www.modelscope.cn/models/inclusionAI/Ming-flash-omni-Preview
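The snippet below is a minimal loading sketch, assuming the checkpoint ships custom modeling code behind the standard Hugging Face interface (trust_remote_code) and accepts chat-style multimodal messages; the repo README is authoritative for the exact message schema and generation call.

```python
from transformers import AutoModel, AutoProcessor

model_id = "inclusionAI/Ming-flash-omni-Preview"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  device_map="auto")

# Assumed chat-style message format for an image-editing request.
messages = [{"role": "user", "content": [
    {"type": "image", "image": "banana.jpg"},
    {"type": "text", "text": "Paint the banana purple."},
]}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```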
Future Plans
Improve visual‑text understanding to close the gap with specialized VL models.
Enhance multi‑turn speech dialogue and high‑fidelity voice cloning.
Boost complex layout text rendering, editing, and IP‑specific image generation.