Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech

Introducing Ming‑Flash‑Omni‑Preview, a 103‑billion‑parameter open‑source multimodal model built on a sparse MoE architecture. It delivers state‑of‑the‑art performance in controllable image generation, streaming video understanding, and context‑aware speech recognition, surpassing prior models on the GenEval and GEdit benchmarks.

AntTech

Model Overview

Ming‑Flash‑Omni‑Preview is the first open‑source multimodal model to reach the hundred‑billion‑parameter scale (103B total parameters, 9B active). Built on the Ling 2.0 sparse MoE architecture, it improves on the earlier Ming‑lite‑omni‑1.5 in both understanding and generation across all modalities.


Key Capabilities

Controllable Image Generation: Introduces a generative‑segmentation‑as‑editing paradigm that treats image segmentation as a semantic‑preserving edit task, achieving fine‑grained spatial control and a 0.90 score on the GenEval benchmark, outperforming all non‑RL methods.

Streaming Video Understanding: Provides fine‑grained, real‑time comprehension of video content, recognizing objects and interactions and delivering contextual explanations.

Context‑Aware Speech Recognition (ContextASR) and Dialect Identification: Covers 15 Chinese dialects with state‑of‑the‑art accuracy on all ContextASR sub‑tasks (a hypothetical prompt‑conditioning sketch follows this list).

Voice Cloning: Upgrades the speech tokenizer to a continuous version, achieving a seed‑tts‑zh WER of 0.99 and robust bilingual synthesis.
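
To make the idea of context‑aware decoding concrete, below is a minimal, hypothetical sketch of how a decoding prompt might prepend domain, entity, and dialect hints to an ASR request. The prompt format, field names, and the commented‑out `transcribe` call are assumptions for illustration, not the model's actual interface.

```python
# Hypothetical sketch: conditioning ASR decoding on task/domain context.
# The prompt format and the transcribe() helper are illustrative only,
# not the actual Ming-Flash-Omni-Preview interface.

def build_contextual_prompt(domain: str, entities: list[str], dialect: str | None = None) -> str:
    """Assemble a text prefix that biases decoding toward in-domain terms."""
    parts = [f"Domain: {domain}", "Likely proper nouns: " + ", ".join(entities)]
    if dialect:
        parts.append(f"Dialect: {dialect}")
    return "\n".join(parts)

prompt = build_contextual_prompt(
    domain="finance news",
    entities=["Ant Group", "Ling 2.0", "Ming-Flash-Omni"],
    dialect="Cantonese",
)
# transcription = transcribe(audio="meeting.wav", context=prompt)  # hypothetical call
print(prompt)
```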

Architecture and Training Optimizations

Sparse MoE multimodal training: Extends the Ling‑flash‑2.0 sparse MoE to all modalities, using modal‑level routing to achieve "large capacity, small activation" for each modality and incorporating VideoRoPE in the attention layer for long‑video spatio‑temporal modeling (a minimal routing sketch follows this list).

Stable sparse training: Employs a mixed‑expert balancing scheme (auxiliary load‑balancing loss + router bias updates) to ensure uniform activation and convergence under sparsity.

Context‑aware ASR training paradigm: Conditions decoding on task/domain information, improving proper‑noun transcription and adding high‑quality dialect data for 15 Chinese dialects.

Generative segmentation‑editing co‑training: Reframes image segmentation as a semantic‑preserving edit task (e.g., "paint a banana purple"), unifying understanding and generation objectives and providing precise supervision for fine‑grained spatio‑temporal control (see the segmentation‑as‑editing sketch below).

Efficient full‑modal training architecture: Implements sequence packing to handle heterogeneous data and flexible encoder sharding (DP/PP/TP) to balance load and eliminate pipeline bubbles, doubling training throughput compared to the baseline (a generic packing sketch follows below).
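
The exact router implementation is not described in this post, so the following PyTorch sketch only illustrates the general pattern: per‑modality gates over a shared expert pool, a Switch‑style auxiliary load‑balancing loss, and an additive router bias nudged toward under‑used experts. Class and parameter names such as `ModalAwareMoE` are assumptions, not the actual Ming code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of modal-aware sparse MoE routing with an auxiliary
# load-balancing loss and a router bias update. Illustrative only.

class ModalAwareMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2, num_modalities: int = 3):
        super().__init__()
        self.top_k = top_k
        # One gate per modality ("modal-level routing"): text/image/audio tokens
        # are scored by their own router but share the expert pool.
        self.routers = nn.ModuleList([nn.Linear(dim, num_experts) for _ in range(num_modalities)])
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        # Additive bias used only for expert selection (not mixing weights),
        # nudged toward under-loaded experts after each step.
        self.register_buffer("router_bias", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor, modality: int):
        # x: (tokens, dim)
        logits = self.routers[modality](x)                       # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        topk = torch.topk(logits + self.router_bias, self.top_k, dim=-1).indices
        out = torch.zeros_like(x)
        load = torch.zeros(logits.size(-1), device=x.device)
        for e, expert in enumerate(self.experts):
            mask = (topk == e).any(dim=-1)                       # tokens routed to expert e
            if mask.any():
                out[mask] += probs[mask, e, None] * expert(x[mask])
            load[e] = mask.float().mean()
        # Switch-style auxiliary loss: penalize correlation between load and gate mass.
        aux_loss = (load * probs.mean(dim=0)).sum() * logits.size(-1)
        # Bias update: favor experts that received less than average load next step.
        self.router_bias += 1e-3 * torch.sign(load.mean() - load)
        return out, aux_loss
```

For example, `ModalAwareMoE(dim=64)(torch.randn(10, 64), modality=1)` returns the mixed output together with an auxiliary loss term that would be added to the training objective.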
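To illustrate generative segmentation‑as‑editing, here is a small hypothetical sketch that turns a segmentation mask into an edit‑style supervision triple (instruction, source image, target image); the recoloring rule stands in for a semantic‑preserving edit such as "paint a banana purple" and is not the team's actual data pipeline.

```python
import numpy as np

# Hypothetical sketch: turn a segmentation mask into an edit-style training
# triple (instruction, source image, target image).

def mask_to_edit_sample(image: np.ndarray, mask: np.ndarray, label: str, color=(128, 0, 128)):
    """image: (H, W, 3) uint8; mask: (H, W) bool for one object instance."""
    target = image.copy()
    target[mask] = color  # the edit only touches the segmented region
    instruction = f"Paint the {label} purple, keep everything else unchanged."
    return instruction, image, target

# Toy usage with random data standing in for a real image and mask.
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
msk = np.zeros((64, 64), dtype=bool)
msk[20:40, 20:40] = True
instr, src, tgt = mask_to_edit_sample(img, msk, "banana")
print(instr, src.shape, tgt.shape)
```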
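Sequence packing can be sketched as greedy bin‑packing of variable‑length samples into fixed‑length buffers, with per‑sample boundaries recorded so the attention mask keeps samples separate. This is a generic illustration and assumes nothing about the team's actual trainer.

```python
# Generic sketch of greedy sequence packing: concatenate variable-length
# samples into fixed-length buffers and record per-sample boundaries so the
# attention mask can keep samples from attending to each other.

def pack_sequences(lengths: list[int], max_len: int = 4096):
    """Return a list of packs; each pack is a list of (sample_id, start, end)."""
    packs, current, used = [], [], 0
    for sample_id, n in enumerate(lengths):
        if used + n > max_len and current:
            packs.append(current)
            current, used = [], 0
        current.append((sample_id, used, used + n))
        used += n
    if current:
        packs.append(current)
    return packs

# Example: mixed text/image/audio token counts for six samples.
print(pack_sequences([1800, 900, 2500, 300, 1200, 3900], max_len=4096))
```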

Performance

On the GenEval benchmark, Ming‑Flash‑Omni‑Preview scores 0.90, surpassing all leading non‑RL methods. On the GEdit benchmark, its precise editing score rises from 6.9 to 7.9, confirming the effectiveness of the generative‑segmentation‑editing approach for fine‑grained control.

Open Source and Getting Started

The model and code are fully open‑source. GitHub repository: https://github.com/inclusionAI/Ming

HuggingFace hub: https://huggingface.co/inclusionAI/Ming-flash-omni-Preview

ModelScope: https://www.modelscope.cn/models/inclusionAI/Ming-flash-omni-Preview
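
As a minimal starting point, the released checkpoint can be fetched from the Hugging Face Hub with the `huggingface_hub` library; the snippet below only downloads the weights, so refer to the GitHub README for the actual inference code.

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint; see the GitHub README for inference usage.
local_dir = snapshot_download(repo_id="inclusionAI/Ming-flash-omni-Preview")
print("Model files downloaded to:", local_dir)
```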

Future Plans

Improve visual‑text understanding to close the gap with specialized VL models.

Enhance multi‑turn speech dialogue and high‑fidelity voice cloning.

Boost complex layout text rendering, editing, and IP‑specific image generation.

Tags: Large Language Model, Multimodal, Image Generation, Video Understanding, Speech Recognition, Sparse MoE