How OmniVideo-100K Generates High‑Quality Audio‑Video Training Data for Better Multimodal Understanding

The article analyzes why existing audio‑video QA pipelines break narrative continuity, proposes a structured‑script and evidence‑chain approach to automatically build the OmniVideo-100K dataset of 100K high‑quality QA pairs, and shows that fine‑tuning open‑source multimodal models on this data yields consistent accuracy gains across multiple benchmarks.

Machine Learning Algorithms & Natural Language Processing

Current audio‑video QA data pipelines typically split videos into short clips, generate separate visual and audio captions, and produce question‑answer pairs directly from those captions. This often loses cross‑segment entity consistency and temporal causality, so models trained on the data learn to answer only superficial questions.

To address this, researchers from Nanjing University and the Institute of Automation of the Chinese Academy of Sciences introduced OmniVideo-100K, an audio‑video QA dataset of 100K QA pairs built through a two‑stage pipeline:

Structured Script Construction: each video is converted into a structured script containing a global summary, a list of main entities, timestamps, speaker labels, non‑speech sounds, and detailed visual descriptions. Entity anchoring ensures that the same character or object retains a stable identifier across all segments.

Clue‑Guided QA Generation: a large language model first scans the full script to extract cross‑segment, multimodal evidence chains (e.g., a spoken cue in the first minute and an action in the fourth minute). Questions are then generated around these evidence chains, so that answers require reasoning over both audio and visual modalities.
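The two stages above can be sketched as plain data structures. This is a minimal illustration, not the authors' implementation: the class names (`Segment`, `Script`, `Clue`), the field names, and the two helper checks are all hypothetical, chosen only to mirror the script fields and the evidence-chain idea described in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One timestamped slice of the structured script (hypothetical schema)."""
    start_s: float
    end_s: float
    speaker: str            # speaker label, "" when nobody speaks
    speech: str             # transcribed speech, "" when silent
    sounds: List[str]       # non-speech sounds
    visual: str             # detailed visual description
    entity_ids: List[str] = field(default_factory=list)


@dataclass
class Script:
    summary: str                 # global summary of the video
    entities: Dict[str, str]     # stable entity id -> description
    segments: List[Segment]


@dataclass
class Clue:
    """One piece of evidence in a chain (hypothetical schema)."""
    segment_index: int
    modality: str                # "audio" or "visual"
    note: str


def unknown_entity_ids(script: Script) -> List[str]:
    """Entity ids used in segments but missing from the entity list.

    An empty result means every mention is anchored to a declared entity.
    """
    used = {e for seg in script.segments for e in seg.entity_ids}
    return sorted(used - set(script.entities))


def is_cross_segment_multimodal(chain: List[Clue]) -> bool:
    """True if the chain touches both modalities and at least two segments."""
    modalities = {c.modality for c in chain}
    segments = {c.segment_index for c in chain}
    return {"audio", "visual"} <= modalities and len(segments) >= 2


# Toy example: a spoken cue early on, and a matching action minutes later.
script = Script(
    summary="A cook promises a dessert, then serves it.",
    entities={"E1": "cook in a red apron"},
    segments=[
        Segment(20.0, 30.0, "E1", "I'll bake a cake later.",
                ["kettle whistling"], "E1 talks to the camera", ["E1"]),
        Segment(200.0, 215.0, "", "",
                ["oven timer"], "E1 takes a cake out of the oven", ["E1"]),
    ],
)
chain = [
    Clue(0, "audio", "spoken promise to bake a cake"),
    Clue(1, "visual", "cake taken out of the oven"),
]
```

A question would only be generated from `chain` if `is_cross_segment_multimodal(chain)` holds, which is one simple way to express the requirement that the answer cannot be read off a single clip or a single modality.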

The pipeline covers more than 10,000 videos from domains such as vlogs, news, cartoons, sports, documentaries, and ego‑centric footage, yielding 100,000 QA pairs with a 7:3 ratio of open‑ended to multiple‑choice questions. Low‑resolution videos and videos with hard‑coded subtitles are filtered out, and the remaining samples are selected based on visual dynamics and speech density.
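The filtering and selection step might look like the sketch below. The article does not give thresholds or a scoring formula, so everything concrete here is an assumption made for illustration: the `VideoMeta` fields, the `MIN_HEIGHT` cutoff, and ranking by the plain sum of the two scores.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VideoMeta:
    video_id: str
    height: int                 # vertical resolution in pixels
    has_hard_subtitles: bool    # subtitles burned into the frames
    visual_dynamics: float      # hypothetical score in [0, 1]
    speech_density: float       # hypothetical score in [0, 1]


MIN_HEIGHT = 480  # assumed cutoff; the source does not state one


def select_videos(videos: List[VideoMeta], top_k: int) -> List[VideoMeta]:
    """Drop low-resolution and hard-subtitled videos, then keep the top_k
    by a simple combined score (an assumed ranking rule)."""
    kept = [
        v for v in videos
        if v.height >= MIN_HEIGHT and not v.has_hard_subtitles
    ]
    kept.sort(key=lambda v: v.visual_dynamics + v.speech_density, reverse=True)
    return kept[:top_k]


def split_counts(total: int, open_parts: int = 7, mc_parts: int = 3) -> Tuple[int, int]:
    """Split a QA budget into (open-ended, multiple-choice) counts by ratio."""
    open_ended = total * open_parts // (open_parts + mc_parts)
    return open_ended, total - open_ended
```

If the 7:3 ratio holds exactly, `split_counts(100_000)` gives 70,000 open‑ended and 30,000 multiple‑choice questions.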

Fine‑tuning three open‑source models (VITA‑1.5, Qwen2.5‑Omni‑7B, and Qwen3‑Omni‑30B) on OmniVideo-100K improves their overall accuracy on the held‑out OmniVideo‑Test set by 20.59%, 17.82%, and 13.86%, respectively. On the more challenging clue‑guided subset, accuracy is lower, at 59.15%, but still above the baselines. The average temporal span of questions increases from 76.24 s to 144.75 s, which points to deeper cross‑modal reasoning.
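The article does not define how the temporal span of a question is measured. One plausible reading, assumed here, is the time between the start of the earliest piece of evidence and the end of the latest one that a question depends on. Under that assumption, the average span can be computed as follows (function names are hypothetical):

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of one piece of evidence


def temporal_span(evidence: List[Interval]) -> float:
    """Seconds from the earliest evidence start to the latest evidence end."""
    starts = [start for start, _ in evidence]
    ends = [end for _, end in evidence]
    return max(ends) - min(starts)


def mean_span(questions: List[List[Interval]]) -> float:
    """Average temporal span over a set of questions."""
    spans = [temporal_span(q) for q in questions]
    return sum(spans) / len(spans)


# Two toy questions, each backed by two evidence intervals.
questions = [
    [(20.0, 30.0), (200.0, 215.0)],   # span: 195.0 s
    [(5.0, 15.0), (40.0, 55.0)],      # span: 50.0 s
]
print(mean_span(questions))  # 122.5
```

A question answerable from a single clip has a span no longer than that clip, so a larger average span is consistent with questions that link evidence across distant segments.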

Cross‑benchmark evaluation shows consistent gains: VITA‑1.5’s JointAVBench score rises by 12.64%; Qwen2.5‑Omni’s Daily‑Omni score improves from 62.41% to 69.84%; and Qwen3‑Omni’s AV Event Alignment score rises from 51.26% to 68.49%. These improvements do not degrade general video understanding, as evaluated on Video‑MME and its v2 variant.

Qualitative analysis indicates that, after fine‑tuning, models rely more on cross‑segment evidence than on single‑modality shortcuts. The authors conclude that high‑quality, evidence‑anchored audio‑video data, rather than sheer quantity, is crucial for advancing multimodal large models, and they release all training and test data publicly.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Machine Learning Algorithms & Natural Language Processing
