
How Close Is Video Generation to Being Beautiful, Useful, Accurate? 1080‑Prompt, 7‑Model KIVI Benchmark

Researchers introduce KIVI, a knowledge‑intensive video generation task, and KIVI‑Bench, a benchmark of 1,080 real‑world prompts. Seven models are evaluated with two new metrics, FactP and HelpS. The results reveal systematic errors such as entity mis‑depiction, procedural mistakes, and component misplacement, and a clear gap between AI‑generated and human‑crafted videos.

Machine Heart

Jun 15, 2026

As video generation moves beyond entertainment into scientific, medical, and educational domains, the key question is whether models can produce videos that are factually accurate, clear, and actionable rather than merely pleasing to look at.

KIVI: A Knowledge‑Intensive Video Generation Task

To address this, researchers define the Knowledge‑Intensive Video Generation (KIVI) task, which requires a model to generate, from a short prompt, a video that is factually correct and usable. They build the KIVI‑Bench evaluation set, which contains 1,080 prompts across 18 categories such as automotive maintenance, healthcare, and electronics. The prompts were produced by LLM expansion followed by manual deduplication.

Prompt Construction Standards

The prompts follow five criteria:

Video superiority: the task involves actions or navigation that video conveys more intuitively than text.

Fact correctness and verifiability: statements must be factual and refer to publicly documented entities.

Knowledge‑challenging proper nouns: prompts name specific products (e.g., a Bostitch pencil sharpener).

Beyond common sense: the task requires genuine procedural knowledge (e.g., operating an Omron BP5450 blood pressure monitor).

Real‑world phrasing: prompts are short and written in natural language, matching how users actually ask.

Automatic Evaluation Metrics: FactP and HelpS

Traditional visual quality metrics (Imaging Quality, Motion Smoothness) do not reflect content accuracy. KIVI therefore proposes two complementary automatic metrics:

FactP (Fact Precision): extracts atomic statements from the video with an LLM, verifies each against external sources, and scores the proportion of correct statements.

HelpS (Helpfulness Score): rates relevance, completeness, and clarity, answering whether a user could complete the task solely by watching the video.
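
The two metrics can be illustrated with a minimal sketch. It is not the benchmark's implementation: the `Statement` structure, the handling of videos with no extractable claims, and in particular the unweighted mean used for HelpS are assumptions made for illustration, since the article does not say how the three HelpS criteria are combined. The LLM extraction and verification steps are replaced here by hand-labelled claims taken from the article's own examples.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    """One atomic claim extracted from a video (hypothetical structure)."""
    text: str
    is_correct: bool  # outcome of checking the claim against external sources


def fact_precision(statements: list[Statement]) -> float:
    """Proportion of extracted statements verified as correct, in [0, 1]."""
    if not statements:
        return 0.0  # assumption: a video with no extractable claims scores 0
    return sum(s.is_correct for s in statements) / len(statements)


def helpfulness_score(relevance: float, completeness: float, clarity: float) -> float:
    """Assumed aggregation: unweighted mean of three sub-scores in [0, 1]."""
    return (relevance + completeness + clarity) / 3


# Hand-labelled claims standing in for the LLM extraction and verification steps.
claims = [
    Statement("The blood pressure cuff is wrapped around the upper arm.", True),
    Statement("The blood pressure cuff is wrapped around the forearm.", False),
    Statement("The pencil sharpener has a curved body.", True),
    Statement("The pencil sharpener has a box-shaped socket.", False),
]
print(fact_precision(claims))  # 0.5
```

Keeping the per-statement verdicts, rather than only the final ratio, is what makes it possible to analyse afterwards which kinds of statements a model gets wrong.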

Model Evaluation

Seven mainstream systems are evaluated: closed‑source APIs (Seedance 2.0, HappyHorse 1.0), open‑source short‑video models (Wan 2.2, HunyuanVideo 1.5), and open‑source long‑video models (Helios‑Base, LongCat‑Video, LongLive 1.0). Human‑crafted videos achieve FactP 97.8 % and HelpS 81.9 %, far surpassing all models. Among the models, HappyHorse 1.0 attains the highest FactP (83.2 %), Seedance 2.0 the highest HelpS (66.6 %). The best open‑source short‑video model, Wan 2.2, reaches FactP 73.1 % and HelpS 48.4 %, still lagging behind closed‑source systems. Short‑video models outperform long‑video models on both metrics.
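
To make the size of the gap concrete, the short sketch below computes the percentage-point shortfall of the reported scores relative to human‑crafted videos. The numbers are the ones quoted above; the variable and function names are illustrative only.

```python
# Scores as reported in the article, in percent.
human = {"FactP": 97.8, "HelpS": 81.9}
best_model = {
    "FactP": ("HappyHorse 1.0", 83.2),
    "HelpS": ("Seedance 2.0", 66.6),
}
best_open_short = ("Wan 2.2", {"FactP": 73.1, "HelpS": 48.4})


def gap_to_human(metric: str, score: float) -> float:
    """Percentage-point shortfall relative to human-crafted videos."""
    return round(human[metric] - score, 1)


for metric, (name, score) in best_model.items():
    print(f"{metric}: {name} trails human videos by {gap_to_human(metric, score)} points")

name, scores = best_open_short
for metric, score in scores.items():
    print(f"{metric}: {name} trails human videos by {gap_to_human(metric, score)} points")
```

Even the strongest system on each metric sits roughly 15 points below the human reference, and the best open‑source short‑video model trails by about 25 points on FactP and more than 33 on HelpS.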

Human Evaluation

On a 108‑pair subset, FactP agrees with human factuality judgments at a rate of 70.8 % (vs. 56.5 % for VBench‑Long), and HelpS agrees with human judgments at 69.0 %. The traditional Imaging Quality metric agrees with human judgments far less often (38.9 %).
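
One common way to measure this kind of alignment is a pairwise agreement rate: for each pair of videos, check whether the metric prefers the same video as the human annotator. The article does not describe the exact protocol, so the sketch below is an assumption, including the treatment of ties and the toy scores.

```python
def preference(score_a: float, score_b: float) -> str:
    """Which video of a pair the metric prefers; 'tie' if the scores are equal."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "tie"


def agreement_rate(
    metric_scores: list[tuple[float, float]], human_choices: list[str]
) -> float:
    """Fraction of pairs where the metric's preference matches the human label."""
    if len(metric_scores) != len(human_choices):
        raise ValueError("need one human label per pair")
    if not metric_scores:
        raise ValueError("need at least one pair")
    matches = sum(
        preference(a, b) == choice
        for (a, b), choice in zip(metric_scores, human_choices)
    )
    return matches / len(human_choices)


# Toy data, not taken from the benchmark: four pairs of metric scores
# and the video the human annotator preferred in each pair.
pairs = [(0.9, 0.4), (0.3, 0.7), (0.5, 0.5), (0.8, 0.6)]
humans = ["a", "b", "a", "b"]
print(agreement_rate(pairs, humans))  # 0.5
```

Under this reading, a metric that picked a winner at random on pairs with a clear human preference would land near 50 %, which puts the reported 38.9 % for Imaging Quality in perspective.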

Systematic Error Analysis

Analysis of 870 error statements uncovers three dominant failure modes:

Entity mis‑depiction (42.6 %): models hallucinate visual features that the real object does not have, e.g., rendering a Bostitch electric pencil sharpener with a box‑shaped socket instead of its curved body.

Procedural errors (40.7 %): models get the appearance right but perform the steps incorrectly, such as wrapping an Omron BP5450 cuff around the forearm rather than the upper arm.

Component misplacement (15.0 %): objects appear in implausible locations, e.g., the oil and funnel placed in a car’s center armrest instead of the engine compartment.
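
A breakdown like this can be produced by tallying a failure-mode label for each error statement. The sketch below assumes such labels already exist (the label names and toy data are hypothetical), and then checks the shares reported above, which leave a small remainder outside the three dominant modes.

```python
from collections import Counter


def failure_mode_shares(labels: list[str]) -> dict[str, float]:
    """Percentage of error statements per failure mode, rounded to 0.1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: round(100 * n / total, 1) for mode, n in counts.items()}


# Toy labels, not the benchmark's data.
toy = ["entity"] * 4 + ["procedure"] * 4 + ["placement"] * 2
print(failure_mode_shares(toy))
# {'entity': 40.0, 'procedure': 40.0, 'placement': 20.0}

# Shares reported in the article, in percent of the 870 error statements.
reported = {"entity": 42.6, "procedure": 40.7, "placement": 15.0}
remainder = round(100 - sum(reported.values()), 1)
print(remainder)  # 1.7
```

The three modes account for 98.3 % of the analysed errors, so about 1.7 % fall into other categories that the article does not itemise.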

Conclusion

KIVI defines a long‑overlooked direction for video generation: delivering reliable knowledge in knowledge‑intensive scenarios rather than merely creating entertaining visuals. The benchmark and its FactP and HelpS metrics quantify the gap between current models and human‑crafted videos, and they point to the next frontier: moving video generation beyond pixel‑level realism toward a practical medium for knowledge acquisition.
