Tagged articles
30 articles
Machine Heart
Apr 27, 2026 · Artificial Intelligence

Why Traditional Video Captions Fail and How MTSS Solves the Problem

The article introduces Multi-Stream Scene Script (MTSS), a structured JSON‑based video description paradigm that replaces monolithic captions, explains its design principles, compares its advantages, and presents experimental evidence showing significant gains in both video understanding and generation tasks.

MTSS · Multimodal AI · Video Generation
0 likes · 8 min read
SuanNi
Apr 19, 2026 · Artificial Intelligence

Why Multimodal Video Models Still Miss the Mark: Inside the New Video‑MME‑v2 Benchmark

The Video‑MME‑v2 benchmark shows that current multimodal video models, despite high leaderboard scores, still struggle with genuine video understanding. Its three‑layer evaluation, non‑linear scoring, and curated 800‑video dataset expose the limits of these models.

AI Evaluation · Video-MME · large language models
0 likes · 10 min read
Machine Heart
Apr 13, 2026 · Artificial Intelligence

Why the Top Video Model Scores Only 49: Introducing Video‑MME‑v2 by Nanjing University

The new Video‑MME‑v2 benchmark shows that, although scores on existing video‑understanding tests have saturated, the strongest commercial model (Gemini‑3‑Pro) reaches only 49.4 points against a human expert’s 90.7. The article covers the benchmark’s layered ability system, its group‑level non‑linear scoring, and the nuanced impact of "Thinking" features.

AI Evaluation · large models · multimodal benchmark
0 likes · 11 min read
AI Explorer
Apr 7, 2026 · Artificial Intelligence

MedGRPO Redefines Med Video Understanding, Shifting AI from Assistant to Partner

MedGRPO, a multimodal large model, achieves a breakthrough in medical video understanding by introducing clinical semantic parsing that aligns visual cues with structured medical knowledge, boosting performance and raising ethical questions about AI’s evolving role from a supportive assistant to a collaborative clinical partner.

AI ethics · Clinical Semantic Parsing · medical-ai
0 likes · 6 min read
AI Algorithm Path
Dec 23, 2025 · Artificial Intelligence

Fine‑Tuning Qwen‑Video‑8B with LLaMA‑Factory for Domain‑Specific Video Understanding

This article details how the Qwen‑Video‑8B model, built on Qwen3‑VL‑8B‑Instruct, is fine‑tuned with the LLaMA‑Factory framework using a curated city‑scenery dataset, addresses challenges of domain knowledge, temporal modeling and multimodal fusion, and demonstrates improved video captioning across baseline, English‑fine‑tuned and Chinese‑fine‑tuned versions.

AI fine-tuning · LLaMA-Factory · LoRA
0 likes · 10 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

CrossVid: The New Benchmark Exposing AI’s Struggle with Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning, offering 5,331 videos and 9,015 high‑quality QA pairs across four reasoning dimensions, and revealing that even the strongest models reach only about 50% accuracy, well short of human performance.

AI Evaluation · cross-video reasoning · video understanding
0 likes · 9 min read
Xiaohongshu Tech REDtech
Dec 4, 2025 · Artificial Intelligence

CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning tasks, providing 5,331 videos, 9,015 QA pairs, four high‑level dimensions and ten specific tasks, and exposing significant performance gaps between current models and humans.

AI Evaluation · cross-video reasoning · multimodal LLM
0 likes · 9 min read
Kuaishou Tech
Nov 28, 2025 · Artificial Intelligence

Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks

Kwai has open‑sourced its new flagship multimodal model Keye‑VL‑671B‑A37B, which upgrades visual perception, cross‑modal alignment and complex reasoning, achieving top scores on image, video, and mathematical reasoning benchmarks while detailing its architecture, three‑stage pre‑training, post‑training strategies, and future multimodal agent plans.

Deep Learning · large language model · multimodal
0 likes · 10 min read
AntTech
Oct 28, 2025 · Artificial Intelligence

Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech

Ming‑Flash‑Omni‑Preview is a 103‑billion‑parameter open‑source multimodal model built on a sparse MoE architecture. It delivers state‑of‑the‑art performance in controllable image generation, streaming video understanding, and context‑aware speech recognition, and surpasses prior models on the GenEval and GEdit benchmarks.

image generation · large language model · multimodal
0 likes · 8 min read
Data Party THU
Sep 26, 2025 · Artificial Intelligence

How Keye‑VL‑1.5 Redefines Video Understanding with Slow‑Fast Encoding

Keye‑VL‑1.5, an 8‑billion‑parameter multimodal large language model, introduces a Slow‑Fast video encoding strategy, a four‑stage progressive pre‑training pipeline with 128K context, and a sophisticated post‑training regime that together achieve state‑of‑the‑art performance on video and vision‑language benchmarks while maintaining strong general capabilities.

Benchmark · large language model · multimodal LLM
0 likes · 21 min read
Kuaishou Tech
Sep 5, 2025 · Artificial Intelligence

How Keye‑VL‑1.5‑8B Sets New Benchmarks in Multimodal AI

Short‑video platform Kwai (Kuaishou) has open‑sourced the 8‑billion‑parameter multimodal LLM Keye‑VL‑1.5, which introduces a Slow‑Fast frame encoding strategy, a progressive four‑stage pre‑training pipeline, and an automated data construction workflow, achieving state‑of‑the‑art results on video and vision‑language benchmarks and surpassing many closed‑source models.

Multimodal AI · benchmark performance · large language model
0 likes · 12 min read
Kuaishou Large Model
Jun 5, 2025 · Artificial Intelligence

7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team had seven papers accepted at ACL 2025, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.

ACL 2025 · Benchmark · Multimodal AI
0 likes · 16 min read
Kuaishou Tech
Jun 5, 2025 · Artificial Intelligence

7 Kuaishou AI Papers Accepted at ACL 2025: Video Understanding & Safe LLM Decoding

Kuaishou’s foundational large-model team has had seven papers accepted at ACL 2025, spanning alignment bias in training, safety defenses during inference, decoding strategies, fine-grained video-temporal understanding, reward fairness in RLHF, multimodal captioning benchmarks, and methods to curb hallucinations in vision-language models.

ACL · AI Safety · Benchmark
0 likes · 13 min read
AIWalker
Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.

Direct Preference Optimization · Multimodal AI · Preference Modeling
0 likes · 19 min read
Alimama Tech
Nov 6, 2024 · Artificial Intelligence

How AI Generates Synchronized Video Narrations for E‑Commerce

This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.

Dataset · LLM · Multimodal AI
0 likes · 12 min read
Baidu MEUX
Jul 24, 2024 · Artificial Intelligence

What’s New in AI? Video QA, Audio Generation, and Major Industry Moves

This roundup highlights the latest AI breakthroughs, including Zhipu AI's video‑understanding model for temporal Q&A, Tencent's video‑to‑audio generation system, Vimeo's AI‑content labeling policy, Apple’s Core ML inclusion of ByteDance’s depth model, AMD’s acquisition of Silo AI, Claude’s new editing features, Quark’s all‑in‑one search AI, TikTok’s VR live streaming on Vision Pro, the launch of the "Xinliu" AI search assistant, and Canva’s restrictions on political AI‑generated posters.

AI models · artificial intelligence · audio generation
0 likes · 8 min read
DaTaobao Tech
Aug 21, 2023 · Artificial Intelligence

Action Sensitivity Learning for Temporal Action Localization

The paper presents Action Sensitivity Learning (ASL), a framework that models frame‑wise importance at both class‑level (via learnable Gaussian distributions) and instance‑level (using quality scores), integrates these weights into classification and regression losses, adds a contrastive InfoNCE term, and achieves state‑of‑the‑art temporal action localization performance across six benchmark datasets.

Action Sensitivity Learning · Computer Vision · Deep Learning
0 likes · 8 min read
HomeTech
Jul 7, 2023 · Artificial Intelligence

Multi-Modal Video Understanding and AIGC Video Generation at Autohome

This article presents a comprehensive multi-modal video understanding system for AIGC video generation, detailing technical architecture, GCN-based semi-supervised learning, and practical applications across automotive content scenarios.

AIGC · BERT · NeXtVLAD
0 likes · 8 min read
DataFunSummit
Jun 22, 2022 · Artificial Intelligence

Generating and Applying Social Relationship Graphs for Video Understanding

This talk presents recent research on integrating dynamic analysis and graph machine learning to generate social relationship graphs from video, detailing hierarchical graph convolution networks, multimodal feature fusion, weakly supervised training, experimental results, and applications such as enhanced video retrieval and storyline understanding.

Graph Neural Network · Weak Supervision · social relationship graph
0 likes · 11 min read
Xiaohongshu Tech REDtech
Jun 20, 2022 · Artificial Intelligence

Action Sequence Verification in Videos with CosAlignment Transformer (CAT)

The paper introduces Action Sequence Verification (ASV), a task that determines whether two videos follow the same ordered actions, provides the Chemical Sequence Verification dataset and re‑annotated COIN‑SV and Diving48‑SV sets, and proposes the CosAlignment Transformer (CAT) with intra‑step feature extraction, a Transformer‑based inter‑step encoder, and a sequence‑alignment loss that outperforms prior baselines and serves as a pre‑training model for video retrieval and classification.

Action Verification · Computer Vision · Dataset
0 likes · 7 min read
DataFunTalk
May 20, 2022 · Artificial Intelligence

Hierarchical Graph Convolutional Networks for Video Social Relationship Modeling

This article presents a multimodal approach that combines dynamic analysis and graph machine learning to generate and apply social relationship graphs in videos, detailing problem background, graph generation modules, applications such as video retrieval, experimental results, and future research directions.

AI · Graph Neural Network · Weak Supervision
0 likes · 11 min read
AntTech
Oct 19, 2021 · Artificial Intelligence

Target Re‑identification and Occluded Video Instance Segmentation: Applications in Insurance Claims and Pet Identification

The article introduces pet identity verification using target re‑identification and occluded video instance segmentation, describes recent ICCV VIPriors competitions where Ant Group’s insurance team achieved top ranks, and explains how these computer‑vision techniques are applied to insurance claims, pet identification, and future AI scenarios.

Insurance AI · Target Re-identification · instance segmentation
0 likes · 7 min read
Tencent Advertising Technology
May 28, 2021 · Artificial Intelligence

Insights from the Tencent Advertising Algorithm Competition: Model Framework and Optimization Strategies

The article shares a Tencent competition champion’s practical TensorFlow‑based video ad solution, detailing data handling, model architecture, optimization tricks, multimodal fusion techniques, and experimental observations to help participants improve performance in the 2021 Tencent Advertising Algorithm Contest.

TensorFlow · advertising algorithm · competition
0 likes · 7 min read
Youku Technology
Mar 23, 2021 · Artificial Intelligence

Text-Video Alignment Algorithm for Automated Short Video Production at Youku

Youku’s new text‑video alignment system automatically generates short video summaries by extracting multimodal video and linguistic features, matching sentences to clips through embedding and tag‑level models, and enabling AI‑driven auto‑editing that cuts production time from days to minutes.

BERT · NLP · cross-modal matching
0 likes · 10 min read
iQIYI Technical Product Team
Aug 7, 2020 · Artificial Intelligence

Boundary Content Graph Neural Network (BC‑GNN) for Temporal Action Proposal Generation

The Boundary Content Graph Neural Network (BC‑GNN) introduces a bipartite‑graph framework that jointly refines start/end boundary probabilities and segment‑content confidence, enabling more precise temporal action proposals and achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS14.

BC-GNN · Computer Vision · Deep Learning
0 likes · 10 min read
ITPUB
Aug 7, 2020 · Artificial Intelligence

How BC‑GNN Improves Temporal Action Proposals with Boundary‑Content Graph Modeling

The paper introduces Boundary Content Graph Neural Network (BC‑GNN), a graph‑based approach that jointly models boundary and content predictions to generate more accurate temporal action proposals and reliable confidence scores, achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS‑14.

BC-GNN · ECCV2020 · Graph Neural Network
0 likes · 12 min read
DataFunTalk
Jul 26, 2019 · Artificial Intelligence

Hulu’s Video Content Understanding: Challenges, Practices, and Applications

This article summarizes Hulu Chief Research Officer Xie Xiaohui’s presentation on why video content understanding is essential, the technical challenges involved, and Hulu’s end‑to‑end solutions—including fine‑grained segmentation, logo and subtitle detection, automated pipelines, tagging taxonomy, content generation, and vector embeddings—to improve recommendation, advertising, and search for massive video libraries.

AI · Hulu · content tagging
0 likes · 14 min read
Youku Technology
Nov 2, 2018 · Artificial Intelligence

How AI Powers Next‑Gen Multimedia Content Retrieval: From OCR to Knowledge Graphs

This article examines the evolution of search, defines multimedia content retrieval, explores user scenarios such as voice, image, and video input, and details key AI techniques—including OCR, face recognition, and content knowledge graphs—that enable semantic understanding and ranking of video content.

Knowledge Graph · OCR · face recognition
0 likes · 12 min read