Tagged articles

video understanding

31 articles

Machine Learning Algorithms & Natural Language Processing

Jun 3, 2026 · Artificial Intelligence

Can Multimodal Models Ditch Frame Sampling? LLaVA‑OneVision‑2.0’s Codec‑Stream

LLaVA‑OneVision‑2.0 replaces uniform frame sampling with a codec‑stream visual unit, integrates a OneVision‑Encoder that tokenizes video as state‑plus‑incremental evidence, and demonstrates consistent gains on 18 video, 11 spatial‑reasoning and 4 tracking benchmarks while open‑sourcing its model, data and code.

JumpScore · LLaVA-OneVision-2.0 · Multimodal

0 likes · 17 min read

Machine Heart

Apr 27, 2026 · Artificial Intelligence

Why Traditional Video Captions Fail and How MTSS Solves the Problem

The article introduces Multi-Stream Scene Script (MTSS), a structured JSON‑based video description paradigm that replaces monolithic captions, explains its design principles, compares its advantages, and presents experimental evidence showing significant gains in both video understanding and generation tasks.

MTSS · Multimodal AI · structured video description

0 likes · 8 min read

SuanNi

Apr 19, 2026 · Artificial Intelligence

Why Multimodal Video Models Still Miss the Mark: Inside the New Video‑MME‑v2 Benchmark

The Video‑MME‑v2 benchmark shows that current multimodal video models, despite high leaderboard scores, still struggle with genuine video understanding; its rigorous three‑layer evaluation, non‑linear scoring, and meticulously curated 800‑video dataset expose the limits of their capabilities.

AI evaluation · Video-MME · large language models

0 likes · 10 min read

Machine Heart

Apr 13, 2026 · Artificial Intelligence

Why the Top Video Model Scores Only 49: Introducing Video‑MME‑v2 by Nanjing University

The new Video‑MME‑v2 benchmark reveals that despite saturated high scores on existing video‑understanding tests, the strongest commercial model (Gemini‑3‑Pro) reaches only 49.4 points versus a human expert’s 90.7, highlighting the benchmark’s layered ability system, group‑level non‑linear scoring, and the nuanced impact of "Thinking" features.

AI evaluation · large models · multimodal benchmark

0 likes · 11 min read

AI Explorer

Apr 7, 2026 · Artificial Intelligence

MedGRPO Redefines Med Video Understanding, Shifting AI from Assistant to Partner

MedGRPO, a multimodal large model, achieves a breakthrough in medical video understanding by introducing clinical semantic parsing that aligns visual cues with structured medical knowledge, boosting performance and raising ethical questions about AI’s evolving role from a supportive assistant to a collaborative clinical partner.

AI ethics · Clinical Semantic Parsing · medical AI

0 likes · 6 min read

AI Algorithm Path

Dec 23, 2025 · Artificial Intelligence

Fine‑Tuning Qwen‑Video‑8B with LLaMA‑Factory for Domain‑Specific Video Understanding

This article details how the Qwen‑Video‑8B model, built on Qwen3‑VL‑8B‑Instruct, is fine‑tuned with the LLaMA‑Factory framework using a curated city‑scenery dataset, addresses challenges of domain knowledge, temporal modeling and multimodal fusion, and demonstrates improved video captioning across baseline, English‑fine‑tuned and Chinese‑fine‑tuned versions.

AI fine-tuning · LLaMA-Factory · LoRA

0 likes · 10 min read

AI Frontier Lectures

Dec 9, 2025 · Artificial Intelligence

CrossVid: The New Benchmark Exposing AI’s Struggle with Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning, offering 5,331 videos and 9,015 high‑quality QA pairs across four reasoning dimensions, and revealing that even the strongest models reach only about 50% accuracy, short of human performance.

AI evaluation · cross-video reasoning · video understanding

0 likes · 9 min read

Xiaohongshu Tech REDtech

Dec 4, 2025 · Artificial Intelligence

CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning tasks, providing 5,331 videos, 9,015 QA pairs, four high‑level dimensions and ten specific tasks, and exposing significant performance gaps between current models and humans.

AI evaluation · cross-video reasoning · multimodal LLM

0 likes · 9 min read

Kuaishou Tech

Nov 28, 2025 · Artificial Intelligence

Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks

Kwai has open‑sourced its new flagship multimodal model Keye‑VL‑671B‑A37B, which upgrades visual perception, cross‑modal alignment and complex reasoning, achieving top scores on image, video, and mathematical reasoning benchmarks while detailing its architecture, three‑stage pre‑training, post‑training strategies, and future multimodal agent plans.

Large Language Model · Multimodal · deep learning

0 likes · 10 min read

AntTech

Oct 28, 2025 · Artificial Intelligence

Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech

Introducing Ming‑Flash‑Omni‑Preview, a 103‑billion‑parameter open‑source multimodal model built on a sparse MoE architecture that delivers state‑of‑the‑art performance in controllable image generation, streaming video understanding, and context‑aware speech recognition, surpassing prior models on GenEval and GEdit benchmarks.

Large Language Model · Multimodal · Sparse MoE

0 likes · 8 min read

Data Party THU

Sep 26, 2025 · Artificial Intelligence

How Keye‑VL‑1.5 Redefines Video Understanding with Slow‑Fast Encoding

Keye‑VL‑1.5, an 8‑billion‑parameter multimodal large language model, introduces a Slow‑Fast video encoding strategy, a four‑stage progressive pre‑training pipeline with 128K context, and a sophisticated post‑training regime that together achieve state‑of‑the‑art performance on video and vision‑language benchmarks while maintaining strong general capabilities.

Large Language Model · benchmark · multimodal LLM

0 likes · 21 min read

Kuaishou Tech

Sep 5, 2025 · Artificial Intelligence

How Keye‑VL‑1.5‑8B Sets New Benchmarks in Multimodal AI

Kuaishou (Kwai) has open‑sourced the 8‑billion‑parameter multimodal LLM Keye‑VL‑1.5, which introduces slow‑fast frame encoding, a progressive four‑stage pre‑training pipeline, and an automated data construction workflow, achieving state‑of‑the‑art results on video and vision‑language benchmarks and surpassing many closed‑source models.

Large Language Model · Multimodal AI · benchmark performance

0 likes · 12 min read

Kuaishou Large Model

Jun 5, 2025 · Artificial Intelligence

7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team secured seven papers at the prestigious ACL 2025 conference, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.

ACL 2025 · Multimodal AI · benchmark

0 likes · 16 min read

Kuaishou Tech

Jun 5, 2025 · Artificial Intelligence

7 Kuaishou AI Papers Accepted at ACL 2025: Video Understanding & Safe LLM Decoding

Kuaishou’s foundational large-model team has secured seven papers at ACL 2025, spanning alignment bias in training, safety defenses during inference, decoding strategies, fine-grained video-temporal understanding, reward fairness in RLHF, multimodal captioning benchmarks, and methods to curb hallucinations in vision-language models.

ACL · AI safety · Multimodal

0 likes · 13 min read

AIWalker

Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.

Direct Preference Optimization · Multimodal AI · Preference Modeling

0 likes · 19 min read

Alimama Tech

Nov 6, 2024 · Artificial Intelligence

How AI Generates Synchronized Video Narrations for E‑Commerce

This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.

LLM · Multimodal AI · dataset

0 likes · 12 min read

Baidu MEUX

Jul 24, 2024 · Artificial Intelligence

What’s New in AI? Video QA, Audio Generation, and Major Industry Moves

This roundup highlights the latest AI breakthroughs, including Zhipu AI's video‑understanding model for temporal Q&A, Tencent's video‑to‑audio generation system, Vimeo's AI‑content labeling policy, Apple’s Core ML inclusion of ByteDance’s depth model, AMD’s acquisition of Silo AI, Claude’s new editing features, Quark’s all‑in‑one search AI, TikTok’s VR live streaming on Vision Pro, the launch of the "Xinliu" AI search assistant, and Canva’s restrictions on political AI‑generated posters.

AI models · artificial-intelligence · audio generation

0 likes · 8 min read

DaTaobao Tech

Aug 21, 2023 · Artificial Intelligence

Action Sensitivity Learning for Temporal Action Localization

The paper presents Action Sensitivity Learning (ASL), a framework that models frame‑wise importance at both class‑level (via learnable Gaussian distributions) and instance‑level (using quality scores), integrates these weights into classification and regression losses, adds a contrastive InfoNCE term, and achieves state‑of‑the‑art temporal action localization performance across six benchmark datasets.

Action Sensitivity Learning · Temporal Action Localization · computer vision

0 likes · 8 min read

HomeTech

Jul 7, 2023 · Artificial Intelligence

Multi-Modal Video Understanding and AIGC Video Generation at Autohome

This article presents a comprehensive multi-modal video understanding system for AIGC video generation, detailing technical architecture, GCN-based semi-supervised learning, and practical applications across automotive content scenarios.

AIGC · BERT · NeXtVLAD

0 likes · 8 min read

DataFunSummit

Jun 22, 2022 · Artificial Intelligence

Generating and Applying Social Relationship Graphs for Video Understanding

This talk presents recent research on integrating dynamic analysis and graph machine learning to generate social relationship graphs from video, detailing hierarchical graph convolution networks, multimodal feature fusion, weakly supervised training, experimental results, and applications such as enhanced video retrieval and storyline understanding.

Graph Neural Network · Weak Supervision · social relationship graph

0 likes · 11 min read

Xiaohongshu Tech REDtech

Jun 20, 2022 · Artificial Intelligence

Action Sequence Verification in Videos with CosAlignment Transformer (CAT)

The paper introduces Action Sequence Verification (ASV), a task that determines whether two videos follow the same ordered actions, provides the Chemical Sequence Verification dataset and re‑annotated COIN‑SV and Diving48‑SV sets, and proposes the CosAlignment Transformer (CAT) with intra‑step feature extraction, a Transformer‑based inter‑step encoder, and a sequence‑alignment loss that outperforms prior baselines and serves as a pre‑training model for video retrieval and classification.

Action Verification · Multimodal · Transformer

0 likes · 7 min read

DataFunTalk

May 20, 2022 · Artificial Intelligence

Hierarchical Graph Convolutional Networks for Video Social Relationship Modeling

This article presents a multimodal approach that combines dynamic analysis and graph machine learning to generate and apply social relationship graphs in videos, detailing problem background, graph generation modules, applications such as video retrieval, experimental results, and future research directions.

AI · Graph Neural Network · Multimodal

0 likes · 11 min read

AntTech

Oct 19, 2021 · Artificial Intelligence

Target Re‑identification and Occluded Video Instance Segmentation: Applications in Insurance Claims and Pet Identification

The article introduces pet identity verification using target re‑identification and occluded video instance segmentation, describes recent ICCV VIPriors competitions where Ant Group’s insurance team achieved top ranks, and explains how these computer‑vision techniques are applied to insurance claims, pet identification, and future AI scenarios.

Target Re-identification · instance segmentation · insurance AI

0 likes · 7 min read

Tencent Advertising Technology

May 28, 2021 · Artificial Intelligence

Insights from the Tencent Advertising Algorithm Competition: Model Framework and Optimization Strategies

The article shares a Tencent competition champion’s practical TensorFlow‑based video ad solution, detailing data handling, model architecture, optimization tricks, multimodal fusion techniques, and experimental observations to help participants improve performance in the 2021 Tencent Advertising Algorithm Contest.

Multimodal · TensorFlow · advertising algorithm

0 likes · 7 min read

Youku Technology

Mar 23, 2021 · Artificial Intelligence

Text-Video Alignment Algorithm for Automated Short Video Production at Youku

Youku’s new text‑video alignment system automatically generates short video summaries by extracting multimodal video and linguistic features, matching sentences to clips through embedding and tag‑level models, and enabling AI‑driven auto‑editing that cuts production time from days to minutes.

BERT · NLP · cross-modal matching

0 likes · 10 min read

iQIYI Technical Product Team

Aug 7, 2020 · Artificial Intelligence

Boundary Content Graph Neural Network (BC‑GNN) for Temporal Action Proposal Generation

The Boundary Content Graph Neural Network (BC‑GNN) introduces a bipartite‑graph framework that jointly refines start/end boundary probabilities and segment‑content confidence, enabling more precise temporal action proposals and achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS14.

BC-GNN · computer vision · deep learning

0 likes · 10 min read

ITPUB

Aug 7, 2020 · Artificial Intelligence

How BC‑GNN Improves Temporal Action Proposals with Boundary‑Content Graph Modeling

The paper introduces Boundary Content Graph Neural Network (BC‑GNN), a graph‑based approach that jointly models boundary and content predictions to generate more accurate temporal action proposals and reliable confidence scores, achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS‑14.

BC-GNN · ECCV2020 · Graph Neural Network

0 likes · 12 min read

HomeTech

Mar 4, 2020 · Artificial Intelligence

Video Multi-Label Classification Using Graph Convolutional Networks

This paper introduces a method for video multi-label classification that incorporates label correlation features using graph convolutional networks, significantly improving classification performance.

GCN · InceptionV3 · NeXtVLAD

0 likes · 7 min read

DataFunTalk

Sep 18, 2019 · Artificial Intelligence

AI Applications in iQIYI Video Advertising: Scene Generation, Video Understanding, and Advertising Placement

This article explores how AI is used in iQIYI's video advertising pipeline to analyze video content, generate and recommend ad placement points, create scene‑aware ad creatives, build a video knowledge graph, and support various ad formats, ultimately improving ad relevance and revenue.

AI · Knowledge Graph · ad placement

0 likes · 12 min read

DataFunTalk

Jul 26, 2019 · Artificial Intelligence

Hulu’s Video Content Understanding: Challenges, Practices, and Applications

This article summarizes Hulu Chief Research Officer Xie Xiaohui’s presentation on why video content understanding is essential, the technical challenges involved, and Hulu’s end‑to‑end solutions—including fine‑grained segmentation, logo and subtitle detection, automated pipelines, tagging taxonomy, content generation, and vector embeddings—to improve recommendation, advertising, and search for massive video libraries.

AI · Hulu · content tagging

0 likes · 14 min read

Youku Technology

Nov 2, 2018 · Artificial Intelligence

How AI Powers Next‑Gen Multimedia Content Retrieval: From OCR to Knowledge Graphs

This article examines the evolution of search, defines multimedia content retrieval, explores user scenarios such as voice, image, and video input, and details key AI techniques—including OCR, face recognition, and content knowledge graphs—that enable semantic understanding and ranking of video content.

Knowledge Graph · OCR · face recognition

0 likes · 12 min read
