AI Frontier Lectures
Author: AI Frontier Lectures

Leading AI knowledge platform

164 articles · 0 likes · 1 view · 0 comments
Recent Articles

Latest from AI Frontier Lectures

Jan 5, 2026 · Artificial Intelligence

Can AI Really Understand Dynamic First‑Person Scenes? Inside the New EOC‑Bench

The article introduces EOC‑Bench, a benchmark that evaluates multimodal large language models on dynamic first‑person visual tasks across past, present, and future time dimensions. It covers the benchmark's 3,277 questions, a novel multi‑scale temporal accuracy metric, extensive model comparisons, and a detailed error analysis revealing current models' limitations in temporal perception and memory.

MLLM evaluation · dynamic perception · multimodal AI
0 likes · 10 min read
Dec 17, 2025 · Artificial Intelligence

Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?

OmniVGGT introduces a flexible omni‑modality driven transformer that can ingest arbitrary numbers of geometric cues such as depth maps and camera parameters, achieving state‑of‑the‑art performance on diverse 3D tasks while keeping inference speed comparable to its RGB‑only predecessor.

3D vision · Multi-Modal · OmniVGGT
0 likes · 13 min read
Dec 15, 2025 · Artificial Intelligence

How UnityVideo Unifies Multimodal Training to Boost Video Generation

UnityVideo, a new vision framework from HKUST, CUHK, Tsinghua, and Kuaishou, unifies training across depth, flow, pose, segmentation, and RGB modalities. Compared with existing single‑modality video generators, it achieves faster convergence, higher video quality, zero‑shot generalization, and stronger physical reasoning.

AI research · UnityVideo · Vision Models
0 likes · 15 min read
Dec 9, 2025 · Artificial Intelligence

CrossVid: The New Benchmark Exposing AI’s Struggle with Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning, offering 5,331 videos and 9,015 high‑quality QA pairs across four reasoning dimensions, and revealing that even the strongest models achieve only about 50% accuracy compared with human performance.

AI evaluation · cross-video reasoning · video understanding
0 likes · 9 min read
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
0 likes · 12 min read
Nov 28, 2025 · Artificial Intelligence

Can AI Generate the Next Step in a Video? Inside the VANS Model

Researchers from Kuaishou and City University of Hong Kong introduce VANS, a novel Video-as-Answer system that predicts and visualizes the next event in a video by jointly optimizing a visual language model and a video diffusion model, enabling personalized step‑by‑step guidance and future scenario generation.

Video Generation · future prediction · joint optimization
0 likes · 10 min read
Nov 28, 2025 · Artificial Intelligence

How Meta’s SAM 3D Turns a Single Photo into Detailed 3D Models

Meta’s newly released SAM 3 and SAM 3D models enable promptable segmentation and single‑image 3D reconstruction, outperforming prior methods on benchmarks. They introduce a shared perception encoder, a Presence Head to reduce hallucinations, and a two‑stage generation pipeline that produces high‑fidelity geometry and texture.

3D reconstruction · Meta · SAM 3
0 likes · 12 min read
Nov 25, 2025 · Artificial Intelligence

How RoMa v2 Achieves Harder, Better, Faster, Denser Feature Matching

RoMa v2 introduces a two‑stage matching‑then‑refinement pipeline powered by DINOv3 features, custom CUDA kernels, and diverse training data, delivering state‑of‑the‑art accuracy, speed, and pixel‑level uncertainty estimation across a wide range of dense matching benchmarks.

DINOv3 · RoMa v2 · benchmark results
0 likes · 10 min read