World’s First Open‑Source Large Model for Real‑World Medical Video Understanding
The article introduces uAI‑NEXUS‑MedVLM, the world’s first open‑source large model for real‑world medical video understanding. Built on the MedVidBench dataset and the MedGRPO training framework, it overcomes data scarcity, evaluation gaps, and task‑specialization challenges in surgical video AI, achieving state‑of‑the‑art performance across eight benchmark tasks.
Applying AI in surgery demands extreme caution; recent Reuters investigations have highlighted incidents of misidentified anatomy and botched procedures when immature AI systems are deployed in operating rooms. The core technical challenge is that most existing video models are trained on generic visual data and cannot handle the spatial, temporal, and semantic complexities of real surgical footage.
To address these gaps, the research team released the world’s largest and most powerful medical video understanding model, uAI‑NEXUS‑MedVLM, together with a comprehensive benchmark called MedVidBench. MedVidBench comprises over 530k video‑instruction pairs drawn from eight specialized medical datasets (CholecT50, CholecTrack20, Cholec80‑CVS, CoPESD, AVOS, EgoSurgery, JIGSAWS, NurViD), covering laparoscopic, open, robotic, and nursing scenarios. The dataset is split into a large‑scale version (530k samples) and a balanced standard version (51.5k samples) to support both extensive experiments and efficient multitask learning.
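The article does not publish MedVidBench’s record schema, but a video‑instruction pair plausibly bundles a clip reference with its source dataset, scenario, task type, and an instruction/answer pair. The sketch below is purely illustrative; every field name and value is an assumption:

```python
# Hypothetical layout of one MedVidBench video-instruction pair.
# The article does not publish the schema; these field names are
# assumptions chosen to mirror what the text describes.
sample = {
    "video": "cholect50/VID23/clip_0142.mp4",  # illustrative clip path
    "source_dataset": "CholecT50",             # one of the eight source datasets
    "scenario": "laparoscopic",                # laparoscopic / open / robotic / nursing
    "task": "action_recognition",              # one of the eight benchmark tasks
    "instruction": "Describe the surgical action being performed in this clip.",
    "answer": "The surgeon grasps the gallbladder with a grasper and retracts it.",
}
```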
The team’s training pipeline, named MedGRPO, combines supervised fine‑tuning (SFT) on Qwen2.5‑VL‑7B with a novel reinforcement‑learning (RL) stage. Two key innovations enable stable RL across heterogeneous tasks:
Cross‑Dataset Reward Normalization: median‑based logistic scaling aligns reward magnitudes across datasets, ensuring that easy tasks do not drown out harder ones and yielding bounded, outlier‑robust rewards in the (0, 1) interval (see the sketch after this list).
Medical LLM Judge: a GPT‑4.1‑based evaluator scores model outputs on five clinical dimensions (terminology precision, instrument and anatomy identification, detail specificity, contextual awareness, and action/state accuracy), supplementing traditional semantic‑similarity metrics (a sketch of such a rubric also follows below).
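To make the normalization concrete: the article specifies only that MedGRPO applies median‑based logistic scaling per dataset to obtain bounded rewards in (0, 1). One plausible reading, sketched below, centers a sigmoid at the running median of each dataset’s recent raw rewards and scales by the median absolute deviation (MAD); the window size and MAD scaling are assumptions, not the published formula.

```python
import math
from collections import defaultdict, deque
from statistics import median

class MedianLogisticNormalizer:
    """Sketch of per-dataset reward normalization for MedGRPO-style RL.

    The article states only that rewards are aligned via "median-based
    logistic scaling" into (0, 1); the exact formula is unpublished.
    Here a sigmoid is centered at the running median of each dataset's
    recent rewards and scaled by the MAD, which keeps the output
    bounded and robust to outliers.
    """

    def __init__(self, window: int = 512, eps: float = 1e-6):
        self.window = window  # recent rewards tracked per dataset (assumed)
        self.eps = eps        # avoids division by zero when MAD is 0
        self.history = defaultdict(lambda: deque(maxlen=window))

    def normalize(self, dataset: str, raw_reward: float) -> float:
        buf = self.history[dataset]
        buf.append(raw_reward)
        center = median(buf)                                   # robust center
        mad = median(abs(r - center) for r in buf) + self.eps  # robust scale
        # Logistic squashing: a reward at the dataset median maps to 0.5,
        # so uniformly "easy" datasets cannot drown out harder ones.
        z = max(-60.0, min(60.0, (raw_reward - center) / mad))  # clamp for exp
        return 1.0 / (1.0 + math.exp(-z))

# Usage: a raw reward of 0.92 is unremarkable for an easy dataset whose
# recent rewards cluster around it, so it normalizes to roughly 0.5.
norm = MedianLogisticNormalizer()
for r in (0.90, 0.95, 0.92):
    norm.normalize("CholecT50", r)
print(norm.normalize("CholecT50", 0.92))  # 0.5
```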
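Similarly, the article names the five judged dimensions but not the judge’s prompt or scoring scale. The sketch below shows one way such a judge could be wired up; the 0–5 scale, the prompt wording, the JSON reply format, and the call_judge callable (standing in for whatever GPT‑4.1 client the team actually uses) are all assumptions.

```python
import json

# The five dimensions are quoted from the article; everything else
# (prompt wording, 0-5 scale, JSON reply format) is an assumption.
JUDGE_DIMENSIONS = [
    "terminology precision",
    "instrument and anatomy identification",
    "detail specificity",
    "contextual awareness",
    "action/state accuracy",
]

def build_judge_prompt(reference: str, prediction: str) -> str:
    rubric = "\n".join(f"- {d}: integer score 0-5" for d in JUDGE_DIMENSIONS)
    return (
        "You are a clinical evaluator of surgical and nursing video descriptions.\n\n"
        f"Reference annotation:\n{reference}\n\n"
        f"Model output:\n{prediction}\n\n"
        "Score the model output on each dimension and reply with a JSON "
        "object mapping dimension name to score:\n" + rubric
    )

def judge_reward(reference: str, prediction: str, call_judge) -> float:
    """Collapse per-dimension scores into a scalar reward in [0, 1].

    `call_judge` is a placeholder for an LLM call (e.g., GPT-4.1) that
    takes a prompt string and returns the model's text reply.
    """
    scores = json.loads(call_judge(build_judge_prompt(reference, prediction)))
    return sum(scores[d] for d in JUDGE_DIMENSIONS) / (5 * len(JUDGE_DIMENSIONS))
```

A scalar reward produced this way could then pass through the cross‑dataset normalizer above before entering the RL objective.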
Experimental results show that the SFT‑only version of uAI‑NEXUS‑MedVLM already surpasses GPT‑4.1, Gemini‑2.5‑Flash, GPT‑5.4, and Gemini‑3.1 on all eight MedVidBench tasks. Notably, on the critical view of safety (CVS) assessment task the model achieves 89.4% accuracy, nearly 50× GPT‑5.4’s 1.8%, and it reaches three times the mIoU of Gemini‑3.1. Adding the RL stage with reward normalization improves performance further, allowing a 4B model trained with RL to exceed the 7B SFT baseline and demonstrating that efficient training can rival larger models.
Qualitative comparisons on a one‑minute “penicillin skin‑test” nursing video illustrate the gap: GPT‑5.4 produces detailed but hallucinated descriptions and Gemini‑3.1 mis‑timestamps actions, whereas the RL‑enhanced MedVLM accurately identifies “skin disinfection” and “intradermal injection” with correct anatomical references.
Beyond the model itself, the open‑source release includes the MedVidBench leaderboard on HuggingFace, inviting developers worldwide to submit results to a continuously updated, unified ranking. This public benchmark aims to establish a trustworthy evaluation standard for medical video AI, addressing the industry’s need for a “single ruler” akin to ImageNet for vision or GLUE for language.
According to a Global Information market report, the AI‑enhanced surgical video analysis market is projected to grow from $730 million in 2025 to $910 million in 2026, and to reach $2.14 billion by 2030, a 24.1% CAGR over that period. The open‑source model and benchmark are positioned to accelerate this growth by providing a reproducible, clinically relevant foundation.
In summary, the combination of a large, meticulously annotated medical video dataset, cross‑dataset reward normalization, and a domain‑specific LLM judge enables the first large‑scale, high‑performing medical video understanding system, moving AI in surgery from “guesswork” toward reliable clinical assistance.