Xiaomi Scores 14 Papers at CVPR 2026, Showcasing Breakthroughs in Large Models and Autonomous Driving

CVPR 2026 accepted 14 Xiaomi papers spanning long‑video understanding, multimodal reasoning, GUI agents, and autonomous driving. Each comes with arXiv and GitHub links, and the frameworks they introduce include REVISOR, EMO‑R3, TimeViper, MSJoE, SafeGRPO, GUI‑CEval, ProactiveMobile, ParkGaussian, UFO, TraqPoint, SimScale, MeanFuser and DVGT.

Xiaomi Tech

Mar 3, 2026

Overview

CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition) announced its 2026 paper acceptance list, and Xiaomi had 14 papers selected across AI large‑model research, multimodal video understanding, autonomous driving, safety alignment, and benchmark creation. The accepted works demonstrate concrete technical advances and provide open‑source code and arXiv preprints.

REVISOR: Multimodal Introspective Reasoning for Long‑Form Video

Text‑only reflection is a poor fit for long‑video understanding: the visual stream carries far more information than the text, so mistakes in visual evidence cannot be corrected by re‑reading text alone. REVISOR introduces a tool‑enhanced two‑stage framework. In the initial reasoning stage, the model generates a coarse reasoning trajectory and identifies key video segments; a visual toolbox then densely resamples those segments to gather fine‑grained evidence. In the reflection stage, the model iteratively refines its answer using both the initial trajectory and the new evidence. Training employs a Dual‑Attribution Decoupled Reward (DADR) that adds a causal‑segment sufficiency reward to the usual answer‑verification reward, encouraging the model to align its reasoning with causally relevant visual cues. Experiments show REVISOR improves average accuracy by about 2 % on several long‑video benchmarks without extra supervision or external models.
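The coarse‑then‑dense sampling idea described above can be sketched in a few lines of Python. This is only an illustration, not REVISOR's implementation: the function names, the midpoint sampling rule, and the example segment boundaries are assumptions made for this sketch.

```python
from typing import List, Tuple


def uniform_timestamps(start: float, end: float, n: int) -> List[float]:
    """Return n timestamps at the midpoints of n equal bins covering [start, end]."""
    if n <= 0 or end <= start:
        return []
    step = (end - start) / n
    return [start + step * (i + 0.5) for i in range(n)]


def resample_key_segments(
    segments: List[Tuple[float, float]], frames_per_segment: int
) -> List[float]:
    """Densely resample each key segment flagged by the first reasoning pass."""
    stamps: List[float] = []
    for seg_start, seg_end in segments:
        stamps.extend(uniform_timestamps(seg_start, seg_end, frames_per_segment))
    # Deduplicate and keep temporal order so overlapping segments are handled cleanly.
    return sorted(set(stamps))


# Stage 1 (assumed): a sparse, uniform pass over a 10-minute video.
coarse = uniform_timestamps(0.0, 600.0, 6)
# Stage 2 (assumed): the model flagged 100 s to 120 s as a key segment.
dense = resample_key_segments([(100.0, 120.0)], 4)
```

In this toy setup the coarse pass places one frame every 100 seconds, while the second pass places four frames inside a single 20‑second window, which is the kind of evidence the reflection stage would then consume.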

EMO‑R3: Reflective Reinforcement Learning for Emotional Reasoning

Multimodal large language models (MLLMs) struggle with the complexity of human emotions and lack explainability. EMO‑R3 proposes a reflective reinforcement‑learning framework that structures emotional reasoning into three stages—trigger factor identification, response inference, and polarity/activation judgment. A reflective emotional reward guides the model to align visual‑text consistency and logical emotional flow. Benchmarks demonstrate significant gains in emotional reasoning accuracy and interpretability, both in‑domain and cross‑domain.

TimeViper: Hybrid Mamba‑Transformer for Efficient Long Video Understanding

TimeViper uses a hybrid Mamba‑Transformer architecture to handle long‑range temporal context efficiently. The authors observe that visual tokens become redundant in both the Mamba and the Transformer token pipelines. They introduce TransV, which compresses visual token information into textual tokens inside the large language model. Experiments show TimeViper supports longer inputs and faster inference while matching the performance of Transformer‑based models on multiple benchmarks.

MSJoE: Joint Evolution of MLLM and Sampler

Uniform sampling of long videos misses key frames and incurs high computational cost, while similarity‑based heuristics are not coordinated with the model's own inference. MSJoE proposes a “query generation → similarity modeling → joint reinforcement learning” loop that co‑evolves the multimodal large language model and a frame sampler. The method sharply reduces the number of input frames while consistently outperforming uniform and existing learning‑based samplers on long‑video benchmarks.
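To make the "query generation → similarity modeling" part of that loop concrete, here is a minimal Python sketch of similarity‑based frame selection. It is an assumption‑laden illustration rather than MSJoE's method: the embeddings are toy 2‑D vectors, the max‑over‑queries scoring rule is a choice made for this example, and the joint reinforcement‑learning step is not shown.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    query_embs: Sequence[Sequence[float]],
    frame_embs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Pick the k frames most similar to any query, returned in temporal order.

    Assumes query_embs is non-empty.
    """
    scores = [max(cosine(q, f) for q in query_embs) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (hypothetical): frames 0 and 2 point in the query's direction.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0]]
picked = select_frames(queries, frames, 2)  # [0, 2]
```

A uniform sampler with the same budget of two frames would ignore the query entirely, which is the gap that query‑conditioned selection is meant to close.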

SafeGRPO: Rule‑Governed Self‑Rewarded Safety Alignment

Multimodal LLMs can produce unsafe semantics when text and image modalities interact, even if each input is benign. SafeGRPO integrates rule‑driven rewards into the GRPO self‑rewarded optimization pipeline, using the SafeTag‑VL‑3K dataset that contains visual, textual, and compositional safety tags. The framework guides models through structured safety reasoning, improving multimodal safety awareness, compositional robustness, and inference stability without sacrificing general capability.
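GRPO scores each sampled response relative to the other responses drawn for the same prompt, so a rule‑based reward can be plugged in directly. The sketch below shows that normalization step in Python. The three binary checks in rule_reward are hypothetical stand‑ins for whatever rules SafeGRPO actually uses, and the use of the population standard deviation is an assumption of this example.

```python
import statistics
from typing import List


def rule_reward(format_ok: bool, safety_tag_correct: bool, response_safe: bool) -> float:
    """Hypothetical rule-based reward: the sum of three binary checks."""
    return float(format_ok) + float(safety_tag_correct) + float(response_safe)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumes rewards is non-empty.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to the same prompt, scored by the hypothetical rules.
rewards = [
    rule_reward(True, True, True),    # 3.0
    rule_reward(True, False, False),  # 1.0
    rule_reward(True, True, False),   # 2.0
    rule_reward(True, False, True),   # 2.0
]
advantages = group_relative_advantages(rewards)
```

Responses that satisfy more rules than their group average receive a positive advantage and are reinforced; those below average are pushed down, without needing a separately trained reward model.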

GUI‑CEval: Comprehensive Chinese Benchmark for Mobile GUI Agents

Existing GUI benchmarks focus on English apps, single‑skill evaluation, and synthetic data. GUI‑CEval introduces a hierarchical, five‑dimensional (perception, planning, reflection, execution, evaluation) benchmark covering 201 mainstream Chinese apps on four device classes, with multi‑stage human verification. Evaluation of 20 MLLMs and agents reveals strong perception but notable gaps in reflection and post‑action self‑assessment, providing a diagnostic tool for future improvements.

ProactiveMobile: Benchmark for Proactive Intelligence on Mobile Devices

Current mobile agents reactively follow user commands. ProactiveMobile formalizes proactive tasks as intent inference from four context types and generation of executable API call sequences from a 63‑API library. The benchmark contains 14 real‑world scenarios and over 3,600 instances, with multi‑answer annotation and expert review. Experiments show Qwen2.5‑VL‑7B achieves 19.15 % success, outperforming o1 (15.71 %) and GPT‑5 (7.39 %), highlighting the need for proactive capabilities.
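A success rate under multi‑answer annotation can be computed as in the Python sketch below. The exact‑match criterion, the string encoding of API calls, and the API names are all assumptions made for illustration; the benchmark's actual scoring protocol may differ.

```python
from typing import Sequence


def is_success(predicted: Sequence[str], references: Sequence[Sequence[str]]) -> bool:
    """An instance succeeds if the predicted call sequence equals any reference."""
    return any(list(predicted) == list(ref) for ref in references)


def success_rate(
    predictions: Sequence[Sequence[str]],
    reference_sets: Sequence[Sequence[Sequence[str]]],
) -> float:
    """Percentage of instances whose prediction matches one accepted answer."""
    if len(predictions) != len(reference_sets):
        raise ValueError("predictions and reference_sets must have equal length")
    if not predictions:
        return 0.0
    hits = sum(is_success(p, refs) for p, refs in zip(predictions, reference_sets))
    return 100.0 * hits / len(predictions)


# Hypothetical API names; each instance may accept several call sequences.
reference_sets = [
    [["get_calendar_events()", "set_alarm(time=07:00)"], ["set_alarm(time=07:00)"]],
    [["open_navigation(dest=office)"]],
]
predictions = [
    ["set_alarm(time=07:00)"],       # matches the second accepted answer
    ["open_navigation(dest=home)"],  # wrong argument, counted as a failure
]
rate = success_rate(predictions, reference_sets)  # 50.0
```

Accepting several reference sequences per instance matters because a proactive task rarely has one correct plan.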

ParkGaussian: 3D Gaussian Splatting for Autonomous Parking

For parking in narrow, low‑light underground garages, 3D reconstruction and aligned simulation are still lacking. The ParkRecon3D dataset provides 4‑view fisheye data with dense parking‑spot annotations. ParkGaussian adapts 3D Gaussian splatting to fisheye input, projects to BEV via differentiable inverse perspective mapping (IPM), and employs a slot‑aware strategy that uses a pretrained spot detector to constrain key regions, achieving high‑quality reconstruction aligned with downstream perception tasks.

UFO: Unified Feed‑Forward and Optimization‑Based Large Driving Scene Modeling

Existing feed‑forward models face quadratic complexity with long sequences and struggle with dynamic objects. UFO maintains an iteratively updatable 4D scene representation, uses visibility‑based filtering for efficient long‑sequence handling, and introduces object‑pose‑guided modeling for accurate long‑term dynamics. On Waymo data, UFO reconstructs a 16‑second log in 0.5 s with superior visual and geometric quality compared to scene‑wise optimization and other feed‑forward baselines.

TraqPoint: Track‑Aware Policy Gradients for Keypoint Detection

Traditional keypoint detectors optimize pairwise matching and ignore long‑term trackability, leading to drift under large viewpoint or illumination changes. TraqPoint reframes keypoint detection as a sequential decision problem, using a two‑stage “describe‑then‑detect” training pipeline. It adds a trajectory reward that balances feature stability, ranking, and uniqueness, and employs global‑grid mixed sampling to improve saliency and coverage. Results show significant gains in pose estimation, visual odometry, and 3D reconstruction.

SimScale: Scalable Real‑World Simulation for Driving

Critical safety and tail‑case scenarios are scarce in real driving logs. SimScale builds a large‑scale simulation framework that combines neural rendering with reactive environments to synthesize high‑fidelity multi‑view data from existing logs. A “pseudo‑expert” trajectory generator provides high‑quality behavior supervision. Scaling experiments on NAVSIM demonstrate robust performance gains, offering a data‑efficient path to closed‑loop autonomous‑driving training without real‑vehicle collection.

MeanFuser: Fast One‑Step Multi‑Modal Trajectory Generation

Current trajectory planners rely on discrete anchor vocabularies and multi‑step sampling, limiting speed and continuity. MeanFuser introduces Gaussian Mixture Noise (GMN) to capture diverse motion modes and adopts the MeanFlow paradigm to model average velocity fields for single‑step sampling. An Adaptive Reconstruction Module (ARM) evaluates and refines proposals. On NAVSIM, MeanFuser runs at 59 FPS and outperforms GoalFlow and DiffusionDrive without PDMS supervision.
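The single‑step sampling idea can be illustrated with a toy Python sketch. In the MeanFlow formulation, a network predicts the average velocity u(z_t, r, t) over an interval, and a sample is obtained in one step as z_0 = z_1 - u(z_1, 0, 1), where z_1 is the noise. Everything else below is an assumption for illustration: the equal‑weight mixture, the 2‑D vectors, and the closed‑form toy_avg_velocity that stands in for a trained network.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def sample_gmn(means: Sequence[Vector], std: float, rng: random.Random) -> Vector:
    """Draw noise from an equal-weight Gaussian mixture (one component per motion mode)."""
    center = rng.choice(list(means))
    return [rng.gauss(m, std) for m in center]


def one_step_sample(
    noise: Vector, avg_velocity: Callable[[Vector, float, float], Vector]
) -> Vector:
    """MeanFlow-style single step: z_0 = z_1 - u(z_1, r=0, t=1)."""
    u = avg_velocity(noise, 0.0, 1.0)
    return [z - v for z, v in zip(noise, u)]


# Toy stand-in for a trained network: for a straight-line path from a fixed
# target to the noise, the average velocity over [0, 1] is (noise - target).
target = [2.0, -1.0]


def toy_avg_velocity(z: Vector, r: float, t: float) -> Vector:
    return [zi - ti for zi, ti in zip(z, target)]


rng = random.Random(0)
noise = sample_gmn([[-5.0, 0.0], [5.0, 0.0]], std=0.5, rng=rng)
trajectory_point = one_step_sample(noise, toy_avg_velocity)
```

Because the average velocity already integrates the whole path from noise to data, no iterative denoising loop is needed, which is where the speed advantage over multi‑step samplers comes from.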

DVGT: Driving Visual Geometry Transformer

Vision‑geometry models usually need precise camera parameters and explicit projection, restricting generality across cameras and complex driving scenes. DVGT is a data‑driven spatio‑temporal geometry transformer that leverages cross‑view and cross‑frame attention to reconstruct metric‑scale 3D point clouds and predict vehicle pose directly from raw multi‑view sequences, without external sensors or post‑alignment. Evaluated on five major datasets, DVGT consistently exceeds specialized models in reconstruction accuracy and generalization.

All papers include arXiv links and open‑source code repositories (e.g., https://github.com/xiaomi-research/revisor, https://github.com/xiaomi-research/emo‑r3, and https://github.com/xiaomi-research/timeviper), enabling reproducibility and further research.
