Xiaomi’s AI Research Secures Spots on ICLR 2026 – Papers and Key Findings
The International Conference on Learning Representations (ICLR) 2026 accepted multiple Xiaomi papers covering multimodal reasoning, reinforcement learning, GUI agents, autonomous driving, audio generation, and benchmark design. The papers introduce new frameworks and data‑centric training techniques, and each reports strong experimental results against existing baselines.
ICLR 2026 Acceptance Overview
ICLR, founded by Yoshua Bengio and Yann LeCun, announced its 2026 paper list, and several Xiaomi research works were accepted. The accepted papers span multimodal large‑language models, reinforcement learning, mobile GUI agents, autonomous driving, audio generation, and benchmark construction.
Shuffle‑R1: Efficient RL Framework for Multimodal LLMs
The authors identify two long‑overlooked issues in RL‑based fine‑tuning of multimodal models: Advantage Collapsing (most advantage values cluster near zero, weakening gradient signals) and Rollout Silencing (the number of trajectories that yield non‑zero gradients drops during training). To address these, Shuffle‑R1 introduces (1) Pairwise Trajectory Sampling, which selects high‑advantage trajectory pairs to boost gradient quality, and (2) Advantage‑based Batch Shuffle, a batch‑reordering algorithm that reshapes the data distribution to expose more valuable trajectories. Experiments on several multimodal reasoning benchmarks show that Shuffle‑R1 consistently outperforms multiple RL baselines with negligible extra computation.
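The two failure modes and the pair-selection idea can be illustrated with a small sketch. It is not the authors' implementation: the group-normalized advantage (reward minus group mean, divided by group standard deviation) and the "pair the best rollout with the worst" rule are assumptions made for illustration, and the function names `group_advantages` and `select_contrastive_pairs` are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantages (assumed form): (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Every rollout earned the same reward, so all advantages are zero
        # and the group contributes no gradient signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def select_contrastive_pairs(advantages, num_pairs):
    """Pair highest with lowest advantage, second highest with second lowest,
    and so on; keep the num_pairs pairs with the largest advantage gap."""
    order = sorted(range(len(advantages)), key=lambda i: advantages[i])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        pairs.append((order[hi], order[lo]))  # (high-advantage idx, low-advantage idx)
        lo += 1
        hi -= 1
    pairs.sort(key=lambda p: advantages[p[0]] - advantages[p[1]], reverse=True)
    return pairs[:num_pairs]

# Eight rollouts for one prompt; only two of them were rewarded.
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
adv = group_advantages(rewards)
pairs = select_contrastive_pairs(adv, num_pairs=2)
print(pairs)
```

In this toy group, the two retained pairs are exactly the ones that contrast a rewarded rollout with an unrewarded one; the remaining pairs have zero gap and would add little to the gradient.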
MobileIPL: Iterative Preference Learning for Mobile GUI Agents
Mobile GUI agents using the CoaT (Chain of Action‑Planning Thoughts) paradigm suffer from (1) a scarcity of high‑quality, diverse CoaT trajectories and (2) reliance on end‑result supervision, which cannot finely constrain intermediate reasoning steps. MobileIPL proposes (1) Thinking‑level DPO (T‑DPO), which iteratively samples CoaT trees, scores leaf nodes with rule‑based rewards, and back‑propagates the sparse result signals to intermediate steps, automatically constructing high‑quality preference pairs; and (2) Instruction Evolution, a three‑stage generate‑filter pipeline that expands the task distribution and mitigates over‑fitting during warm‑up SFT. The method achieves state‑of‑the‑art results on AITZ, AMEX, and AndroidControl and shows stronger out‑of‑distribution (OOD) robustness.
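The tree-scoring step of T‑DPO can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `Node` class, the mean-of-children backup rule, and the `margin` threshold for accepting a sibling pair are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    thought: str
    reward: Optional[float] = None          # set only on leaves by a rule-based scorer
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                      # filled in by backup()

def backup(node: Node) -> float:
    """Propagate leaf rewards upward; an inner node takes the mean of its children."""
    if not node.children:
        node.value = float(node.reward or 0.0)
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value

def preference_pairs(node: Node, margin: float = 0.5) -> List[Tuple[str, str]]:
    """Collect (chosen, rejected) sibling pairs whose value gap is at least margin."""
    pairs = []
    kids = sorted(node.children, key=lambda c: c.value, reverse=True)
    if len(kids) >= 2 and kids[0].value - kids[-1].value >= margin:
        pairs.append((kids[0].thought, kids[-1].thought))
    for child in node.children:
        pairs.extend(preference_pairs(child, margin))
    return pairs

# Toy CoaT tree: step A always leads to success, step B only half of the time.
tree = Node("root", children=[
    Node("A", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)]),
    Node("B", children=[Node("b1", reward=0.0), Node("b2", reward=1.0)]),
])
backup(tree)
pairs = preference_pairs(tree)
print(pairs)
```

Here the leaf-level outcome signal becomes a preference between the intermediate steps A and B, which is the kind of step-level supervision that end-result rewards alone do not provide.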
FutureMind: Strategic Thinking‑Pattern Priors via Adaptive Knowledge Distillation
Small language models (SLMs) struggle with multi‑hop reasoning and complex retrieval. FutureMind distills strategic thinking patterns from large language models without extra training or parameters. It extracts high‑level cognition (problem analysis, condition sorting, strategy planning, retrieval decision) and builds a dynamic reasoning pipeline composed of analysis, logical reasoning, planning, and retrieval modules, supported by three retrieval paradigms (forward, backward, parallel). Experiments on multi‑hop QA benchmarks demonstrate SOTA performance over strong baselines such as Search‑o1, while highlighting remaining bottlenecks due to teacher‑student cognitive gaps.
ThinkOmni: Training‑Free Omni‑modal Reasoning via Guidance Decoding
Current omni‑modal models excel at perception but lack deep logical reasoning, leading to a “strong perception, weak reasoning” imbalance. ThinkOmni introduces a training‑free framework that attaches a pre‑trained reasoning LLM as a guide to perception models. It consists of LRM‑as‑a‑Guide, which uses a large reasoning model (LRM) to steer the decoding of the omni‑modal LLM (OLLM), and Stepwise Contrastive Scaling, which balances perception and reasoning signals at each decoding step. The approach yields consistent gains across six multimodal reasoning benchmarks.
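The general idea of steering one model's decoding with another can be sketched with a simple blend of per-token log-probabilities. The article does not give ThinkOmni's actual combination rule, so the linear blend below, the weight `alpha`, and the function `guided_step` are assumptions for illustration only; both models are also assumed to share a vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def guided_step(omni_logits, guide_logits, alpha):
    """Pick the next token from a blend of the two models' log-probabilities.

    alpha = 0 uses only the omni-modal model; alpha = 1 uses only the guide.
    In a stepwise scheme, alpha would be chosen anew at each decoding step.
    """
    lp_omni = log_softmax(omni_logits)
    lp_guide = log_softmax(guide_logits)
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(lp_omni, lp_guide)]
    return max(range(len(mixed)), key=lambda i: mixed[i])

# Toy three-token vocabulary: the omni model prefers token 0, the guide token 1.
omni_logits = [2.0, 1.0, 0.0]
guide_logits = [0.0, 3.0, 0.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, guided_step(omni_logits, guide_logits, alpha))
```

Because the blend happens only at decoding time, neither model's weights change, which is what makes this family of approaches training-free.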
SMAN‑Bench: Cross‑System Benchmark for Mobile Agents
To resolve the “unstable online environment vs. overly uniform offline trajectories” dilemma, SMAN‑Bench builds a large‑scale graph‑structured corpus (Mobile3M) and proposes a slot‑based instruction generation method (GIAS). This enables precise multi‑path reward evaluation offline and injects realistic ad noise and fuzzy instructions to simulate high‑fidelity mobile operations. The benchmark provides a rigorous platform for assessing planning ability, robustness to interference, and interactive intelligence of multimodal agents.
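Offline multi-path evaluation on a page graph can be sketched as below. The graph layout, page and action names, and the binary reward are hypothetical and are not taken from Mobile3M or GIAS; the point is only that any valid route to a goal page is rewarded, while a detour into an ad page is not.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical page graph: graph[page][action] -> next page.
Graph = Dict[str, Dict[str, str]]

def replay(graph: Graph, start: str, actions: List[str]) -> Tuple[str, bool]:
    """Replay actions offline; return (final page, whether every action was valid)."""
    page = start
    for action in actions:
        nxt = graph.get(page, {}).get(action)
        if nxt is None:
            return page, False  # action not available on this page
        page = nxt
    return page, True

def reward(graph: Graph, start: str, goals: Set[str], actions: List[str]) -> float:
    """1.0 if the trajectory is valid and ends on any goal page, else 0.0."""
    page, ok = replay(graph, start, actions)
    return 1.0 if ok and page in goals else 0.0

graph: Graph = {
    "home": {"tap_search": "search", "tap_cart": "cart", "tap_ad": "ad"},
    "search": {"tap_item": "item"},
    "cart": {"tap_item": "item"},
    "ad": {"tap_close": "home"},
}
goals = {"item"}

via_search = reward(graph, "home", goals, ["tap_search", "tap_item"])
via_cart = reward(graph, "home", goals, ["tap_cart", "tap_item"])
stuck_on_ad = reward(graph, "home", goals, ["tap_ad"])
recovered = reward(graph, "home", goals, ["tap_ad", "tap_close", "tap_cart", "tap_item"])
print(via_search, via_cart, stuck_on_ad, recovered)
```

A single reference trajectory would have rejected one of the two valid routes; checking the end state on a graph accepts both, and it also credits an agent that closes the ad and recovers.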
Flow2GAN: Hybrid Flow Matching and GAN for Few‑step High‑Fidelity Audio Generation
Existing audio generators rely either on GANs, which converge slowly, or on diffusion‑based methods such as Flow Matching, which require multi‑step inference. Flow2GAN first pre‑trains a Flow Matching model with two modifications: (1) reformulating the objective as endpoint estimation to avoid optimizing velocity fields in empty‑energy regions, and (2) applying spectral‑energy‑based loss scaling to better model low‑energy (quiet) regions. A lightweight GAN fine‑tuning stage then turns the model into a single‑step generator. A multi‑branch network models Fourier coefficients at multiple time‑frequency resolutions, achieving higher fidelity than state‑of‑the‑art GAN and Flow Matching baselines while maintaining efficient inference.
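The difference between velocity prediction and endpoint estimation can be written out for the common linear interpolation path. The article does not state which path Flow2GAN uses, so the path below and the symbols $x_0$ (noise), $x_1$ (data), $v_\theta$, and $f_\theta$ are assumptions for illustration.

```latex
% Linear interpolation path between noise x_0 and data x_1, with t in [0, 1):
x_t = (1 - t)\,x_0 + t\,x_1, \qquad v = \frac{\mathrm{d}x_t}{\mathrm{d}t} = x_1 - x_0 .

% Standard velocity-matching objective:
\mathcal{L}_{\text{vel}} = \mathbb{E}_{t,\,x_0,\,x_1}
  \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2 .

% Endpoint-estimation objective: the network predicts the clean signal directly.
\mathcal{L}_{\text{end}} = \mathbb{E}_{t,\,x_0,\,x_1}
  \big\| f_\theta(x_t, t) - x_1 \big\|^2 .

% Since x_1 - x_t = (1 - t)(x_1 - x_0), a velocity can be recovered from the endpoint estimate:
\hat{v}(x_t, t) = \frac{f_\theta(x_t, t) - x_t}{1 - t} .
```

Under this path the two parameterizations carry the same information, but the endpoint target $x_1$ is the audio signal itself rather than a difference involving noise.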
ReCogDrive: Reinforced Cognitive Framework for End‑to‑End Autonomous Driving
End‑to‑end driving pipelines often output trajectories as language tokens, causing infeasible motions and low reasoning efficiency. ReCogDrive integrates a visual language model, diffusion‑based trajectory planning, and reinforcement learning. It injects human driving priors via a hierarchical cognition data pipeline, uses a cognition‑guided diffusion planner to map high‑level semantics to continuous actions, and refines policies with DiffGRPO RL in simulation. Experiments on NAVSIM and Bench2Drive show significant improvements over existing methods in both open‑loop and closed‑loop evaluations.
WorldSplat: Gaussian‑Centric Feed‑Forward 4D Scene Generation for Autonomous Driving
Current 4D driving scene generators lack 3D consistency and multi‑view controllability. WorldSplat proposes a two‑stage pipeline: (1) a 4D‑aware diffusion model that generates pixel‑aligned 4D Gaussians in a feed‑forward manner, and (2) an enhanced video diffusion model that refines rendered novel‑view videos from these Gaussians. Extensive experiments demonstrate high‑quality, temporally and spatially consistent multi‑trajectory driving videos across several benchmarks.
Dream4Drive: Rethinking Driving World Models as Synthetic Data Generators
The paper argues that more synthetic data does not automatically improve perception; instead, high‑quality, controllable synthetic data is key. Dream4Drive decomposes 3D perception‑guided maps, edits 3D assets, and renders world models to produce multi‑view, photorealistic driving videos. Using only 420 high‑quality synthetic samples (≈2 % of real data volume) yields perception models that surpass baselines trained on full real datasets.
DIPOLE: Dichotomous Diffusion Policy Optimization
DIPOLE revisits KL‑regularized RL objectives and introduces greedy policy regularization, splitting the optimal policy into reward‑maximizing and reward‑minimizing components. During inference, a linear combination of the two probability scores controls greediness. Experiments on ExORL, OGBench, a 10‑billion‑parameter VLA model, and the NAVSIM autonomous‑driving benchmark show consistent performance gains.
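How a linear combination of two scores can act as a greediness knob is easy to see in one dimension. The sketch below is not DIPOLE's formulation: it assumes both policies are Gaussians with a shared standard deviation, and the combination `(1 + w) * s_pos - w * s_neg` with weight `w` is a hypothetical choice, similar in shape to classifier-free guidance.

```python
def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of N(mu, sigma^2) at x."""
    return -(x - mu) / sigma ** 2

def combined_score(x, mu_pos, mu_neg, sigma, w):
    """Assumed combination: (1 + w) * s_pos - w * s_neg.

    w = 0 recovers the reward-maximizing policy alone; larger w pushes
    actions further away from the reward-minimizing policy.
    """
    s_pos = gaussian_score(x, mu_pos, sigma)
    s_neg = gaussian_score(x, mu_neg, sigma)
    return (1 + w) * s_pos - w * s_neg

def combined_mode(mu_pos, mu_neg, w):
    """Point where the combined score is zero when both policies share sigma."""
    return mu_pos + w * (mu_pos - mu_neg)

mu_pos, mu_neg, sigma = 1.0, -1.0, 1.0
for w in (0.0, 0.5, 1.0):
    x_star = combined_mode(mu_pos, mu_neg, w)
    print(w, x_star, combined_score(x_star, mu_pos, mu_neg, sigma, w))
```

In this toy setting the preferred action moves from 1.0 (w = 0) to 2.0 (w = 0.5) to 3.0 (w = 1), so a single scalar at inference time trades off staying close to the reward-maximizing policy against moving away from the reward-minimizing one.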
Overall, the accepted Xiaomi papers collectively advance data‑centric training, efficient inference, and benchmark design across a broad spectrum of AI sub‑fields.