500 M Videos Yield the Largest Open‑Source GUI Dataset; 3B Model Cuts Inference Tokens 71% and Beats Larger Models (Xiaomi AI at ICML 2026)
Xiaomi's AI team mined 500 million internet videos to create the world's largest open-source GUI dataset, demonstrated that a 3 B-parameter model can cut inference tokens by 71% while surpassing larger models, and presented a suite of ICML 2026 papers covering data scaling, benchmarking, reasoning, multimodal perception, and training stability for GUI agents and other AI tasks.
Overview
Xiaomi and its collaborators had 11 papers accepted at ICML 2026. Together they form an AI capability roadmap with three layers: foundational training stability (MoE R3, SPARK), capabilities (LED, VeriTime, Visual Para-Thinker, Video-OPD, GAD), and applications (the full GUI agent stack of data, evaluation, and inference architecture, plus visual content generation). The work showcases large-scale data construction, systematic benchmarking, and novel reasoning architectures.
Video2GUI and WildGUI Dataset
The authors identified data scarcity as the primary bottleneck for training GUI agents. By mining 500 million internet tutorial videos, they built a two‑stage pipeline (metadata coarse filtering + content fine filtering) that selects 4.2 million high‑quality tutorials. Using Gemini‑3‑Pro, each video is converted into structured trajectories containing task instructions, action timestamps, and screen coordinates. The resulting WildGUI dataset comprises 12.7 million trajectories and 124.5 million screenshots, covering over 1,500 applications across five major platforms.
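The two-stage selection described above can be pictured with a small sketch. This is an illustration only: the `Video` fields, the keyword list, the duration bounds, and the `screen_ratio` score are hypothetical stand-ins, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    duration_s: int
    screen_ratio: float  # hypothetical: share of sampled frames that show a GUI


def coarse_filter(v: Video) -> bool:
    # Stage 1: cheap metadata checks (keywords and duration bounds are assumed).
    keywords = ("tutorial", "how to", "guide")
    has_keyword = any(k in v.title.lower() for k in keywords)
    return has_keyword and 30 <= v.duration_s <= 1800


def fine_filter(v: Video, min_screen_ratio: float = 0.8) -> bool:
    # Stage 2: costlier content check, stubbed here as a precomputed score.
    return v.screen_ratio >= min_screen_ratio


def select(videos):
    # Run the cheap filter first so the expensive one sees fewer candidates.
    survivors = [v for v in videos if coarse_filter(v)]
    return [v for v in survivors if fine_filter(v)]


videos = [
    Video("How to export a PDF in a spreadsheet app", 240, 0.95),
    Video("Funny cat compilation", 300, 0.05),
    Video("Spreadsheet tutorial (webcam lecture)", 600, 0.30),
]
selected = select(videos)
print([v.title for v in selected])
```

Ordering the stages from cheap to expensive is the usual reason for a coarse-then-fine design: most of the 500 million candidates can be rejected without ever decoding video content.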
Pre‑training on WildGUI boosted MiMo‑VL‑7B’s OSWorld‑G score to 67.6, surpassing Qwen3‑VL‑32B and Seed1.5‑VL, and raised ScreenSpot‑Pro accuracy from 41.2 % to 56.9 % (≈38 % relative gain). Scaling experiments showed continuous performance improvement up to 200 B pre‑training tokens without saturation.
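The relative gain quoted for ScreenSpot-Pro follows directly from the two reported accuracies; a short check of the arithmetic:

```python
def relative_gain(before: float, after: float) -> float:
    # Relative improvement as a fraction of the starting value.
    return (after - before) / before


gain = relative_gain(41.2, 56.9)   # accuracies reported in the text above
print(f"{gain:.1%}")               # prints 38.1%
```

The absolute gain is 15.7 percentage points; dividing by the 41.2 baseline gives the roughly 38% relative figure.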
HyperTrack and GUIEvalKit
To address evaluation gaps, the team constructed HyperTrack, the largest Chinese mobile GUI navigation dataset with >16,000 real-task trajectories across 674 Android apps and 17 categories. They also released GUIEvalKit, an open-source evaluation toolkit integrating five major benchmarks and supporting unified offline and semi-online evaluation for over 30 models. A decision-level evaluation framework was introduced to move beyond simple execution accuracy toward behavior-distribution analysis.
Key finding: Model performance scales roughly logarithmically with training data size, and GRPO‑based reinforcement learning fine‑tuning consistently outperforms supervised fine‑tuning at comparable data scales. However, richer inference modes expand the decision space while reducing stability, guiding practical optimization for Xiaomi’s mobile agents.
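For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, instead of relying on a learned value model. The sketch below shows only that group-relative normalization; the reward values are illustrative, and the choice of population standard deviation and the `eps` term are assumptions that vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the mean and spread of its own group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four rollouts of one GUI task (1 = task completed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update pushes probability toward the behaviors that worked within each group.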
CoME: Channel‑of‑Mobile‑Experts
Traditional MoE routing mismatches the multi‑stage reasoning required for GUI agents. CoME proposes a four‑stage inference pipeline (screen summarization, sub‑task planning, action decision, function call) with dedicated experts for each stage and output‑oriented activation. Information‑gain metrics select the most effective reasoning paths, mitigating error propagation.
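As a rough mental model of the four-stage flow, the sketch below dispatches each stage to its own handler. It is not the CoME architecture: the real experts live inside the network, whereas these are plain functions, and every name and string here is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]


def summarize_screen(s: State) -> State:
    return {**s, "summary": f"visible: {s['screen']}"}


def plan_subtask(s: State) -> State:
    return {**s, "subtask": f"next step toward '{s['goal']}'"}


def decide_action(s: State) -> State:
    return {**s, "action": "tap"}


def emit_function_call(s: State) -> State:
    return {**s, "call": f"{s['action']}(target='{s['screen']}')"}


# One dedicated "expert" per stage; dicts preserve insertion order.
EXPERTS: Dict[str, Callable[[State], State]] = {
    "screen_summarization": summarize_screen,
    "subtask_planning": plan_subtask,
    "action_decision": decide_action,
    "function_call": emit_function_call,
}


def run_pipeline(state: State) -> Tuple[State, List[str]]:
    trace = []
    for stage, expert in EXPERTS.items():
        state = expert(state)   # each stage consumes the previous stage's output
        trace.append(stage)
    return state, trace


final, trace = run_pipeline({"screen": "Settings button", "goal": "enable dark mode"})
print(trace)
print(final["call"])
```

The point of the toy is the dependency chain: a mistake in an early stage flows into every later one, which is the error propagation the information-gain selection is meant to limit.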
Experiments show CoME outperforms dense GUI agents and sparse MoE models while using fewer activated parameters and less training data, enabling smaller, more stable on-device agents.
LED: Latent‑Exploration‑Decoding
Models post-trained with reinforcement learning often suffer "entropy collapse," where raising the sampling temperature no longer increases output diversity. The authors observed that intermediate hidden states still retain uncertainty. LED leverages this by sampling from the aggregated probability distribution of hidden layers during decoding, restoring exploration without modifying the model or adding parameters.
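A minimal sketch of the idea follows, under two loud assumptions: that each layer's hidden state can be turned into token logits (for example by projecting it through the output head), and that the per-layer distributions are combined by a uniform average. The paper's actual aggregation rule is not given in this summary, and the logits below are made up for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def aggregate(layer_logits, weights=None):
    # Mix per-layer token distributions (uniform weights are an assumption).
    dists = [softmax(l) for l in layer_logits]
    weights = weights or [1.0 / len(dists)] * len(dists)
    vocab = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab)]


# Toy logits over a 3-token vocabulary; the last row plays the final layer,
# which has collapsed onto token 0.
layer_logits = [[1.0, 1.0, 0.0], [2.0, 1.0, 0.0], [10.0, 0.0, 0.0]]

final_dist = softmax(layer_logits[-1])
mixed_dist = aggregate(layer_logits)

rng = random.Random(0)
token = rng.choices(range(len(mixed_dist)), weights=mixed_dist, k=1)[0]
print(entropy(final_dist), entropy(mixed_dist), token)
```

In this toy case the final layer is almost deterministic, while the mixture keeps noticeable mass on the other tokens, which is the kind of recovered diversity the method relies on.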
Across five models and six benchmarks, LED improves pass@k and enhances RL training effectiveness, enabling small models (3‑4 B) to match or exceed larger proprietary LLMs in inference capability.
VeriTime: Time‑Series Reasoning
Time‑series data is ubiquitous in IoT, EV batteries, smart homes, and industrial energy management. Existing LLMs lack dedicated training data and RL algorithms for temporal reasoning. VeriTime introduces a three‑stage pipeline: TSRgen synthesizes a process‑verifiable time‑series‑text multimodal dataset (TSRBench); a data‑scheduling mechanism orders samples by difficulty and task type; a two‑stage RL fine‑tuning uses fine‑grained, multi‑objective rewards for intermediate reasoning steps.
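The data-scheduling step can be pictured as a curriculum sort. The sketch below assumes three difficulty bands and groups samples by task type inside each band; the band thresholds, task names, and scores are hypothetical, and the paper's actual scheduling rule may differ.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    task: str
    difficulty: float  # hypothetical score in [0, 1]


def band(difficulty: float) -> int:
    # Coarse difficulty bands: 0 = easy, 1 = medium, 2 = hard (assumed cut-offs).
    if difficulty < 1 / 3:
        return 0
    if difficulty < 2 / 3:
        return 1
    return 2


def schedule(samples):
    # Easy bands first; inside a band, keep samples of one task type together.
    return sorted(samples, key=lambda s: (band(s.difficulty), s.task, s.difficulty))


samples = [
    Sample("anomaly_detection", 0.8),
    Sample("trend_description", 0.2),
    Sample("anomaly_detection", 0.3),
]
ordered = schedule(samples)
print([(s.task, s.difficulty) for s in ordered])
```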
Results show a 71 % reduction in inference token consumption, allowing 3-4 B models to match or surpass the performance of much larger LLMs on tasks such as battery anomaly detection, HVAC scheduling, and driving pattern analysis.
Visual Para‑Thinker
Current reasoning models often follow a single‑chain approach, limiting exploration in visual domains. Visual Para‑Thinker is the first parallel‑thinking framework for large multimodal models. It investigates block‑based and scan‑order image partitioning, introduces path‑aware attention and learnable parallel rotary positional encodings, and evaluates on V*, CountBench, RefCOCO, and HallusionBench. Both 3 B and 7 B models consistently outperform baseline sequential reasoning and majority‑vote methods.
Video‑OPD
Temporal video grounding is a core capability for video AI. Existing on‑policy RL methods suffer sparse sequence‑level rewards and high computational cost. Video‑OPD replaces sparse rewards with fine‑grained token‑level supervision from a teacher model while preserving online policy optimization. A teacher‑verification‑difference‑focused curriculum selects trajectories where the teacher is reliable but the student diverges most.
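One way to read the curriculum rule is sketched below. It assumes teacher reliability is judged by temporal IoU against the ground-truth interval and that divergence is the IoU gap between teacher and student; both are assumptions made for illustration, as are the threshold and the sample data. Temporal IoU itself is the standard overlap measure for grounding.

```python
def temporal_iou(a, b):
    # Overlap of two (start, end) intervals in seconds, divided by their union.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_for_distillation(samples, teacher_min_iou=0.7, top_k=2):
    scored = []
    for s in samples:
        teacher_score = temporal_iou(s["teacher"], s["gt"])
        if teacher_score < teacher_min_iou:
            continue  # teacher is unreliable here, so skip the sample
        gap = teacher_score - temporal_iou(s["student"], s["gt"])
        scored.append((gap, s["id"]))
    scored.sort(reverse=True)  # largest teacher-student gap first
    return [sample_id for _, sample_id in scored[:top_k]]


samples = [
    {"id": "a", "gt": (10, 20), "teacher": (10, 20), "student": (30, 40)},
    {"id": "b", "gt": (0, 10), "teacher": (0, 10), "student": (0, 8)},
    {"id": "c", "gt": (5, 15), "teacher": (20, 30), "student": (5, 15)},
]
chosen = select_for_distillation(samples)
print(chosen)
```

Sample `c` is dropped because the teacher misses the interval entirely, so its token-level guidance would not be worth imitating.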
Video‑OPD surpasses state‑of‑the‑art GRPO by >17 % on average and generalizes strongly across broader video understanding benchmarks, delivering a superior efficiency‑performance trade‑off for applications like video editing, smart surveillance replay, and in‑car dashcam retrieval.
GAD: Geometry‑Aware Distillation
Diffusion model distillation often eliminates sensitivity to initial noise, causing mode collapse and reduced diversity. The authors attribute this to point‑wise output alignment that flattens response to input perturbations. GAD adds a Jacobian‑based response alignment regularizer, preserving local sensitivity without extra inference cost.
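The difference between point-wise alignment and response alignment can be shown with a toy example. The sketch approximates a Jacobian-vector product by finite differences and penalizes the mismatch between teacher and student responses; the toy functions, the probe direction, and the exact form of the penalty are assumptions, not the paper's formulation.

```python
def teacher(z):
    # Toy "generator": its output depends on the input noise z.
    return [2.0 * z[0], z[0] + z[1]]


def collapsed_student(z):
    # Matches the teacher at z = [1, 1] but ignores the noise entirely.
    return [2.0, 2.0]


def directional_response(f, z, v, eps=1e-4):
    # Finite-difference approximation of the Jacobian-vector product J_f(z) v.
    base = f(z)
    shifted = f([zi + eps * vi for zi, vi in zip(z, v)])
    return [(s - b) / eps for s, b in zip(shifted, base)]


def pointwise_loss(student, teacher_fn, z):
    return sum((s - t) ** 2 for s, t in zip(student(z), teacher_fn(z)))


def response_alignment_loss(student, teacher_fn, z, v):
    js = directional_response(student, z, v)
    jt = directional_response(teacher_fn, z, v)
    return sum((a - b) ** 2 for a, b in zip(js, jt))


z, v = [1.0, 1.0], [1.0, 0.0]
point = pointwise_loss(collapsed_student, teacher, z)
response = response_alignment_loss(collapsed_student, teacher, z, v)
print(point, response)
```

The collapsed student has zero point-wise loss at `z` yet a large response penalty, which is exactly the failure a sensitivity-preserving regularizer is meant to expose. The penalty is a training-time term only, consistent with the claim of no extra inference cost.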
Across multiple generation architectures and distillation methods, GAD improves functional consistency, restores layout control, and mitigates the diversity‑fidelity trade‑off, enabling richer image generation for Xiaomi’s mobile photo‑editing and wallpaper creation features.
SPARK: Structured Progressive Knowledge Activation for NAS
LLM‑driven neural architecture search (NAS) often suffers “functional entanglement,” where simultaneous modifications to operators and calls cause code errors. SPARK introduces a “locate‑then‑modify” paradigm, partitioning code into mutually exclusive Operator and Action regions and editing only one per iteration.
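The locate-then-modify loop can be sketched as follows, assuming the Operator region holds operator definitions and the Action region holds the calls that use them. The region contents, the `locate` heuristic, and the feedback strings are all hypothetical.

```python
REGIONS = ("operator", "action")


def locate(feedback: str) -> str:
    # Hypothetical heuristic: decide which single region the next edit targets.
    return "action" if "call" in feedback.lower() else "operator"


def modify(program: dict, region: str, new_code: str) -> dict:
    # Edit exactly one region; the other is copied through untouched.
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    updated = dict(program)
    updated[region] = new_code
    return updated


program = {
    "operator": "def block(x): return relu(conv3x3(x))",
    "action": "y = block(block(x))",
}

region = locate("accuracy plateaued: try a wider receptive field")
candidate = modify(program, region, "def block(x): return relu(conv5x5(x))")
changed = [r for r in REGIONS if candidate[r] != program[r]]
print(region, changed)
```

Because the calls refer to the stable name `block`, changing the operator body cannot break the call sites, which is the kind of entanglement the one-region-per-iteration rule avoids.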
Experiments show lower computational cost, higher accuracy, and better scalability, offering a practical path for Xiaomi’s edge‑AI, smart‑car, and smart‑home models to evolve efficiently.
Stabilizing MoE RL (R3)
Mixture-of-Experts (MoE) models are widely used in large language models but become unstable during RL fine-tuning because routing decisions differ between training and inference, leading to mismatched expert activation and gradient noise.
The proposed Rollout Routing Replay (R3) records routing distributions during inference and replays them during training, aligning the expert sets and reducing KL divergence. R3 improves stability across Qwen3‑MoE and DeepSeek‑V2‑Lite RL tasks with only ~3.45 % training speed overhead and modest memory increase.
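The mismatch and its fix can be illustrated for a single token. In the sketch, tiny numerical drift between the inference and training engines flips the top-2 expert choice; replaying the recorded choice keeps the expert set identical. Recomputing the gate weights from the training-side logits over the recorded experts is an assumption made here for illustration, and the logits are made up.

```python
import math


def top_k(logits, k):
    # Indices of the k largest router logits, returned in ascending index order.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def gates(logits, experts):
    # Softmax restricted to the chosen experts.
    m = max(logits[i] for i in experts)
    exps = {i: math.exp(logits[i] - m) for i in experts}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Router logits for the same token as seen by the two engines (illustrative).
rollout_logits = [2.00, 1.01, 1.00, -1.0]
train_logits = [2.00, 1.00, 1.01, -1.0]

recorded = top_k(rollout_logits, k=2)   # saved during rollout
fresh = top_k(train_logits, k=2)        # what training would pick on its own

replayed_gates = gates(train_logits, recorded)  # training reuses the recorded set
print(recorded, fresh, replayed_gates)
```

Without replay, the training pass would update expert 2 for a token that was actually generated with expert 1, which is one source of the gradient noise described above.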
Conclusion
Collectively, these works illustrate Xiaomi AI’s transition from isolated breakthroughs to a systematic, end‑to‑end capability stack, covering data generation, benchmarking, reasoning, multimodal perception, and training stability. The research underpins upcoming product features such as intelligent photo organization, voice‑aware assistants, autonomous vehicle perception, and on‑device AI assistants.
