Seven Xiaomi AI Papers Accepted at AAAI 2026: Multimodal, Embodied & Database Advances

AAAI 2026 accepted seven Xiaomi research papers, two of them as oral presentations. They cover multimodal sound editing, embodied 3D task scheduling, scalable Text-to-SQL schema linking, parallel speculative decoding, long-form speech QA, high-level spatial navigation, and VLM-driven adversarial trajectory editing for autonomous driving, with the datasets, methods, and benchmark gains of each summarized below.

Xiaomi Tech

Dec 1, 2025

AAAI 2026, one of the top international conferences in artificial intelligence, announced 4,167 accepted papers out of a record 23,680 submissions, an acceptance rate of 17.6%. Xiaomi contributed seven papers, two of which were selected for oral presentation. They span audio-visual editing, embodied intelligence, Text-to-SQL schema linking, inference-time decoding, speech QA, vision-language navigation, and autonomous driving.

AV‑Edit: Multimodal Generative Sound‑Effect Editing via Audio‑Visual Semantic Joint Control

Traditional sound-effect editing relies on low-level signal processing or coarse text prompts, which leads to inflexible edits and poor audio quality. AV-Edit introduces a contrastive audio-visual masked autoencoder (CAV-MAE-Edit) for multimodal pre-training, which learns aligned cross-modal representations. These representations are then used to train an editing-oriented multimodal diffusion Transformer (MM-DiT) that removes visually irrelevant sounds and generates missing audio consistent with the video content. Experiments show that AV-Edit produces high-quality, precisely edited audio and achieves state-of-the-art performance in both sound-effect editing and audio generation.

GRANT: Embodied 3D Grounding Scheduling with Operations‑Research Knowledge

Embodied agents are inefficient when they cannot run tasks in parallel (e.g., using a microwave while washing dishes). To address this, the authors define a new 3D grounding scheduling task and construct the ORS3D-60K dataset (60K tasks from 4K real scenes). GRANT, a multimodal large language model, employs a Scheduling Token Mechanism (STM): it first classifies each task as parallelizable or non-parallelizable, then invokes an external optimizer via a <SCH> token to generate an optimal execution sequence, which is injected back into the model. Compared with baselines, GRANT improves task-scheduling efficiency by 30.53% and also improves 3D grounding accuracy.
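The gain from separating parallelizable and non-parallelizable tasks can be seen in a toy schedule. The sketch below is not GRANT's external optimizer, which the summary does not describe. The `Task` fields, the durations, and the greedy rule (start the task with the longest unattended time first) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    active: float      # minutes the agent must be hands-on
    wait: float = 0.0  # minutes the task then runs unattended (0 = non-parallelizable)


def sequential_makespan(tasks):
    """Total time if the agent idles through every unattended phase."""
    return sum(t.active + t.wait for t in tasks)


def schedule(tasks):
    """Greedy order: parallelizable tasks first (longest wait first), then the rest."""
    parallel = sorted((t for t in tasks if t.wait > 0), key=lambda t: -t.wait)
    serial = [t for t in tasks if t.wait == 0]
    order = parallel + serial

    clock = 0.0   # time at which the agent becomes free again
    finish = 0.0  # time at which the last task (including its wait) completes
    for t in order:
        clock += t.active
        finish = max(finish, clock + t.wait)
    return [t.name for t in order], finish


# Hypothetical household tasks, in minutes.
tasks = [
    Task("wash dishes", active=4.0),
    Task("microwave meal", active=1.0, wait=5.0),
    Task("wipe table", active=2.0),
]

if __name__ == "__main__":
    order, makespan = schedule(tasks)
    print(order, makespan, sequential_makespan(tasks))
```

With these made-up durations, starting the microwave first lets the agent wash dishes and wipe the table while the meal heats, so the overlapped schedule finishes well before the strictly sequential one.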

AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text‑to‑SQL

In industrial-scale Text-to-SQL, feeding the full database schema to a large model introduces noise and can exceed the context limit. AutoLink mimics a database engineer's iterative workflow (retrieve → explore → verify → expand) to build a relevant schema subset dynamically. The method avoids traversing the full schema and achieves a strict schema recall rate (SRR) of 97.4% on Bird-Dev and 91.2% on Spider 2.0-Lite, the best reported results on both, while substantially reducing token consumption even for databases with more than 3,000 columns.
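The retrieve, explore, verify, expand loop can be illustrated with a toy schema linker. This is a minimal sketch under strong assumptions: AutoLink drives these steps with a language model, whereas the code below uses keyword matching and a hand-written verification rule. The function `link_schema`, the example schema, and the foreign-key format are hypothetical.

```python
def link_schema(question, schema, foreign_keys, max_rounds=3):
    """Build a small set of (table, column) pairs relevant to the question.

    schema: dict mapping table name -> list of column names
    foreign_keys: list of ((table, column), (table, column)) pairs
    """
    words = set(question.lower().split())

    # Retrieve: seed with columns whose name literally appears in the question.
    selected = {(t, c) for t, cols in schema.items() for c in cols if c in words}

    for _ in range(max_rounds):
        tables = {t for t, _ in selected}

        # Explore: look at foreign keys touching any table selected so far.
        candidates = [fk for fk in foreign_keys
                      if fk[0][0] in tables or fk[1][0] in tables]

        # Verify: keep a join only if BOTH of its tables are already relevant.
        verified = {col for fk in candidates
                    if fk[0][0] in tables and fk[1][0] in tables
                    for col in fk}

        # Expand: add the verified join columns; stop when nothing new appears.
        new = verified - selected
        if not new:
            break
        selected |= new
    return selected


schema = {
    "customers": ["id", "name", "city"],
    "orders": ["id", "customer_id", "amount"],
    "products": ["id", "title", "price"],
    "order_items": ["order_id", "product_id", "qty"],
}
foreign_keys = [
    (("orders", "customer_id"), ("customers", "id")),
    (("order_items", "order_id"), ("orders", "id")),
    (("order_items", "product_id"), ("products", "id")),
]

if __name__ == "__main__":
    subset = link_schema("total amount of orders per customer city",
                         schema, foreign_keys)
    print(sorted(subset))
```

Only four of the eleven columns end up in the subset, and the `products` and `order_items` tables are never passed to the downstream model. That is the kind of saving the paragraph above describes, here at toy scale.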

SpecFormer: Parallel Draft‑Token Generation for Large‑Batch Speculative Decoding

Speculative decoding speeds up autoregressive LLM inference by letting a draft model guess future tokens, which the main model then verifies. Existing approaches degrade at larger batch sizes because the draft-token count drops and serial draft generation becomes a bottleneck. SpecFormer stacks a unidirectional and a bidirectional Transformer layer and performs attention over both input tokens and draft tokens, enabling fully parallel draft-token prediction. The authors report stronger language modeling, higher draft quality, and better acceleration, especially at medium to large batch sizes.
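For readers unfamiliar with the baseline, here is a minimal sketch of generic draft-and-verify speculative decoding with greedy acceptance. It is not SpecFormer: the draft loop below is serial, which is exactly the bottleneck the paper targets. The two "models" are hypothetical toy functions over integer tokens.

```python
def target_next(tokens):
    """Toy target model: always counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(tokens, k):
    """Propose k draft tokens, then keep the prefix the target agrees with."""
    # Draft phase: serial here; a parallel drafter would emit all k at once.
    ctx = list(tokens)
    draft = []
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: a real system scores all positions in one target forward pass.
    ctx = list(tokens)
    accepted = []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


def generate(prompt, n_new, k=4):
    """Generate n_new tokens; returns (tokens, number of verification steps)."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(tokens, k))
        steps += 1
    return tokens[:len(prompt) + n_new], steps


if __name__ == "__main__":
    print(generate([3], 6))
```

The output always matches what plain greedy decoding with the target model would produce. The saving comes from needing fewer target verification steps than generated tokens.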

CLSR: Contrastive Language‑Speech Retrieval for Long‑Form Spoken Question Answering

Current long-audio QA systems struggle with the length of the audio and with the gap between the speech and text modalities. CLSR introduces an intermediate step that converts acoustic features into text-like representations before cross-modal alignment, which makes the two modalities easier to bridge. Experiments on four cross-modal retrieval benchmarks show that CLSR outperforms both end-to-end speech retrievers and traditional speech-to-text pipelines, establishing a solid foundation for practical long-audio QA.
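The summary does not give CLSR's training objective, but the name indicates a contrastive one. As a hedged illustration, the sketch below shows a generic InfoNCE-style loss and nearest-neighbor retrieval over paired embeddings. The two-dimensional vectors, the temperature, and the function names are assumptions, and the embeddings stand in for whatever text and speech encoders produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, passages, temperature=0.1):
    """Mean cross-entropy of picking passage i for query i among all passages."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def retrieve(query, passages):
    """Index of the passage most similar to the query."""
    return max(range(len(passages)), key=lambda i: cosine(query, passages[i]))


# Hypothetical embeddings: row i of each list describes the same content.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
speech_passages = [[0.9, 0.1], [0.2, 0.8]]

if __name__ == "__main__":
    print(info_nce(text_queries, speech_passages))
    print(info_nce(text_queries, speech_passages[::-1]))
    print(retrieve(text_queries[0], speech_passages))
```

The loss is low when each question sits closest to its own speech segment and higher when the pairing is shuffled, which is the pressure that pulls matching pairs together during training.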

SpNav: Spatial Navigation with High‑Level Human Instructions

The paper proposes a new spatial-navigation task in which agents follow high-level instructions such as "wait on the left side of the sofa." Unlike prior work, which is limited to object categories or detailed path commands, SpNav requires reasoning about spatial relations and supports two subtasks: Spatial Object Navigation (SpON) and Spatial Area Navigation (SpAN). The authors release a dataset of 10,000 trajectories and the hierarchical SpNav framework, which parses instructions with a vision-language model, locates targets via a trained NaviPoint model, and executes actions through a Map-to-Action module. SpNav achieves state-of-the-art navigation performance and demonstrates zero-shot transfer to real environments.
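To make the spatial-relation step concrete, the sketch below turns an instruction like "left side of the sofa" into a 2D goal point. It is a hand-written geometric stand-in, not the trained NaviPoint model. The function name, the map frame, the object-centric reading of "left", and the 1-meter offset are all assumptions.

```python
import math


def side_waypoint(obj_xy, heading, side, offset=1.0):
    """Goal point beside an object on a 2D map.

    obj_xy:  (x, y) position of the object in map coordinates
    heading: direction the object faces, in radians (0 = +x axis)
    side:    "left" or "right", relative to the object's own facing direction
    offset:  distance from the object center to the goal point
    """
    x, y = obj_xy
    # Unit vector to the object's left: facing direction rotated 90 degrees counterclockwise.
    left_x, left_y = -math.sin(heading), math.cos(heading)
    sign = {"left": 1.0, "right": -1.0}[side]
    return (x + sign * offset * left_x, y + sign * offset * left_y)


if __name__ == "__main__":
    # Hypothetical sofa at (2, 3) facing the +x direction.
    print(side_waypoint((2.0, 3.0), 0.0, "left"))
    print(side_waypoint((2.0, 3.0), 0.0, "right"))
```

Whether "left" means the sofa's left or the speaker's left is ambiguous in natural language. The sketch fixes one reading by assumption, whereas a learned model has to resolve it from context.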

VILTA: VLM-in-the-Loop Trajectory Adversary for Enhancing Driving Policy Robustness

Open-source autonomous-driving datasets lack diversity in long-tail scenarios, which limits policy robustness. VILTA embeds a vision-language model (VLM) directly into the training loop, forming a "Vision-Language-Editing" paradigm that adversarially edits the future trajectories of surrounding vehicles. A post-processing step ensures that the edited trajectories remain kinematically feasible. In CARLA simulations, policies trained with VILTA-generated scenarios show significantly lower collision rates and greater robustness in extreme scenarios, offering a viable path toward more robust end-to-end driving agents.
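The summary does not say how the feasibility post-processing works. One simple possibility, shown here purely as an assumed illustration, is to cap the speed implied by consecutive waypoints. The function `clamp_speed`, the fixed time step, and the speed limit are hypothetical, and a realistic check would also bound acceleration and curvature.

```python
import math


def clamp_speed(points, dt, v_max):
    """Shorten any step that would require moving faster than v_max.

    points: list of (x, y) waypoints sampled every dt seconds
    dt:     time between consecutive waypoints, in seconds
    v_max:  maximum allowed speed, in meters per second
    """
    limit = v_max * dt  # longest distance coverable in one time step
    feasible = [points[0]]
    for x, y in points[1:]:
        prev_x, prev_y = feasible[-1]
        dx, dy = x - prev_x, y - prev_y
        dist = math.hypot(dx, dy)
        if dist > limit:
            # Keep the direction of travel but scale the step down to the limit.
            scale = limit / dist
            x, y = prev_x + dx * scale, prev_y + dy * scale
        feasible.append((x, y))
    return feasible


if __name__ == "__main__":
    # Hypothetical edited trajectory with one unrealistically long jump.
    edited = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    print(clamp_speed(edited, dt=0.5, v_max=10.0))
```

Each step is measured from the previous corrected point, so one clamped waypoint does not create a new infeasible jump later in the trajectory.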

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
