Top AAAI 2026 Papers: New Vision‑Language‑Action Model, LLM2CLIP and More
AAAI 2026 in Singapore drew 23,680 submissions; highlights include ReconVLA's reconstructive vision‑language‑action model, LLM2CLIP's language‑enhanced multimodal representations, a sheaflet‑based hypergraph neural network design, model change for description logic concepts, and a novel causal discovery method for dynamical systems.
AAAI 2026 took place in Singapore from January 20–27, receiving 23,680 submissions and accepting 4,167 papers (a 17.6% acceptance rate).
1. ReconVLA: Reconstructive Vision‑Language‑Action Model as Effective Robot Perceiver
Institutions: Hong Kong University of Science and Technology (Guangzhou), Westlake University, Zhejiang University, Monash University
Authors: Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, Haoang Li
Paper: https://arxiv.org/abs/2508.10333
Code: https://github.com/OpenHelix-Team/ReconVLA
ReconVLA introduces an implicit localization paradigm for vision‑language‑action (VLA) models by reconstructing gaze regions, enabling precise manipulation and strong generalization with only about 100k trajectories.
Implicit Localization Architecture: A “reconstructive VLA” paradigm aligns gaze regions with action targets, forcing the model to attend to critical visual areas and learn fine‑grained representations.
Large‑Scale Pre‑training Foundation: A dataset with over 100k trajectories and more than 2M samples dramatically improves the generalization of visual reconstruction.
The model comprises a reconstruction branch and an action branch, taking multi‑view images and textual instructions as input. The action branch outputs discrete action tokens, while the reconstruction branch outputs reconstruction tokens that denoise noisy gaze tokens into clean scene tokens, giving the model the strong visual grounding and fine‑grained understanding needed for precise actions.
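The two‑branch training objective described above can be sketched as follows. This is a minimal illustration only: the cross‑entropy loss on discrete action tokens, the MSE‑style denoising target, and the loss weighting are assumptions for exposition, not details taken from the paper.

```python
# Illustrative sketch of a ReconVLA-style joint objective: the action
# branch is scored with cross-entropy over discrete action tokens, and
# the reconstruction branch with an MSE denoising target that pushes
# noisy gaze tokens toward clean scene tokens. All names and the
# weighting are hypothetical, chosen for clarity.
import math


def action_loss(logits, target_idx):
    # Cross-entropy over discrete action tokens (log-sum-exp + NLL),
    # computed stably by subtracting the max logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]


def reconstruction_loss(denoised, clean):
    # Mean squared error between denoised gaze tokens and clean scene tokens.
    return sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)


def reconvla_loss(logits, target_idx, denoised, clean, recon_weight=0.5):
    # Joint objective: action prediction plus weighted gaze reconstruction.
    return action_loss(logits, target_idx) + recon_weight * reconstruction_loss(
        denoised, clean
    )
```

With uniform logits over four action tokens and a perfect reconstruction, the loss reduces to log 4, which makes the two terms easy to sanity-check in isolation.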
2. LLM2CLIP: Powerful Language Model Unlocks Richer Cross‑Modality Representation
Institutions: Tongji University, Microsoft, Macquarie University
Authors: Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu
Paper: https://arxiv.org/abs/2411.04997
Code: https://github.com/microsoft/LLM2CLIP/
LLM2CLIP first performs caption‑level contrastive fine‑tuning on a large language model (LLM), greatly enhancing its textual discrimination. The fine‑tuned LLM then serves as a teacher, providing richer, higher‑dimensional language supervision to CLIP’s visual encoder and overcoming the original text encoder’s context‑window and expressive limits. Experiments show consistent gains across diverse cross‑modal tasks.
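The caption‑level contrastive step can be illustrated with an InfoNCE objective in which each embedding is pulled toward its matching caption and pushed away from the other captions in the batch. The embedding dimension, temperature, and exact loss form below are assumptions for illustration, not the paper’s recipe.

```python
# Illustrative InfoNCE sketch of caption contrastive fine-tuning:
# anchors[i] should match positives[i]; every other caption in the
# batch serves as an in-batch negative. Names and the temperature
# value are hypothetical.
import math


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(anchors, positives, temperature=0.07):
    # Average InfoNCE loss over the batch: for each anchor, a softmax
    # over similarities to all captions, with the matching caption as
    # the positive class.
    total = 0.0
    for i, a in enumerate(anchors):
        sims = [cosine(a, p) / temperature for p in positives]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]
    return total / len(anchors)
```

When anchors and positives are correctly paired the loss is near zero; swapping the pairing drives it up sharply, which is the discrimination pressure the fine‑tuning stage exploits.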
3. High‑Pass Matters: Theoretical Insights and Sheaflet‑Based Design for Hypergraph Neural Networks
Institutions: Zhejiang Normal University, City University of Hong Kong, Nanyang Technological University, University of Cambridge
Authors: Ming Li, Yujie Fang, Dongrui Shen, Han Feng, Xiaosheng Zhuang, Kelin Xia, Pietro Lio
Paper: https://arxiv.org/abs/xxxx.xxxxx
The work provides a theoretical analysis of high‑pass filters in hypergraph neural networks and proposes a sheaflet‑based architecture that improves expressive power on hypergraph data.
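As background for the high‑pass analysis (and explicitly not the paper’s sheaflet construction), a standard high‑pass hypergraph filter can be built from the normalized hypergraph Laplacian L = I − Dv^{−1/2} H W De^{−1} Hᵀ Dv^{−1/2}: applying L to a node signal removes its smooth, low‑frequency component and keeps the high‑frequency residual. A minimal sketch with unit edge weights:

```python
# Illustrative high-pass hypergraph filter (not the paper's method).
# H is the node-by-edge incidence matrix (H[v][e] = 1 if node v lies
# in hyperedge e), x is a per-node signal, and edge weights are 1.
import math


def high_pass(H, x):
    n, m = len(H), len(H[0])
    de = [sum(H[v][e] for v in range(n)) for e in range(m)]  # edge degrees
    dv = [sum(H[v][e] for e in range(m)) for v in range(n)]  # node degrees
    # Smooth component: Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} x
    y = [x[v] / math.sqrt(dv[v]) for v in range(n)]
    z = [sum(H[v][e] * y[v] for v in range(n)) / de[e] for e in range(m)]
    s = [sum(H[v][e] * z[e] for e in range(m)) / math.sqrt(dv[v]) for v in range(n)]
    # High-pass output: the signal minus its smooth component.
    return [x[v] - s[v] for v in range(n)]
```

On a single hyperedge containing all nodes, a constant signal is perfectly smooth and is filtered to zero, while a spike survives as a high‑frequency residual — the regime whose importance the paper analyzes theoretically.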
4. Model Change for Description Logic Concepts
Institutions: University of Oslo, Cardiff University
Authors: Ana Ozaki, Jandson S. Ribeiro
Paper: https://arxiv.org/abs/xxxx.xxxxx
This paper studies model‑change operations for description‑logic concepts and offers new algorithms for efficient updates.
5. Causal Structure Learning for Dynamical Systems with Theoretical Score Analysis (CADYT)
Institutions: Bosch AI Center, Darmstadt University of Technology, Duale Hochschule Baden‑Württemberg, Institute for AI in Medicine (IKIM)
Authors: Nicholas Tagliapietra, Katharina Ensinger, Christoph Zimmer, Osman Mian
Paper: https://arxiv.org/abs/2512.14361
CADYT addresses causal discovery in continuous‑time dynamical systems by building on differential causal models and employing Gaussian‑process inference. It uses a greedy search guided by Markov conditions and minimum description length to identify causal structures, outperforming discrete‑time baselines on both regularly and irregularly sampled data.
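The greedy, score‑guided search can be sketched in miniature. The toy version below swaps the paper’s Gaussian‑process likelihood for ordinary least squares and uses a BIC‑like penalty of |parents| · log N / 2 as a stand‑in for the minimum‑description‑length term; all function names and scoring details are illustrative assumptions, not CADYT’s actual implementation.

```python
# Toy greedy structure search guided by an MDL-style score: repeatedly
# add the candidate parent that most lowers the score of the child
# variable, stopping when no addition helps. Data rows are dicts
# mapping variable names to values.
import math


def _solve(A, b):
    # Gaussian elimination with partial pivoting for small linear systems.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def mdl_score(data, child, parents):
    # MDL-style score: data cost (N/2 * log residual variance under an
    # OLS fit with intercept) plus a BIC-like model cost per parent.
    N = len(data)
    X = [[1.0] + [row[p] for p in parents] for row in data]
    y = [row[child] for row in data]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(N)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(N)) for a in range(k)]
    beta = _solve(XtX, Xty)
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2 for i in range(N))
    var = max(rss / N, 1e-12)
    return N / 2 * math.log(var) + len(parents) * math.log(N) / 2


def greedy_parents(data, child, candidates):
    # Greedy forward selection: at each step, add the candidate that
    # yields the largest score improvement, if any improves it.
    parents, best = [], mdl_score(data, child, [])
    improved = True
    while improved:
        improved = False
        for c in candidates:
            if c in parents:
                continue
            s = mdl_score(data, child, parents + [c])
            if s < best:
                best, add, improved = s, c, True
        if improved:
            parents.append(add)
    return parents
```

The description‑length penalty is what keeps the search from absorbing spurious parents: an irrelevant variable reduces the residual only slightly, less than its coding cost, so it is rejected.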