Turning Multi‑Teacher Conflict into Dynamic Constraints: Robust Reasoning Alignment for Multimodal LLMs (ICML 2026)

APO (Autonomous Preference Optimization) converts the drift and conflict among multiple teacher multimodal LLMs into dynamic negative constraints while treating cross-teacher consensus as a positive preference, enabling robust concept alignment and superior diagnostic accuracy on the CXR‑MAX benchmark, as demonstrated by extensive experiments reported in the ICML 2026 paper.

01 Introduction

Current research on multimodal large language models (MLLMs) shows that aggregating several teacher models (often called multi‑teacher knowledge distillation) can boost performance, but differences in architecture and optimization lead each teacher to follow a distinct inference trajectory. The authors observe "concept drift" across teachers, where inference distributions shift sharply within a single reasoning session, producing logical conflicts, hallucinations, and semantic inconsistency in the student model.

02 Method

The authors define the non‑stationary multi‑stream concept alignment problem and propose the APO (Autonomous Preference Optimization) framework. APO treats teacher drift as a dynamic negative constraint and consensus among teachers as a positive preference, turning multi‑teacher conflict into a constraint‑satisfaction problem.

Multi‑stream inference drift is formalized by modeling each teacher's autoregressive trajectory as a sequence of states. For a single stream $i$, the state at step $j$ is the generated token sequence $s_j^{(i)} = (y_1^{(i)}, \ldots, y_j^{(i)})$, together with the predictive distribution $p^{(i)}(y_{j+1} \mid s_j^{(i)}, x)$ over the next token given the input $x$. Extending to $N$ streams, the collective state at step $j$ is $S_j = (s_j^{(1)}, \ldots, s_j^{(N)})$. Non‑stationarity is detected when the joint distribution over collective states changes between steps $j$ and $j + \Delta$, i.e. $P(S_{j+\Delta}) \neq P(S_j)$, indicating multi‑stream drift.
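A minimal sketch of what such a drift test might look like in practice; the symmetric-KL statistic, per-stream comparison, and threshold below are assumptions for illustration, not the paper's exact detector:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two next-token distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def detect_multistream_drift(dists_at_j, dists_at_j_delta, threshold=0.5):
    """Flag multi-stream drift when any teacher's predictive
    distribution shifts sharply between step j and step j + delta.

    dists_at_j, dists_at_j_delta: per-teacher next-token
    distributions (one array per stream) at the two steps.
    """
    divergences = [
        symmetric_kl(p, q) for p, q in zip(dists_at_j, dists_at_j_delta)
    ]
    return any(d > threshold for d in divergences), divergences
```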

Supervised consensus synthesis first projects the student model into the union of all teacher capabilities, then extracts a consensus context by aggregating each teacher’s raw inference traces. The student acts as a discriminator, filtering out contradictory signals and amplifying the logical intersection, producing a coherent consensus trajectory.
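In the paper the student model itself plays the discriminator; the majority-vote filter below is a deliberately simplified stand-in for that step (function and variable names are hypothetical):

```python
from collections import Counter

def synthesize_consensus(final_answers, reasoning_traces):
    """Simplified consensus synthesis: keep only traces whose final
    answer agrees with the majority vote, treating the rest as
    contradictory signals to be filtered out.

    final_answers:    one final answer per teacher stream.
    reasoning_traces: the corresponding raw inference traces.
    """
    majority, _ = Counter(final_answers).most_common(1)[0]
    consistent_traces = [
        trace for ans, trace in zip(final_answers, reasoning_traces)
        if ans == majority
    ]
    return majority, consistent_traces
```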

Constraint‑aware preference optimization treats the consensus trajectory as a positive signal and the conflicting teacher traces as dynamic negative constraints. Extending Direct Preference Optimization (DPO), APO maximizes the probability of generating the consensus trajectory $y^{+}$ while explicitly suppressing the drift traces $y^{-}$, following the standard DPO form

$$\mathcal{L} = -\,\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)} - \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}\right)\right].$$

This dual objective forces the student both to raise the probability of the consensus and to lower the likelihood of drift patterns, effectively turning teacher conflict into a strong supervisory signal that requires no external annotation.
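A minimal sketch of this objective under those assumptions (the standard DPO loss with the consensus as the chosen response; how APO weights multiple simultaneous negative constraints is not spelled out in this summary):

```python
import torch.nn.functional as F

def apo_preference_loss(policy_pos_logp, policy_neg_logp,
                        ref_pos_logp, ref_neg_logp, beta=0.1):
    """DPO-style loss with the consensus trajectory as the positive
    sample and a drifting teacher trace as the negative constraint.

    All inputs are per-sequence log-probabilities (summed over
    tokens) under the student policy and a frozen reference model.
    """
    pos_logratio = policy_pos_logp - ref_pos_logp
    neg_logratio = policy_neg_logp - ref_neg_logp
    # Maximize the margin between consensus and drift trajectories.
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()
```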

03 Dataset Construction

To evaluate APO in a realistic non‑stationary setting, the authors build CXR‑MAX, a multimodal benchmark for chest‑X‑ray diagnosis. CXR‑MAX extends MIMIC‑CXR with inference traces from seven leading MLLMs (GPT‑5, Gemini‑2.5, Sonnet‑4, Grok‑4, Qwen‑VL‑MAX, GLM‑4.5V, Moonshot), totaling 170,982 instances covering 14 diseases.
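This summary does not publish the record format; a plausible per-instance layout, with every field name an assumption, might look like:

```python
# Hypothetical CXR-MAX instance layout; field names are assumptions
# for illustration, not the released schema.
instance = {
    "image": "path/to/mimic-cxr-study.jpg",  # source X-ray from MIMIC-CXR
    "question": "Does this chest X-ray show evidence of edema?",
    "label": "Edema",  # one of the 14 disease categories
    "teacher_traces": {
        "GPT-5": "...raw inference trace...",
        "Gemini-2.5": "...raw inference trace...",
        # traces from the remaining five teacher MLLMs
    },
}
```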

04 Experimental Validation

APO‑trained 7B models achieve a mean accuracy of 0.78 across all disease categories, surpassing every teacher model, including GPT‑5, on the CXR‑MAX test set (see Table 1). In categories with extreme teacher disagreement (e.g., Consolidation and Edema, where teacher accuracies differ by more than 70%), APO stays within the top two performers, demonstrating stability and robustness. The results confirm that converting drift into dynamic constraints prevents the propagation of teacher bias and hallucinations, yielding reliable reasoning.

05 Conclusion

APO marks a shift from static knowledge distillation to dynamic constraint‑based learning for multimodal LLMs. By formalizing teacher drift as negative constraints and embedding concept alignment into a constraint‑satisfaction problem, APO advances robust reasoning alignment in high‑risk, rapidly changing domains.
