Turning Multi‑Teacher Conflict into Dynamic Constraints: Robust Reasoning Alignment for Multimodal LLMs (ICML 2026)
APO (Autonomous Preference Optimization) converts the drift and conflict among multiple teacher multimodal LLMs into dynamic negative constraints while treating inter‑teacher consensus as a positive preference. The paper reports robust concept alignment and superior diagnostic accuracy on the CXR‑MAX benchmark.
01 Introduction
Current research on multimodal large language models (MLLMs) shows that aggregating several teacher models—often called multi‑teacher knowledge distillation—can boost performance, but differences in architecture and optimization cause each teacher to follow a distinct inference trajectory. The authors observe "concept drift" across teachers, where inference distributions shift dramatically during a single reasoning session, leading to logical conflicts, hallucinations, and semantic inconsistency in the student model.
02 Method
The authors define the non‑stationary multi‑stream concept alignment problem and propose the APO (Autonomous Preference Optimization) framework. APO treats teacher drift as a dynamic negative constraint and consensus among teachers as a positive preference, turning multi‑teacher conflict into a constraint‑satisfaction problem.
Multi‑stream inference drift is formalized by modeling each teacher's autoregressive trajectory as a sequence of states. For a single stream, the state at step j is the generated token sequence $s_j = (y_1, \dots, y_j)$, with predictive distribution $p(y_{j+1} \mid x, s_j)$ over the next token. Extending to N streams, the collective state at step j is $S_j = (s_j^{(1)}, \dots, s_j^{(N)})$. Non‑stationarity is detected when the joint distribution over collective states changes between steps j and j+Δ, i.e., $P(S_{j+\Delta}) \neq P(S_j)$, indicating multi‑stream drift.
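The drift test above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it assumes drift is flagged when the average KL divergence between each stream's predictive distribution at steps j and j+Δ exceeds a threshold (the threshold value and the KL choice are assumptions).

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def drift_detected(stream_dists_j, stream_dists_j_delta, threshold=0.5):
    """Flag multi-stream drift when the per-stream predictive distribution
    shifts, on average, by more than `threshold` nats between steps j and j+Delta.
    Each argument is a list of N token distributions, one per teacher stream."""
    shifts = [kl_divergence(p, q)
              for p, q in zip(stream_dists_j, stream_dists_j_delta)]
    return sum(shifts) / len(shifts) > threshold
```

A stable stream (identical distributions at both steps) passes the test, while a stream whose token preferences invert between steps trips the drift flag.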
Supervised consensus synthesis first projects the student model into the union of all teacher capabilities, then extracts a consensus context by aggregating each teacher’s raw inference traces. The student acts as a discriminator, filtering out contradictory signals and amplifying the logical intersection, producing a coherent consensus trajectory.
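The filtering step can be approximated with a simple agreement rule. The sketch below assumes step‑aligned traces and a majority vote, which is a simplification of the paper's discriminator‑based synthesis; the function and its parameters are illustrative.

```python
from collections import Counter

def consensus_trajectory(teacher_traces, min_agreement=0.5):
    """Keep, at each reasoning step, only the claim that a majority of teachers
    agree on; steps with no majority are set aside as conflicts.
    `teacher_traces` is a list of N traces, each a list of step-level claims."""
    n_teachers = len(teacher_traces)
    consensus, conflicts = [], []
    for step_claims in zip(*teacher_traces):
        claim, count = Counter(step_claims).most_common(1)[0]
        if count / n_teachers > min_agreement:
            consensus.append(claim)
        else:
            conflicts.append(list(step_claims))  # retained as negative constraints
    return consensus, conflicts
```

Note that the conflicting steps are not discarded: they become the dynamic negative constraints used in the next stage.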
Constraint‑aware preference optimization treats the consensus trajectory as a positive signal and the conflicting teacher traces as dynamic negative constraints. Extending Direct Preference Optimization (DPO), APO maximizes the probability of generating the consensus trajectory while explicitly suppressing drift patterns. This dual objective forces the student both to raise the consensus generation probability and to reduce drift, effectively turning teacher conflict into a strong supervisory signal without external annotation.
03 Dataset Construction
To evaluate APO in a realistic non‑stationary setting, the authors build CXR‑MAX, a multimodal benchmark for chest‑X‑ray diagnosis. CXR‑MAX extends MIMIC‑CXR with inference traces from seven leading MLLMs (GPT‑5, Gemini‑2.5, Sonnet‑4, Grok‑4, Qwen‑VL‑MAX, GLM‑4.5V, Moonshot), totaling 170,982 instances covering 14 diseases.
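A CXR‑MAX instance can be pictured as one record per case pairing the image and disease label with per‑teacher inference traces. The schema below is hypothetical: the field names and layout are illustrative, not the authors' actual release format.

```python
from dataclasses import dataclass

@dataclass
class CXRMaxInstance:
    """Hypothetical record layout for one CXR-MAX instance: a chest X-ray,
    a label from the 14 disease categories, and raw traces from each of the
    seven teacher MLLMs. Field names are assumptions for illustration."""
    image_path: str
    disease_label: str
    teacher_traces: dict  # teacher name -> raw inference trace
    consensus_trace: str = ""  # filled in by consensus synthesis
```
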
04 Experimental Validation
APO‑trained 7B models achieve a mean accuracy of 0.78 across all disease categories, surpassing every teacher model—including GPT‑5—on the CXR‑MAX test set (see Table 1). In categories with extreme teacher disagreement (e.g., Consolidation and Edema, where teacher accuracies differ by more than 70%), APO maintains top‑two performance, demonstrating stability and robustness. The results support the claim that converting drift into dynamic constraints limits the propagation of bias and hallucination, yielding more reliable reasoning.
05 Conclusion
APO marks a shift from static knowledge distillation to dynamic constraint‑based learning for multimodal LLMs. By formalizing teacher drift as negative constraints and embedding concept alignment into a constraint‑satisfaction problem, APO advances robust reasoning alignment in high‑risk, rapidly changing domains.