Turning Multi-Teacher Conflict into Dynamic Constraints for Precise Multimodal Model Alignment (ICML 2026)

The paper introduces APO, an autonomous preference optimization framework that converts concept drift among multiple teacher multimodal LLMs into dynamic negative constraints and treats inter-teacher consensus as a positive preference. The approach achieves robust concept alignment and surpasses strong teacher models on a high-risk medical chest X-ray benchmark.

Machine Heart

Introduction

Current multimodal large‑language‑model (MLLM) distillation assumes a single stable teacher, but analysis of seven mainstream MLLMs on a chest‑X‑ray diagnosis task reveals significant non‑stationary behavior: inference distributions shift dramatically across steps, causing concept drift, hallucinations, and semantic inconsistency when a student simply imitates the drifting teachers.
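One simple way to make "inference distributions shift across steps" concrete is to measure the KL divergence between consecutive predictive distributions along a teacher's decoding trajectory. The article does not specify which drift metric the authors use, so the sketch below (function names and the toy trace are invented for illustration) is only a minimal example of the idea:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    zp, zq = sum(p), sum(q)
    return sum((pi / zp) * math.log((pi / zp + eps) / (qi / zq + eps))
               for pi, qi in zip(p, q))

def stepwise_drift(dist_trace):
    """Per-step drift of one teacher stream: KL divergence between
    consecutive predictive distributions along its decoding trajectory."""
    return [kl_divergence(dist_trace[j], dist_trace[j + 1])
            for j in range(len(dist_trace) - 1)]

# Toy trace over a 4-token vocabulary: the predictive distribution is
# nearly stable for two steps, then shifts sharply (drift).
trace = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
drift = stepwise_drift(trace)
```

A sustained rise in this per-step KL profile is exactly the non-stationary behavior described above.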

Method

The authors define the non‑stationary multi‑stream concept alignment problem and propose the Autonomous Preference Optimization (APO) framework. APO converts inter‑teacher drift into dynamic negative constraints while using consensus among teachers as a positive preference, thereby guiding the student model toward a tighter feature space.

Multi‑Stream Inference Drift

Each teacher’s autoregressive trajectory is formalized as a sequential stream $S = (s_1, s_2, \dots)$. The state at step $j$ is

$$s_j = \big(y_{<j},\; p_j\big),$$

where $y_{<j}$ is the generated token prefix and $p_j = p(\,\cdot \mid y_{<j})$ the current predictive distribution.

For $N$ independent streams, the collective state at step $j$ is

$$\mathcal{S}_j = \big\{ s_j^{(1)}, s_j^{(2)}, \dots, s_j^{(N)} \big\},$$

where $s_j^{(u)}$ denotes the state of the $u$-th teacher. If the joint distribution evolves non-stationarily across steps, i.e., $P(\mathcal{S}_j) \neq P(\mathcal{S}_{j+\Delta})$ for some offset $\Delta$, multi-stream drift is said to occur.

Assuming independence among teachers, the joint distribution factorizes as

$$P(\mathcal{S}_j) = \prod_{u=1}^{N} p\big(y_{<j}^{(u)}\big)\, p\big(y_j^{(u)} \mid y_{<j}^{(u)}\big).$$

Here $p\big(y_{<j}^{(u)}\big)$ represents the cumulative historical deviation of teacher outputs, while $p\big(y_j^{(u)} \mid y_{<j}^{(u)}\big)$ captures the instantaneous drift at the current step.
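Under this independence assumption, log-probabilities add across streams, and each stream's contribution splits into a historical (prefix) term and an instantaneous (current-step) term. The following minimal sketch illustrates that decomposition; the function names and data layout are assumptions for illustration, not the paper's implementation:

```python
import math

def stream_log_prob(prefix_logprobs, step_dist, token_id):
    """One teacher stream's contribution at step j, split into the two
    factors named in the text:
      history: log p(y_{<j})        -- cumulative historical deviation
      instant: log p(y_j | y_{<j})  -- instantaneous drift term
    """
    history = sum(prefix_logprobs)
    instant = math.log(step_dist[token_id])
    return history, instant

def joint_log_prob(streams):
    """With independence across N teachers the joint factorizes, so
    log-probabilities simply add over streams and over both factors."""
    total = 0.0
    for prefix_logprobs, step_dist, token_id in streams:
        history, instant = stream_log_prob(prefix_logprobs, step_dist, token_id)
        total += history + instant
    return total
```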

Supervised Consensus Synthesis

APO first performs supervised consensus synthesis, where the student absorbs heterogeneous knowledge from all teachers, projecting itself into the union of teacher capabilities. A context‑consensus extraction mechanism aggregates raw teacher trajectories (containing both useful signals and drift errors) into a reference context. The student, acting as a discriminator, filters out contradictory information lacking cross‑model support and amplifies the logical intersection, yielding a highly coherent consensus trajectory.
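The article does not detail the extraction mechanism, but its core idea, keeping only content with cross-model support, can be sketched as a simple claim-level vote. In the actual method the student model itself acts as the discriminator rather than a fixed support threshold; all names below are illustrative:

```python
from collections import Counter

def extract_consensus(teacher_claims, min_support=2):
    """Keep only claims backed by at least `min_support` teachers;
    claims lacking cross-model support are treated as drift noise.
    (A stand-in threshold: in APO the student itself discriminates.)"""
    counts = Counter()
    for claims in teacher_claims:
        counts.update(set(claims))  # each teacher votes at most once per claim
    return sorted(c for c, n in counts.items() if n >= min_support)

# Three hypothetical teacher outputs, reduced to claim lists.
teachers = [
    ["cardiomegaly", "pleural effusion"],
    ["cardiomegaly", "pneumothorax"],
    ["cardiomegaly", "pleural effusion"],
]
consensus = extract_consensus(teachers)  # pneumothorax lacks support
```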

Constraint‑Aware Preference Optimization

With the consensus trajectory $\tau^{+}$ as a positive signal, the original conflicting teacher paths $\{\tau_u^{-}\}_{u=1}^{N}$ are recast as dynamic negative constraints. APO extends Direct Preference Optimization (DPO) to jointly promote the consensus (maximizing its generation probability) and suppress the drift patterns, turning inter-teacher conflicts into strong supervisory signals without external annotation.
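A minimal sketch of such an extension, assuming the standard pairwise DPO margin averaged over every drifting teacher trajectory as a negative (the paper's exact objective may differ), could look like this:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apo_style_loss(logp_pos, logp_negs, ref_logp_pos, ref_logp_negs, beta=0.1):
    """DPO-style loss with one consensus trajectory as the positive and
    several drifting teacher trajectories as dynamic negatives.

    logp_* are sequence log-probabilities under the student (policy);
    ref_logp_* are the same quantities under a frozen reference model.
    The pairwise DPO term is averaged over all negatives -- an
    illustrative extension, not the paper's exact objective.
    """
    pos_margin = beta * (logp_pos - ref_logp_pos)
    terms = []
    for logp_neg, ref_logp_neg in zip(logp_negs, ref_logp_negs):
        neg_margin = beta * (logp_neg - ref_logp_neg)
        terms.append(-math.log(sigmoid(pos_margin - neg_margin)))
    return sum(terms) / len(terms)
```

Minimizing this pushes the student's consensus log-probability up relative to the reference while pushing every drifting trajectory down, matching the "jointly optimize and suppress" behavior described above.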

Dataset Construction

To evaluate alignment under non‑stationary conditions, the authors build CXR‑MAX, a large‑scale benchmark for chest‑X‑ray diagnosis. CXR‑MAX extends MIMIC‑CXR with inference traces from seven leading MLLMs (GPT‑5, Gemini‑2.5, Sonnet‑4, Grok‑4, Qwen‑VL‑MAX, GLM‑4.5V, Moonshot), providing 170,982 instances covering 14 diseases.
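For concreteness, a CXR-MAX-style instance might be represented as follows; the field names and example values are hypothetical, since the article does not publish the schema:

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical record layout for a CXR-MAX-style instance; the actual
# field names and schema are not specified in the article.
@dataclass
class CXRMaxInstance:
    image_id: str                   # MIMIC-CXR study identifier
    question: str                   # diagnostic prompt
    labels: List[str]               # gold labels from the 14 disease classes
    teacher_traces: Dict[str, str]  # teacher name -> inference trace

instance = CXRMaxInstance(
    image_id="example-study-0001",
    question="Does this chest X-ray show signs of edema?",
    labels=["Edema"],
    teacher_traces={"GPT-5": "...", "Qwen-VL-MAX": "..."},
)
```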

Experimental Validation

Experiments include disease classification, report generation, chain‑of‑thought consistency, and generalization tests. Table 1 shows that a 7B student trained with APO attains a mean accuracy of 0.78, outperforming all teacher models—including GPT‑5—across disease categories. In highly divergent categories such as Consolidation and Edema, teacher accuracies differ by up to 70 %, yet APO’s student remains among the top two performers, demonstrating stability.

The results confirm that converting divergent teacher trajectories into dynamic constraints effectively blocks bias and erroneous knowledge, yielding robust and reliable reasoning.

Conclusion

APO advances multi‑teacher distillation from static learning to dynamic constraint satisfaction, formalizing teacher drift as negative constraints and embedding concept alignment as a constraint‑satisfaction problem. This enables robust reasoning alignment for multimodal models in high‑risk, rapidly changing domains.


Tags: knowledge distillation, multimodal LLM, concept drift, ICML 2026, APO, CXR-MAX