Turning Multi-Teacher Conflict into Dynamic Constraints for Precise Multimodal Model Alignment (ICML 2026)
The paper introduces Autonomous Preference Optimization (APO), a framework that converts concept drift among multiple teacher multimodal LLMs into dynamic negative constraints while treating inter-teacher consensus as a positive preference. APO achieves robust concept alignment and surpasses strong teachers on a high‑risk medical chest‑X‑ray benchmark.
Introduction
Current multimodal large‑language‑model (MLLM) distillation assumes a single stable teacher, but analysis of seven mainstream MLLMs on a chest‑X‑ray diagnosis task reveals significant non‑stationary behavior: inference distributions shift dramatically across steps, causing concept drift, hallucinations, and semantic inconsistency when a student simply imitates the drifting teachers.
Method
The authors define the non‑stationary multi‑stream concept alignment problem and propose the Autonomous Preference Optimization (APO) framework. APO converts inter‑teacher drift into dynamic negative constraints while using consensus among teachers as a positive preference, thereby guiding the student model toward a tighter feature space.
Multi‑Stream Inference Drift
Each teacher's autoregressive trajectory is formalized as a sequential stream $S^{(u)} = (s_1^{(u)}, s_2^{(u)}, \dots)$. The state at step $j$ is $s_j^{(u)} = (c_j^{(u)}, P_j^{(u)})$, where $c_j^{(u)}$ is the generated token prefix and $P_j^{(u)}$ the current predictive distribution.

For $N$ independent streams, the collective state at step $j$ is $\mathcal{S}_j = \{s_j^{(1)}, \dots, s_j^{(N)}\}$, where $s_j^{(u)}$ denotes the state of the $u$‑th teacher. If the joint distribution evolves non‑stationarily across steps, i.e. $P(\mathcal{S}_j) \neq P(\mathcal{S}_{j+\Delta})$ for some offset $\Delta$, multi‑stream drift is said to occur.

Assuming independence among teachers, the joint distribution factorizes as $P(\mathcal{S}_j) = \prod_{u=1}^{N} P\big(s_j^{(u)} \mid c_j^{(u)}\big)$. Here the prefix $c_j^{(u)}$ represents the cumulative historical deviation of each teacher's outputs, while the conditional $P\big(s_j^{(u)} \mid c_j^{(u)}\big)$ captures instantaneous drift at the current step.
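The drift condition above can be sketched with a simple divergence test: compare each teacher's predictive distribution at step $j$ against step $j+\Delta$ and flag drift when the shift exceeds a tolerance. This is an illustrative sketch, not the paper's detection procedure; the function names and the KL threshold are assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete predictive distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def multi_stream_drift(streams, j, delta, threshold=0.1):
    """Flag multi-stream drift if any teacher's predictive distribution
    shifts between step j and step j + delta beyond `threshold`.

    `streams[u][t]` holds teacher u's predictive distribution at step t.
    Returns (drift_flag, per-teacher divergence scores).
    """
    scores = [kl_divergence(s[j], s[j + delta]) for s in streams]
    return max(scores) > threshold, scores
```

A stable teacher yields near-zero divergence between steps, while a teacher whose distribution flips (e.g. from one diagnosis to another) produces a large score and trips the flag.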
Supervised Consensus Synthesis
APO first performs supervised consensus synthesis, where the student absorbs heterogeneous knowledge from all teachers, projecting itself into the union of teacher capabilities. A context‑consensus extraction mechanism aggregates raw teacher trajectories (containing both useful signals and drift errors) into a reference context. The student, acting as a discriminator, filters out contradictory information lacking cross‑model support and amplifies the logical intersection, yielding a highly coherent consensus trajectory.
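The cross-model support filter can be approximated with simple claim-level voting: keep only statements that at least `min_support` teachers agree on, and discard unsupported or contradictory content. This is a deliberately simplified stand-in; the paper's mechanism uses the student itself as a discriminator over full trajectories, and the sentence-splitting heuristic here is an assumption.

```python
from collections import Counter

def consensus_claims(teacher_outputs, min_support=2):
    """Keep only claims (here: sentences) backed by at least
    `min_support` teachers. Simplified stand-in for APO's
    context-consensus extraction."""
    counts = Counter()
    for output in teacher_outputs:
        # Deduplicate within a single teacher before counting support,
        # so one repetitive teacher cannot vote twice.
        claims = {s.strip() for s in output.split(".") if s.strip()}
        for claim in claims:
            counts[claim] += 1
    return [claim for claim, n in counts.items() if n >= min_support]
```

With three teacher reports, a finding asserted by two or more survives into the consensus context, while a claim made by only one teacher (a likely drift artifact) is filtered out.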
Constraint‑Aware Preference Optimization
With the consensus trajectory
as a positive signal, the original conflicting teacher paths
are reconstructed as dynamic negative constraints. APO extends Direct Preference Optimization (DPO) to jointly optimize the consensus (maximizing its generation probability) and suppress drift patterns, turning inter‑teacher conflicts into strong supervisory signals without external annotation.
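A minimal sketch of this extension, assuming the standard DPO logistic loss applied pairwise between the consensus trajectory (positive) and each drifting teacher trajectory (negative), then averaged. Inputs are sequence log-probabilities under the student policy and a frozen reference model; this is an illustrative form, not the paper's exact objective.

```python
import math

def apo_style_loss(logp_pos_policy, logp_pos_ref,
                   logp_negs_policy, logp_negs_ref, beta=0.1):
    """DPO-style loss with one consensus positive and several
    drifting-teacher trajectories as dynamic negative constraints.

    Each (positive, negative) pair contributes the standard DPO term
    -log(sigmoid(beta * (pos_margin - neg_margin))); pairs are averaged.
    """
    pos_margin = beta * (logp_pos_policy - logp_pos_ref)
    pair_losses = []
    for lp_pol, lp_ref in zip(logp_negs_policy, logp_negs_ref):
        neg_margin = beta * (lp_pol - lp_ref)
        x = pos_margin - neg_margin
        # -log(sigmoid(x)) written stably as log1p(exp(-x)).
        pair_losses.append(math.log1p(math.exp(-x)))
    return sum(pair_losses) / len(pair_losses)
```

The loss falls when the student raises the consensus trajectory's probability relative to the reference while suppressing the drifting negatives, which is exactly the direction of supervision APO derives from inter-teacher conflict.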
Dataset Construction
To evaluate alignment under non‑stationary conditions, the authors build CXR‑MAX, a large‑scale benchmark for chest‑X‑ray diagnosis. CXR‑MAX extends MIMIC‑CXR with inference traces from seven leading MLLMs (GPT‑5, Gemini‑2.5, Sonnet‑4, Grok‑4, Qwen‑VL‑MAX, GLM‑4.5V, Moonshot), providing 170,982 instances covering 14 diseases.
Experimental Validation
Experiments include disease classification, report generation, chain‑of‑thought consistency, and generalization tests. Table 1 shows that a 7B student trained with APO attains a mean accuracy of 0.78, outperforming all teacher models—including GPT‑5—across disease categories. In highly divergent categories such as Consolidation and Edema, teacher accuracies differ by up to 70%, yet APO’s student remains among the top two performers, demonstrating stability.
The results confirm that converting divergent teacher trajectories into dynamic constraints effectively blocks bias and erroneous knowledge, yielding robust and reliable reasoning.
Conclusion
APO advances multi‑teacher distillation from static learning to dynamic constraint satisfaction, formalizing teacher drift as negative constraints and embedding concept alignment as a constraint‑satisfaction problem. This enables robust reasoning alignment for multimodal models in high‑risk, rapidly changing domains.