How Dual-Optimization Distillation Boosts SAM-Driven Multimodal Image Fusion
This article presents a novel dual‑optimization distillation framework that injects Segment Anything Model (SAM) semantic priors into multimodal image fusion, achieving unified visual quality and task accuracy while using a lightweight sub‑network to keep inference efficient.
Project Information
Paper: https://arxiv.org/abs/2503.01210
Code: https://github.com/RollingPlain/SAGE_IVIF
Key Highlights
Unified visual quality and task precision : Traditional fusion methods focus on visual appearance and ignore downstream task requirements, while early deep‑learning approaches suffer from inconsistent optimization objectives. The proposed dual‑level distillation framework injects SAM’s semantic priors into the fusion network, achieving high visual fidelity and strong task performance.
Lightweight practical design : Knowledge is distilled into a compact sub‑network that retains high‑quality fusion results and supports downstream segmentation without the heavy computational burden of the full SAM model.
Problems Addressed
Limitations of traditional methods : Information‑theoretic fusion struggles with redundancy and specific scenes; early deep‑learning fusion produces blurry edges and artifacts, failing strict downstream requirements.
Conflicting optimization goals : Coupling fusion with downstream tasks creates a trade‑off between visual quality and task adaptability.
Computational cost of SAM : Full SAM inference is prohibitive for resource‑constrained environments, limiting practical deployment.
Proposed Method
Incorporating SAM semantic priors : SAM’s rich semantic knowledge is injected into the multimodal fusion pipeline, enhancing scene understanding and improving both visual and task performance.
SPA feature preservation and integration : The SPA module uses a persistent storage mechanism to retain key source features and merges them with SAM‑derived semantics via cross‑attention, enabling deep multimodal fusion.
Dual‑level optimization‑driven distillation : A triple‑loss distillation transfers SAM‑enhanced representations from a heavyweight main network to a lightweight sub‑network, allowing independent inference without SAM while preserving performance.
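The cross-attention merge described for the SPA module can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the identity projections, the choice of source features as queries with SAM-derived semantics as keys and values, and the residual addition standing in for feature retention are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def cross_attention(source_feats, semantic_feats):
    """Let source tokens (queries) attend to semantic tokens (keys and values).

    Assumption: projections are the identity for brevity; a real module would
    learn separate query/key/value weight matrices.
    """
    dim = len(source_feats[0])
    scores = matmul(source_feats, transpose(semantic_feats))
    # Scaled dot-product attention: one weight row per source token.
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    attended = matmul(weights, semantic_feats)
    # Residual addition keeps the source features and adds the semantics.
    fused = [[s + a for s, a in zip(src, att)]
             for src, att in zip(source_feats, attended)]
    return fused, weights


# Hypothetical toy inputs: 3 source tokens and 2 semantic tokens, dimension 2.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(source, semantic)
```

Each row of `weights` sums to 1, so every source token receives a convex combination of the semantic tokens, weighted by similarity.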
Design Motivation & Overall Architecture
Core challenge : Retain the benefit of SAM’s semantic priors at inference time without running SAM itself, whose computation is heavy.
Innovative framework : A dual‑level optimization architecture comprising a SAM‑augmented main network and a lightweight sub‑network. The optimization objective explicitly models cooperative learning between the two networks, reducing inference cost while maintaining fusion quality.
Technical highlights : A DARTS‑like training strategy alternates optimization of the two networks, and a composite loss (feature alignment, contextual consistency, and semantic contrast) ensures the sub‑network captures essential knowledge.
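The alternating schedule can be pictured with a toy example. This is a sketch under strong assumptions, not the paper's training code: each network is reduced to a single scalar parameter, the fusion objective is a squared error to a hypothetical `target`, and one squared-difference distillation term stands in for the three-part composite loss. The names `alternate_train` and `distill_weight` are illustrative.

```python
def alternate_train(target=3.0, steps=200, lr=0.1, distill_weight=1.0):
    """Toy alternating optimization of a 'main' and a 'sub' scalar parameter.

    Each iteration takes one gradient step on the main network, then one on
    the sub-network, with the other parameter held fixed.
    """
    main, sub = 0.0, 0.0
    for _ in range(steps):
        # Step 1: update the main network on its own objective
        # L_main = (main - target)^2, holding the sub-network fixed.
        main_grad = 2.0 * (main - target)
        main -= lr * main_grad

        # Step 2: update the sub-network on
        # L_sub = (sub - target)^2 + distill_weight * (sub - main)^2,
        # treating the freshly updated main network as a fixed teacher.
        sub_grad = 2.0 * (sub - target) + distill_weight * 2.0 * (sub - main)
        sub -= lr * sub_grad
    return main, sub


main_param, sub_param = alternate_train()
```

In this toy setting both parameters converge to `target`. The distillation term only changes how quickly the sub-network follows the main one.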
Experimental Setup
Five representative datasets were used: TNO, RoadScene, MFNet, FMB, and M3FD. Evaluation metrics include EN, SD, SCD, and MS‑SSIM for image quality; BRISQUE, NIQE, MUSIQ, and PaQ‑2‑PiQ for no‑reference quality; and IoU/mIoU for segmentation performance.
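Two of these metrics are simple enough to sketch directly. The snippet below shows the standard definitions of EN (Shannon entropy of the gray-level histogram) and mIoU on flattened toy inputs. The function names are illustrative, and it scores a single image, whereas benchmark protocols typically accumulate statistics over a whole dataset; it is not the evaluation code used in the paper.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy (EN, in bits) of a flat list of integer gray levels."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical toy data: a 4-pixel image and 4-pixel label maps.
en_value = entropy([0, 0, 255, 255])
miou_value = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Higher EN indicates a richer gray-level distribution in the fused image, and mIoU averages the per-class overlap between predicted and ground-truth segmentation.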
Results
Qualitative
Information retention : Preserves visible‑light texture and infrared thermal cues, achieving a “best‑of‑both‑worlds” fusion.
Robustness to interference : Accurately reconstructs lane markings and distant structures under night fog, demonstrating strong environmental adaptability.
Quantitative
On the FMB segmentation benchmark, the method reaches 61.2% mIoU with SegFormer‑B3 (3.0% improvement over the runner‑up) and 51.1% mIoU on the open‑vocabulary X‑Decoder without retraining.
Conclusion
A SAM‑guided dual‑distillation approach unifies visual quality and downstream task accuracy for infrared‑visible image fusion. Because the distilled sub‑network runs without SAM, inference cost drops substantially, offering a new direction for the field.
Latest Research
The survey paper “Infrared and Visible Image Fusion: From Data Compatibility to Task Adaptation” has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Paper: https://ieeexplore.ieee.org/abstract/document/10812907
GitHub resources: https://github.com/RollingPlain/IVIF_ZOO
