How Dual-Optimization Distillation Boosts SAM-Driven Multimodal Image Fusion

This article presents a novel dual‑optimization distillation framework that injects Segment Anything Model (SAM) semantic priors into multimodal image fusion, achieving unified visual quality and task accuracy while using a lightweight sub‑network to keep inference efficient.

AI Frontier Lectures

Project Information

Paper: https://arxiv.org/abs/2503.01210

Code: https://github.com/RollingPlain/SAGE_IVIF

Key Highlights

Unified visual quality and task precision : Traditional fusion methods focus on visual appearance and ignore downstream task requirements, while early deep‑learning approaches suffer from inconsistent optimization objectives. The proposed dual‑level distillation framework injects SAM’s semantic priors into the fusion network, achieving high visual fidelity and strong task performance.

Lightweight practical design : Knowledge is distilled into a compact sub‑network that retains high‑quality fusion results and supports downstream segmentation without the heavy computational burden of the full SAM model.

Problems Addressed

Limitations of traditional methods : Information‑theoretic fusion handles redundant information poorly and breaks down in certain scenes; early deep‑learning fusion produces blurry edges and artifacts, so its outputs fall short of the stricter requirements of downstream tasks.

Conflicting optimization goals : Coupling fusion with downstream tasks creates a trade‑off between visual quality and task adaptability.

Computational cost of SAM : Full SAM inference is prohibitive for resource‑constrained environments, limiting practical deployment.

Proposed Method

Incorporating SAM semantic priors : SAM’s rich semantic knowledge is injected into the multimodal fusion pipeline, enhancing scene understanding and improving both visual and task performance.

SPA feature preservation and integration : The SPA module uses a persistent storage mechanism to retain key source features and merges them with SAM‑derived semantics via cross‑attention, enabling deep multimodal fusion.
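The cross‑attention step described above can be sketched in a few lines. This is a minimal illustration, not the authors' SPA module: the function names (`cross_attention`, `fuse_with_semantics`) are hypothetical, learned projection matrices are omitted, and the residual addition is only an assumed stand‑in for the "persistent storage" of source features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_with_semantics(source_tokens, sam_tokens):
    """Source features query the SAM-derived tokens (assumed design).

    The residual addition keeps the original source features intact and
    adds the attended semantic context on top of them.
    """
    attended = cross_attention(source_tokens, sam_tokens, sam_tokens)
    return [[s + a for s, a in zip(src, att)]
            for src, att in zip(source_tokens, attended)]


if __name__ == "__main__":
    source = [[0.0, 0.0], [1.0, 1.0]]   # toy source-image tokens
    sam = [[1.0, 2.0], [1.0, 2.0]]      # toy SAM semantic tokens
    print(fuse_with_semantics(source, sam))
```

Using the source features as queries means each spatial location decides how much semantic context to pull in, while the residual path guarantees that nothing from the source is lost.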

Dual‑level optimization‑driven distillation : A triple‑loss distillation transfers SAM‑enhanced representations from a heavyweight main network to a lightweight sub‑network, allowing independent inference without SAM while preserving performance.
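A three‑term distillation loss of this kind might look like the sketch below. The term names follow the article (feature alignment, contextual consistency, semantic contrast), but the concrete formulas are assumptions chosen for illustration: mean squared error for alignment, matching of pairwise cosine‑similarity matrices for context, and an InfoNCE‑style contrast where each sub‑network token should be closest to the main‑network token at the same position. The weights and temperature are placeholders, not values from the paper.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    sq = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(sq) / len(sq)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv + 1e-12)


def affinity(tokens):
    """Pairwise cosine similarities, a simple proxy for 'context'."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def info_nce(student, teacher, temperature=0.1):
    """Contrastive term: token i of the student should match token i of the teacher."""
    loss = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(student)


def distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms (assumed formulation, not the paper's)."""
    terms = {
        "feature_alignment": mse(student, teacher),
        "contextual_consistency": mse(affinity(student), affinity(teacher)),
        "semantic_contrast": info_nce(student, teacher),
    }
    total = sum(w * terms[name] for w, name in zip(weights, terms))
    return total, terms


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]   # main-network features (toy)
    student = [[0.9, 0.1], [0.2, 0.8]]   # sub-network features (toy)
    print(distillation_loss(student, teacher))
```

Returning the individual terms alongside the total makes it easy to monitor which part of the transfer is lagging during training.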

Design Motivation & Overall Architecture

Core challenge : Leverage SAM’s semantic priors at inference time without incurring its heavy computation.

Innovative framework : A dual‑level optimization architecture comprising a SAM‑augmented main network and a lightweight sub‑network. The optimization objective explicitly models cooperative learning between the two networks, reducing inference cost while maintaining fusion quality.

Technical highlights : A DARTS‑like training strategy alternates optimization of the two networks, and a composite loss (feature alignment, contextual consistency, and semantic contrast) ensures the sub‑network captures essential knowledge.
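The alternating schedule can be illustrated with a toy problem. In the sketch below each "network" is reduced to a single scalar, and the two losses are assumptions made for the example: both parameters are pulled toward a fusion target, and a coupling term ties them together. It shows only the first‑order alternation pattern (update one network while the other is frozen, then swap), not the paper's actual objective.

```python
def alternating_train(target=2.0, steps=200, lr=0.1, distill_weight=0.5):
    """Toy first-order alternating optimization of two coupled parameters.

    main_w and sub_w are scalar stand-ins for the SAM-augmented main network
    and the lightweight sub-network (hypothetical simplification).
    """
    main_w, sub_w = 0.0, 0.0
    for _ in range(steps):
        # Step 1: update the sub-network with the main network frozen.
        # loss_sub = (sub_w - target)^2 + distill_weight * (sub_w - main_w)^2
        grad_sub = 2 * (sub_w - target) + 2 * distill_weight * (sub_w - main_w)
        sub_w -= lr * grad_sub

        # Step 2: update the main network with the sub-network frozen.
        # loss_main = (main_w - target)^2 + distill_weight * (main_w - sub_w)^2
        grad_main = 2 * (main_w - target) + 2 * distill_weight * (main_w - sub_w)
        main_w -= lr * grad_main
    return main_w, sub_w


if __name__ == "__main__":
    main_w, sub_w = alternating_train()
    print(f"main={main_w:.4f} sub={sub_w:.4f}")
```

In a real setup each step would be a mini‑batch gradient update on the corresponding network's parameters, with the other network's weights excluded from the optimizer for that step.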

Experimental Setup

Five representative datasets were used: TNO, RoadScene, MFNet, FMB, and M3FD. Evaluation metrics include EN, SD, SCD, and MS‑SSIM for fusion quality; BRISQUE, NIQE, MUSIQ, and PaQ‑2‑PiQ for no‑reference perceptual quality; and IoU/mIoU for segmentation performance.
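Two of the fusion metrics are simple enough to compute by hand. The sketch below uses the common definitions from the fusion literature, assuming an 8‑bit grayscale image given as a nested list of integers: EN is the Shannon entropy of the gray‑level histogram, and SD is the standard deviation of pixel intensities. The function names are illustrative and are not taken from the project's code.

```python
import math
from collections import Counter


def entropy(image):
    """EN: Shannon entropy (in bits) of the gray-level histogram."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def standard_deviation(image):
    """SD: population standard deviation of pixel intensities."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)


if __name__ == "__main__":
    fused = [[0, 255], [0, 255]]   # toy 2x2 fused image
    print(entropy(fused), standard_deviation(fused))
```

Higher EN suggests the fused image carries more information, and higher SD suggests stronger contrast; both are computed on the fused image alone, with no reference to the source images.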

Results

Qualitative

Information retention : Preserves visible‑light texture and infrared thermal cues, achieving a “best‑of‑both‑worlds” fusion.

Robustness to interference : Accurately reconstructs lane markings and distant structures under night fog, demonstrating strong environmental adaptability.

Quantitative

On the FMB segmentation benchmark, the method reaches 61.2% mIoU with SegFormer‑B3 (a 3.0% improvement over the runner‑up) and 51.1% mIoU with the open‑vocabulary X‑Decoder, which is applied without retraining.
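For readers unfamiliar with the segmentation metric, mIoU is the mean over classes of intersection‑over‑union between predicted and ground‑truth label maps. The sketch below computes it for a single pair of label maps given as nested lists of class indices. Benchmarks typically accumulate counts over the whole test set before averaging, so this per‑image version is a simplification, and the function names are illustrative.

```python
def per_class_iou(pred, target, num_classes):
    """IoU for each class; None when the class is absent from both maps."""
    ious = []
    for c in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred, in_target = p == c, t == c
                intersection += int(in_pred and in_target)
                union += int(in_pred or in_target)
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """mIoU: average IoU over the classes that actually occur."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid)


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    print(per_class_iou(pred, target, num_classes=3))
    print(mean_iou(pred, target, num_classes=3))
```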

Conclusion

A SAM‑guided dual‑level optimization distillation approach unifies visual quality and downstream task accuracy for infrared‑visible image fusion. Because only the lightweight sub‑network runs at inference time, SAM's computational cost is avoided, offering a new direction for the field.

Latest Research

The survey paper “Infrared and Visible Image Fusion: From Data Compatibility to Task Adaptation” has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Paper: https://ieeexplore.ieee.org/abstract/document/10812907

GitHub resources: https://github.com/RollingPlain/IVIF_ZOO
