How GIFNet’s Low‑Level Interaction Breakthrough Enables Universal Multimodal Fusion Across Tasks

The paper introduces GIFNet, a three‑branch network that leverages low‑level visual tasks and a cross‑fusion gating mechanism to achieve a single, task‑agnostic image‑fusion model with dramatically reduced computation, strong generalization to unseen modalities, and even single‑modal enhancement capabilities.

AIWalker

Overview

The authors identify three major obstacles in current image‑fusion research: (1) a semantic gap between high‑level visual tasks (e.g., detection, segmentation) and pixel‑level fusion, (2) poor generalization of task‑specific models to new devices, and (3) the high computational cost of existing pipelines. To address all three, they propose GIFNet, a fusion framework driven by low‑level tasks that unifies multimodal fusion (MM) and digital‑photography (DP) tasks under a single architecture.

Proposed Solution

Low‑Level Visual Task Supervision: Tasks such as multi‑focus image fusion (MFIF) and multi‑exposure image fusion (MEIF) provide pixel‑level supervision, avoiding interference from high‑level semantics.

GIFNet Architecture: A three‑branch network consisting of a main‑task branch (MM), an auxiliary‑task branch (DP), and a reconstruction branch (REC) that harmonizes the two. The main and auxiliary branches process multimodal and digital‑photography features respectively, while the reconstruction branch learns shared representations through a shared encoder (S‑Enc).

Cross‑Fusion Gating Mechanism (CFGM): Built on Swin‑Transformer blocks, CFGM iteratively refines task‑specific features and adaptively routes the mixed features to a global decoder (G‑Dec). A learnable parameter α controls the influence of the auxiliary task.

RGB Joint Dataset: Data augmentation creates a unified RGB‑centric dataset that aligns RGB, infrared, near‑focus, and far‑focus images, reducing the domain gap between multimodal and DP tasks.
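The role of the learnable α can be illustrated with a deliberately simplified sketch. It is not the paper's CFGM, which is built on Swin-Transformer blocks; it only assumes that α scales how much of the auxiliary (DP) features is added to the main (MM) features, and shows that such a scalar can be fitted by gradient descent. The function names, the additive mixing rule, and the MSE objective are all assumptions made for illustration.

```python
def mix(mm, dp, alpha):
    """Assumed mixing rule: main-task features plus alpha-scaled auxiliary features."""
    return [m + alpha * d for m, d in zip(mm, dp)]


def alpha_grad(mm, dp, target, alpha):
    """Derivative of the mean squared error of mix(mm, dp, alpha) w.r.t. alpha."""
    n = len(mm)
    return sum(2.0 * (m + alpha * d - t) * d for m, d, t in zip(mm, dp, target)) / n


def fit_alpha(mm, dp, target, alpha=0.0, lr=0.1, steps=200):
    """Plain gradient descent on alpha (hypothetical training loop)."""
    for _ in range(steps):
        alpha -= lr * alpha_grad(mm, dp, target, alpha)
    return alpha


# Toy feature vectors; the target was generated with alpha = 0.5.
mm_features = [0.2, 0.4, 0.6]
dp_features = [2.0, -1.0, 0.5]
target = mix(mm_features, dp_features, 0.5)
learned_alpha = fit_alpha(mm_features, dp_features, target)
```

With these toy inputs the fitted value approaches the α that generated the target, which is the behavior a fixed fusion operator cannot reproduce.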

Training and Inference

Two loss terms are employed:

Common loss (L_common) combines the structural similarity index (SSIM) and mean‑squared error (MSE) between the reconstruction branch output and the shared RGB modality.

Private loss (L_priv) is defined per task. For the multimodal branch, an information‑weighted loss guides the fused image to retain source content; for the DP branch, a supervised loss uses the MFIF ground truth.
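A rough sketch of how an SSIM term and an MSE term might be combined into a single reconstruction loss is shown below. The summary does not give the weighting or the SSIM window, so the weight `lam`, the use of one global window over the whole image, and the flat-list image representation are assumptions; only the SSIM formula and its usual constants are standard.

```python
def mean(values):
    return sum(values) / len(values)


def mse(x, y):
    """Mean squared error between two equally sized flat pixel lists."""
    return mean([(a - b) ** 2 for a, b in zip(x, y)])


def global_ssim(x, y, data_range=1.0):
    """SSIM computed over one global window (a simplification of windowed SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = mean(x), mean(y)
    var_x = mean([(a - mx) ** 2 for a in x])
    var_y = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return numerator / denominator


def common_loss(reconstruction, rgb, lam=1.0):
    """Assumed form: (1 - SSIM) + lam * MSE; lam is a hypothetical weight."""
    return (1.0 - global_ssim(reconstruction, rgb)) + lam * mse(reconstruction, rgb)
```

Both terms vanish when the reconstruction matches the RGB input exactly, so the loss is zero only for a perfect reconstruction.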

During training, only the infrared‑visible image fusion (IVIF) dataset LLVIP and the augmented DP data are used. At inference time, a single pair of images is fed into the network; the CFGM fuses the two task‑specific representations and the global decoder reconstructs the final fused image.

Experimental Setup

Evaluation spans a wide range of tasks and datasets, including:

IVIF: LLVIP, TNO

MFIF: Lytro, MFI‑WHU

Medical fusion: Harvard

NIR‑VIS: VIS‑NIR Scene

MEIF: SCIE

Remote‑sensing: QuickBird

Classification: CIFAR‑100 (to test single‑modal enhancement)

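Metrics include visual information fidelity (VIF), sum of the correlations of differences (SCD), edge intensity (EI), and average gradient (AG) for fusion quality, and top‑1/top‑5 accuracy for classification.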
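As an example of what these fusion metrics measure, the sketch below computes average gradient (AG) using the definition commonly found in the fusion literature: the mean over pixels of sqrt((dx² + dy²) / 2), with forward differences. The paper's exact implementation may differ, and the nested-list image format is an assumption chosen to keep the example dependency-free.

```python
import math


def average_gradient(img):
    """Average gradient of a grayscale image given as a list of rows.

    Uses forward differences, so the last row and column are excluded.
    Higher values indicate sharper detail and texture.
    """
    rows, cols = len(img), len(img[0])
    if rows < 2 or cols < 2:
        raise ValueError("image must be at least 2x2")
    total = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dx = img[i][j + 1] - img[i][j]  # horizontal difference
            dy = img[i + 1][j] - img[i][j]  # vertical difference
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((rows - 1) * (cols - 1))
```

A perfectly flat image scores zero, and the score grows as local intensity changes become stronger, which is why AG is read as a proxy for detail preservation.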

Ablation Studies

Component analysis (Table 1) shows that combining multi‑task learning (MTL) with the reconstruction branch (REC) yields strong performance, while removing either component causes divergence. Adding the CFGM or REC alone improves results, and their combination achieves the best scores.

Task‑combination experiments demonstrate that the DP supervision consistently boosts fusion quality; MFIF provides more compatible pixel‑level cues than MEIF, leading to higher VIF and AG scores.

The adaptive CFGM (learnable α) outperforms traditional fixed fusion operators, producing more robust fused images across quantitative and qualitative evaluations.

Feature Visualization

Visualizations of shared encoder (S‑Enc), MM, and DP branches reveal that S‑Enc captures basic structures (edges, contours), MM preserves salient modality‑specific cues (e.g., thermal targets), and DP enhances fine details and textures. These patterns hold across both seen and unseen tasks.

Results on Seen and Unseen Tasks

Seen multimodal tasks (MFIF, IVIF): GIFNet achieves the highest VIF (+25% over the next best) and superior edge intensity and gradient metrics, outperforming specialized methods such as Text‑IF, CDDFuse, DDFM, LRRNet, ZMFF, and UNIFusion.

Unseen tasks (MEIF, NIR‑VIS, remote‑sensing, medical): GIFNet consistently delivers clearer textures, better exposure balance, and higher VIF/AG scores (e.g., +46.7% VIF on MEIF) compared to state‑of‑the‑art algorithms like MEF‑GAN, SPD‑MEF, IID‑MEF, MURF, P2Sharpen, ZeroSharpen, and IFCNN.

Single‑modal classification: Using GIFNet‑enhanced images to train ResNet‑56 on CIFAR‑100 yields higher top‑1 accuracy than all other fusion‑based augmentation methods, demonstrating that the task‑agnostic representation also benefits downstream vision tasks.
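For readers unfamiliar with the classification metric, top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The helper below is a generic illustration, not code from the paper; the function name and the list-of-scores input format are assumptions.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores per sample.
    labels: the true class index for each sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

Top-5 accuracy is always at least as high as top-1 on the same predictions, since the top-1 class is included in the top five.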

Model Size and Efficiency

GIFNet reduces computational cost by over 96% relative to high‑level fusion baselines, making it suitable for deployment on resource‑constrained devices.
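To make the 96% figure concrete, the relative reduction can be written as follows. The symbols C_base and C_GIFNet are placeholders introduced here for whichever cost measure the comparison uses (for example FLOPs or parameter count); the summary above does not say which.

```latex
\[
\text{reduction} = \left(1 - \frac{C_{\text{GIFNet}}}{C_{\text{base}}}\right) \times 100\%,
\qquad
\text{reduction} > 96\% \iff C_{\text{GIFNet}} < 0.04\, C_{\text{base}}
\]
```

In other words, a reduction of over 96% means GIFNet needs less than one twenty-fifth of the baseline's cost.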

Conclusion

The study presents a novel low‑level task interaction paradigm that bridges the long‑overlooked gap in generalized image fusion. By integrating a shared reconstruction task and an RGB‑joint dataset, GIFNet achieves strong cross‑task generalization, low computation, and the first single‑model capability for both multimodal fusion and single‑modal enhancement.

References

[1] One Model for ALL: Low‑Level Task Interaction Is a Key to Task‑Agnostic Image Fusion


Tags: Computer Vision, multimodal, CVPR 2025, Image Fusion, cross-task learning, GIFNet, low-level interaction