In this study, we propose Multimodal Fusion-supervised Cross-modality Alignment Perception (MulFS-CAP), a novel framework for single-stage fusion of unregistered infrared-visible images. Traditional two-stage methods depend on explicit registration algorithms to align the source images spatially, which often adds complexity. In contrast, MulFS-CAP seamlessly blends implicit registration with fusion, simplifying the pipeline and making it better suited to practical applications. MulFS-CAP uses a shared shallow feature encoder to merge unregistered infrared-visible images in a single stage. To address the distinct requirements of feature-level alignment and fusion, we develop a consistent feature learning approach based on a learnable modality dictionary. This dictionary supplies complementary information to the unimodal features, maintaining consistency between the individual features and the fused multimodal features. As a result, MulFS-CAP effectively reduces the impact of modality variance on cross-modality feature alignment, enabling simultaneous registration and fusion. Additionally, MulFS-CAP introduces a novel cross-modality alignment approach that builds a correlation matrix describing pixel-level relationships between the source images; this matrix guides the alignment of infrared and visible features and further refines the fusion process. Together, these designs make MulFS-CAP lightweight, effective, and free of explicit registration. Experimental results on multiple datasets demonstrate the effectiveness of the proposed method and its superiority over state-of-the-art two-stage methods. The source code of our method is available at https://github.com/YR0211/MulFS-CAP.
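To make the correlation-matrix idea above concrete, the following is a minimal PyTorch sketch that computes an attention-style pixel-wise correlation matrix between infrared and visible feature maps and uses it to resample the infrared features toward the visible ones. The function name, tensor shapes, and overall structure are illustrative assumptions for this sketch, not the released MulFS-CAP implementation.

# Minimal sketch of correlation-matrix-based cross-modality feature alignment.
# Names and structure are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def align_by_correlation(feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
    """Resample infrared features toward visible features via a pixel-wise correlation matrix.

    feat_ir, feat_vis: (B, C, H, W) feature maps, e.g. from a shared shallow encoder.
    """
    b, c, h, w = feat_vis.shape
    q = feat_vis.flatten(2).transpose(1, 2)   # (B, HW, C) queries from the visible image
    k = feat_ir.flatten(2)                    # (B, C, HW) keys from the infrared image
    v = feat_ir.flatten(2).transpose(1, 2)    # (B, HW, C) values from the infrared image

    # Correlation matrix: similarity of every visible pixel to every infrared pixel.
    corr = torch.bmm(q, k) / (c ** 0.5)       # (B, HW, HW)
    attn = F.softmax(corr, dim=-1)

    aligned_ir = torch.bmm(attn, v)           # (B, HW, C): infrared content gathered per visible pixel
    return aligned_ir.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    ir, vis = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(align_by_correlation(ir, vis).shape)   # torch.Size([2, 64, 32, 32])

Because the softmax over the correlation matrix is differentiable, an alignment of this form can in principle be trained jointly with a downstream fusion loss, which is consistent with the single-stage registration-and-fusion setting described above.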

Source
http://dx.doi.org/10.1109/TPAMI.2025.3535617

Publication Analysis

Top Keywords

cross-modality alignment (12)
unregistered infrared-visible (12)
multimodal fusion-supervised (8)
fusion-supervised cross-modality (8)
alignment perception (8)
infrared-visible images (8)
two-stage methods (8)
source images (8)
registration fusion (8)
mulfs-cap (7)

Similar Publications

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the importance of scaling models and data to boost performance, yet doing so often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve a small number of experts and a limited set of modalities. To address this, our work presents a pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, which can handle a wide array of modalities.
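For readers unfamiliar with the Mixture of Experts idea mentioned above, the following is a generic top-k expert-routing layer in PyTorch. It is a textbook-style sketch, not the Uni-MoE architecture; the class name and hyperparameters (num_experts, k) are assumptions made for illustration.

# Generic top-k Mixture-of-Experts layer (illustrative sketch, not Uni-MoE).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); each token is dispatched to its top-k experts.
        scores = F.softmax(self.gate(x), dim=-1)             # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot + 1] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE(dim=256)
    print(layer(torch.randn(16, 256)).shape)   # torch.Size([16, 256])

Because only k experts run for each token, the parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the efficiency argument behind scaling models this way.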

Existing studies of multi-modality medical image segmentation tend to aggregate all modalities without discrimination and employ multiple symmetric encoders or decoders for feature extraction and fusion. They often overlook the differing contributions that individual modalities make to visual representation and decision making. Motivated by this observation, this paper proposes an asymmetric adaptive heterogeneous network for multi-modality image feature extraction with modality discrimination and adaptive fusion.
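As a rough illustration of the modality-aware adaptive fusion the abstract alludes to, the sketch below predicts one contribution weight per modality per sample and uses it to combine the modality features; the module and its gating design are assumptions made for this sketch, not the paper's actual network.

# Illustrative sketch of adaptive, modality-weighted feature fusion.
import torch
import torch.nn as nn


class AdaptiveModalityFusion(nn.Module):
    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        # A small gating head predicts one scalar weight per modality per sample.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels * num_modalities, num_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (B, C, H, W) feature maps, one per modality.
        stacked = torch.stack(feats, dim=1)            # (B, M, C, H, W)
        weights = self.gate(torch.cat(feats, dim=1))   # (B, M), sums to 1 over modalities
        weights = weights.view(*weights.shape, 1, 1, 1)
        return (weights * stacked).sum(dim=1)          # (B, C, H, W) fused features


if __name__ == "__main__":
    fuse = AdaptiveModalityFusion(channels=32, num_modalities=3)
    feats = [torch.randn(2, 32, 24, 24) for _ in range(3)]
    print(fuse(feats).shape)   # torch.Size([2, 32, 24, 24])

The point of the gate is that modalities contribute unequally from sample to sample, so the network learns per-sample weights instead of aggregating all modalities uniformly.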

RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between the RGB and thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, although designed for cross-modality feature interaction, may not adequately account for noise originating from defective modalities, leading to suboptimal performance in complex scenarios.

Cerebrovascular segmentation from time-of-flight magnetic resonance angiography (TOF-MRA) and computed tomography angiography (CTA) is essential for providing supportive information for diagnosing and planning the treatment of multiple intracranial vascular diseases. Different imaging modalities rely on distinct principles to visualize the cerebral vasculature, which leads to expensive annotation requirements and to performance degradation when training and deploying deep learning models. In this paper, we propose CereTS, an unsupervised domain adaptation framework that performs translation and segmentation of cross-modality unpaired cerebral angiography.
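To make the idea of translating and segmenting unpaired cross-modality data concrete, here is a generic sketch of translation-based unsupervised domain adaptation for segmentation. It is not the CereTS framework: the networks are toy stand-ins, the discriminator update and any cycle-consistency terms are omitted for brevity, and all names are assumptions.

# Generic sketch of translation-based unsupervised domain adaptation for segmentation.
import torch
import torch.nn as nn


def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())


# Toy stand-in networks (real ones would be far deeper).
translator = nn.Sequential(conv_block(1, 16), nn.Conv2d(16, 1, 3, padding=1))  # source -> target appearance
discriminator = nn.Sequential(conv_block(1, 16), nn.AdaptiveAvgPool2d(1),
                              nn.Flatten(), nn.Linear(16, 1))                  # real/fake in the target domain
segmenter = nn.Sequential(conv_block(1, 16), nn.Conv2d(16, 2, 1))              # 2-class vessel mask

bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()

src = torch.randn(2, 1, 64, 64)                # e.g. annotated source-domain (TOF-MRA) patches
src_mask = torch.randint(0, 2, (2, 64, 64))    # their vessel labels
# Unpaired target-domain (e.g. CTA) patches would be used to train the discriminator.

fake_tgt = translator(src)
# Generator-side adversarial loss: translated patches should look like the target domain.
loss_adv = bce(discriminator(fake_tgt), torch.ones(2, 1))
# Segmentation loss: reuse the source labels on the translated patches.
loss_seg = ce(segmenter(fake_tgt), src_mask)
(loss_adv + loss_seg).backward()
print(float(loss_adv), float(loss_seg))

The translated images inherit the source annotations, which is what lets a segmenter be trained for the target modality without target-domain labels.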
