Cross-modal dual subspace learning with adversarial network.

Neural Netw

School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong Province, China; Institute of Data Science and Technology, Shandong Normal University, Jinan 250014, Shandong Province, China.

Published: June 2020

Cross-modal retrieval has recently attracted much interest with the rapid growth of multimodal data; its two key challenges are effectively exploiting the complementary relationships among different modalities and narrowing the heterogeneous gap between them as much as possible. In this paper, we present a novel network model termed cross-modal Dual Subspace learning with Adversarial Network (DSAN). The main contributions are as follows: (1) Dual subspaces (a visual subspace and a textual subspace) are proposed, which better mine the underlying structural information of the different modalities as well as modality-specific information. (2) An improved quadruplet loss is proposed that accounts for both the relative and absolute distances between positive and negative samples and incorporates hard sample mining. (3) An intra-modal constrained loss is proposed to maximize the distance between the most similar cross-modal negative samples and their corresponding cross-modal positive samples. In particular, feature preservation and modality classification act as two adversaries: DSAN tries to narrow the heterogeneous gap between the modalities while distinguishing the original modality of randomly drawn samples in the dual subspaces. Comprehensive experimental results demonstrate that DSAN significantly outperforms nine state-of-the-art methods on four cross-modal datasets.
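The improved quadruplet loss described in contribution (2) can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not the authors' released code: the function names, margin values, and distance choices (Euclidean distances with hinge-style relative and absolute margin terms, plus hardest-negative selection) are placeholders chosen to mirror the description in the abstract.

```python
# Illustrative sketch only; margins and mining strategy are assumptions,
# not the exact formulation used in DSAN.
import torch
import torch.nn.functional as F


def hardest_negative(anchor, candidates):
    """For each anchor, pick the closest candidate negative (hard sample mining).
    anchor: (B, D); candidates: (B, K, D) -> (B, D)."""
    dists = torch.cdist(anchor.unsqueeze(1), candidates).squeeze(1)  # (B, K)
    idx = dists.argmin(dim=1)
    return candidates[torch.arange(candidates.size(0)), idx]


def quadruplet_loss(anchor, positive, negative1, negative2,
                    rel_margin=0.5, abs_margin=0.25):
    """Combines a relative constraint (anchor-positive vs. anchor-negative)
    with an absolute constraint (anchor-positive vs. negative-negative)."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative1)
    d_nn = F.pairwise_distance(negative1, negative2)
    relative = F.relu(d_ap - d_an + rel_margin)   # relative-distance term
    absolute = F.relu(d_ap - d_nn + abs_margin)   # absolute-distance term
    return (relative + absolute).mean()
```

In practice, negative1 could be chosen with hardest_negative from the in-batch cross-modal candidates so that each update focuses on the most confusable pairs.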

Source
http://dx.doi.org/10.1016/j.neunet.2020.03.015

Similar Publications

Medical Visual Question Answering aims to assist doctors in decision-making when answering clinical questions about radiology images. Nevertheless, current models learn cross-modal representations by placing the vision and text encoders in two separate spaces, which inevitably leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive Representation Learning with Adversarial Masking.

Data-driven calibration methods have shown promising results for accurate proprioception in soft robotics. This process can benefit greatly from numerical simulation for computational efficiency. However, the gap between the simulated and real domains limits the accurate, generalized application of the approach.

Multi-modal medical images are important for tumor lesion detection. However, existing detection models use only a single modality to detect lesions; they do not sufficiently consider multi-modal semantic correlations and lack the ability to express the shape, size, and contrast features of lesions. A Cross-Modal YOLOv5 model (CMYOLOv5) is proposed.

Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.

Neural Netw

December 2024

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. The prevailing methods map texts and images into a unified embedding space for matching, yet the intricate semantic correspondences between texts and images are still not effectively constructed.
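As a rough illustration of the "unified embedding space" matching that the prevailing methods use, the sketch below embeds a gallery of person images and a text query into one space and ranks images by cosine similarity. It is a generic dual-encoder retrieval sketch, not the architecture of the cited paper; the encoder outputs are assumed to already be fixed-dimensional embeddings.

```python
# Generic dual-encoder matching sketch (assumed setup, not the paper's model).
import torch
import torch.nn.functional as F


def rank_images_by_text(image_embeddings, text_embedding, top_k=5):
    """image_embeddings: (N, D) image gallery; text_embedding: (D,) text query.
    Returns the indices of the top_k most similar images."""
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embedding, dim=-1)
    scores = img @ txt            # cosine similarities, shape (N,)
    return scores.topk(top_k).indices
```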

Dual-modality visual feature flow for medical report generation.

Med Image Anal

December 2024

Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunication, Chongqing, 400065, China.

Medical report generation is a cross-modal task that produces medical text, aiming to provide professional descriptions of medical images in clinical language. Although some methods have made progress, limitations remain, including insufficient focus on lesion areas, omission of internal edge features, and difficulty in aligning cross-modal data. To address these issues, we propose Dual-Modality Visual Feature Flow (DMVF) for medical report generation.
