Cross-modal retrieval has recently attracted much interest along with the rapid development of multimodal data; effectively exploiting the complementary relationships among different modalities and eliminating the heterogeneous gap as much as possible are the two key challenges. In this paper, we present a novel network model termed cross-modal Dual Subspace learning with Adversarial Network (DSAN). The main contributions are as follows: (1) Dual subspaces (a visual subspace and a textual subspace) are proposed, which better mine the underlying structural information of different modalities as well as modality-specific information. (2) An improved quadruplet loss is proposed, which takes into account both the relative and absolute distances between positive and negative samples and incorporates the idea of hard sample mining. (3) An intra-modal constraint loss is proposed to maximize the distance between the most similar cross-modal negative samples and their corresponding cross-modal positive samples. In particular, feature preservation and modality classification act as two antagonists: DSAN tries to narrow the heterogeneous gap between modalities while distinguishing the original modality of random samples in the dual subspaces. Comprehensive experimental results demonstrate that DSAN significantly outperforms 9 state-of-the-art methods on four cross-modal datasets.
DOI: http://dx.doi.org/10.1016/j.neunet.2020.03.015
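The abstract above names an improved quadruplet loss that combines relative and absolute distances with hard sample mining, but gives no formula. Below is a minimal PyTorch sketch of a quadruplet-style loss of that flavour; the distance metric, margin values, and mining strategy are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, negatives, margin_rel=0.4, margin_abs=0.2):
    """anchor, positive: (B, D); negatives: (B, K, D) candidate negatives per anchor."""
    d_ap = F.pairwise_distance(anchor, positive)                    # (B,)
    d_an = torch.cdist(anchor.unsqueeze(1), negatives).squeeze(1)   # (B, K)
    # Hard-sample mining: keep only the closest (hardest) negative per anchor.
    d_an_hard, idx = d_an.min(dim=1)                                # (B,)
    hard_neg = negatives[torch.arange(anchor.size(0)), idx]         # (B, D)
    # Relative-distance term (triplet-style): the positive should sit closer
    # to the anchor than the hardest negative by at least margin_rel.
    rel_term = F.relu(d_ap - d_an_hard + margin_rel)
    # Absolute-distance term: additionally push the hardest negative away
    # from the positive itself, enforcing separation in absolute terms.
    d_np = F.pairwise_distance(hard_neg, positive)
    abs_term = F.relu(d_ap - d_np + margin_abs)
    return (rel_term + abs_term).mean()

# Example call with random features (batch of 32, 8 negatives, 256-dim embeddings).
loss = quadruplet_loss(torch.randn(32, 256), torch.randn(32, 256), torch.randn(32, 8, 256))
```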
Med Image Anal
January 2025
Sixth Medical Center of Chinese PLA General Hospital, Beijing, 100037, China.
Medical Visual Question Answering aims to assist doctors in decision-making when answering clinical questions about radiology images. Nevertheless, current models learn cross-modal representations by placing the vision and text encoders in two separate spaces, which inevitably leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model based on Contrastive Representation Learning with Adversarial Masking.
Soft Robot
January 2025
Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea.
Data-driven calibration methods have shown promising results for accurate proprioception in soft robotics. This process can benefit greatly from numerical simulation for computational efficiency. However, the gap between the simulated and real domains limits the accurate, generalized application of the approach.
Sci Rep
December 2024
School of Computer Science and Engineering, North Minzu University, Yinchuan, 750021, China.
Multi-modal medical images are important for tumor lesion detection. However, existing detection models use only a single modality to detect lesions; multi-modal semantic correlation is not sufficiently considered, and the models lack the ability to express the shape, size, and contrast features of lesions. A Cross-Modal YOLOv5 model (CMYOLOv5) is proposed.
Neural Netw
December 2024
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, yet the intricate semantic correspondences between texts and images are still not effectively constructed.
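For context, the excerpt refers to the prevailing recipe of projecting both modalities into one shared embedding space and matching by similarity. The sketch below illustrates only that baseline idea; the encoder output dimensions and linear projections are placeholder assumptions, not the method proposed in the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceMatcher(nn.Module):
    """Project image and text features into one embedding space for retrieval."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # visual feature projection
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # textual feature projection

    def forward(self, img_feat, txt_feat):
        # L2-normalize both projections so the dot product is cosine similarity.
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # Image-text similarity matrix used for cross-modal matching/ranking.
        return v @ t.t()

# Example: similarity scores between 4 images and 6 text descriptions.
scores = SharedSpaceMatcher()(torch.randn(4, 2048), torch.randn(6, 768))
```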
Med Image Anal
December 2024
Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunication, Chongqing, 400065, China.
Medical report generation is a cross-modal task of generating medical text that aims to provide professional descriptions of medical images in clinical language. Although some methods have made progress, limitations remain, including insufficient focus on lesion areas, omission of internal edge features, and difficulty in aligning cross-modal data. To address these issues, we propose Dual-Modality Visual Feature Flow (DMVF) for medical report generation.