Cross-modal metric learning (CML) deals with learning distance functions for cross-modal data matching. Existing methods mostly focus on minimizing a loss defined on sample pairs. However, the numbers of intraclass and interclass sample pairs can be highly imbalanced in many applications, which can lead to degraded or unsatisfactory performance. The area under the receiver operating characteristic curve (AUC) is a more meaningful performance measure under such imbalanced distributions. To tackle this problem, and to make samples from different modalities directly comparable, a CML method is presented that directly maximizes AUC. The method can be further extended to optimize partial AUC (pAUC), the AUC between two specific false positive rates (FPRs). This is particularly useful in applications where only the performance within predefined false positive ranges is critical. The proposed method is formulated as a log-determinant regularized semidefinite optimization problem. For efficient optimization, a minibatch proximal point algorithm is developed. The algorithm is experimentally verified to be stable with respect to the number of sampled pairs forming a minibatch at each iteration. Several data sets have been used in evaluation, including three cross-modal data sets on face recognition under various scenarios and a single-modal data set, Labeled Faces in the Wild. Results demonstrate the effectiveness of the proposed methods and marked improvements over existing methods. In particular, pAUC-optimized CML proves more competitive on performance measures such as Rank-1 accuracy and verification rate at FPR = 0.1%.
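The pAUC quantity the abstract optimizes can be illustrated with a small sketch (this is an illustrative computation of normalized pAUC from scores, not the paper's semidefinite formulation or its proximal point algorithm): the pAUC between two FPRs counts correctly ranked positive-negative pairs, restricted to the negatives whose ranks by descending score fall inside the chosen FPR band.

```python
import numpy as np

def pauc(scores_pos, scores_neg, fpr_lo=0.0, fpr_hi=1.0):
    """Normalized partial AUC over FPR in [fpr_lo, fpr_hi].

    Sorting negatives by descending score, the negatives ranked in
    [fpr_lo * n_neg, fpr_hi * n_neg) are exactly those that determine
    the ROC curve inside the FPR band; pAUC is the fraction of
    positive-negative pairs (restricted to that band) ranked correctly.
    Ties count as half-correct. With fpr_lo=0, fpr_hi=1 this reduces
    to the ordinary AUC (Wilcoxon-Mann-Whitney statistic).
    """
    neg = np.sort(np.asarray(scores_neg))[::-1]   # descending scores
    n_neg = len(neg)
    lo = int(np.floor(fpr_lo * n_neg))
    hi = int(np.ceil(fpr_hi * n_neg))
    band = neg[lo:hi]                             # negatives inside the FPR band
    correct = sum((p > band).sum() + 0.5 * (p == band).sum()
                  for p in scores_pos)
    return correct / (len(scores_pos) * len(band))
```

For example, restricting the band can change the picture: a single high-scoring negative ruins the low-FPR region of the ROC curve even when the full AUC looks acceptable, which is why pAUC is the more relevant measure when only small FPRs matter.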
DOI: http://dx.doi.org/10.1109/TNNLS.2017.2769128
Physiol Meas
January 2025
Academy of Military Science of the People's Liberation Army, Beijing, 100073, China.
Objective: Humanity faces many health challenges, among which respiratory diseases are one of the leading causes of death. Existing AI-driven pre-diagnosis approaches can enhance diagnostic efficiency but still face challenges. For example, single-modal data suffer from information redundancy or loss, and make it difficult to learn relationships between features or to reveal the obscure characteristics of complex diseases.
Neural Netw
December 2024
School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China.
Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlation between different modalities from redundant information according to instructions, and then make sequential decisions on visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing text knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relationship structure between multi-modal data from the perspective of causality and weakens the potential correlation between different modalities through cross-modal causality reasoning.
Sensors (Basel)
December 2024
Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea.
Generating accurate and contextually rich captions for images and videos is essential for various applications, from assistive technology to content recommendation. However, challenges such as maintaining temporal coherence in videos, reducing noise in large-scale datasets, and enabling real-time captioning remain significant. We introduce MIRA-CAP (Memory-Integrated Retrieval-Augmented Captioning), a novel framework designed to address these issues through three core innovations: a cross-modal memory bank, adaptive dataset pruning, and a streaming decoder.
Med Image Anal
December 2024
School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China; Department of Cardiovascular Medicine, University of Oxford, OX39DU, UK.
Med Image Anal
December 2024
Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunication, Chongqing, 400065, China.
Medical report generation is a cross-modal task that generates medical text, aiming to provide professional descriptions of medical images in clinical language. Although some methods have made progress, limitations remain, including insufficient focus on lesion areas, omission of internal edge features, and difficulty in aligning cross-modal data. To address these issues, we propose Dual-Modality Visual Feature Flow (DMVF) for medical report generation.