On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval.

J Imaging

Department of Computer Science, School of Science, Loughborough University, Loughborough LE11 3TT, UK.

Published: July 2021

Visual-semantic embedding (VSE) networks create joint image-text representations that map images and texts into a shared embedding space, enabling information retrieval tasks such as image-text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are VSE++, SCAN, VSRN, and UNITER. This study evaluates these four networks on the task of image-to-text retrieval and analyses their strengths and limitations to guide further research on using VSE networks for cross-modal information retrieval tasks. Experimental results on Flickr30K showed that the pre-trained network, UNITER, achieved an average Recall@5 of 61.5% for the task of retrieving all relevant descriptions. The traditional networks VSRN, SCAN, and VSE++ achieved an average Recall@5 of 50.3%, 47.1%, and 29.4%, respectively, on the same task. An additional analysis was performed on image-text pairs from the 25 worst-performing classes, using a subset of the Flickr30K-based dataset, to identify the limitations of the two best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks.
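The abstract reports average Recall@5 for retrieving all relevant descriptions but does not spell out the formula. The sketch below shows one common reading, under stated assumptions: for each query image, Recall@K is the fraction of its relevant descriptions that appear among the top K retrieved texts, and the reported figure is the mean over query images. The function names, identifiers, and toy rankings are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant descriptions found in the top-k retrieved texts."""
    if not relevant:
        raise ValueError("each query image needs at least one relevant description")
    # A set intersection counts each relevant description at most once.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def average_recall_at_k(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Mean Recall@K over all query images (assumed averaging scheme)."""
    scores = [recall_at_k(rankings[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two query images, five relevant descriptions each.
rankings = {
    "img_a": ["a1", "b2", "a2", "a3", "b1", "a4", "a5"],
    "img_b": ["b1", "b2", "a1", "b3", "b4", "b5", "a2"],
}
ground_truth = {
    "img_a": {"a1", "a2", "a3", "a4", "a5"},
    "img_b": {"b1", "b2", "b3", "b4", "b5"},
}

if __name__ == "__main__":
    print(f"{average_recall_at_k(rankings, ground_truth, k=5):.2f}")
```

In this toy case, img_a has three of its five descriptions in the top 5 and img_b has four, so the two per-image scores are 3/5 and 4/5 before averaging. With five relevant descriptions per image, a perfect score at K = 5 requires every top-5 slot to hold a relevant description, which makes this reading stricter than counting a query as successful when any one relevant description is retrieved.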

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8404943
DOI: http://dx.doi.org/10.3390/jimaging7080125
