Visual grounding is the task of localizing an object described by a natural-language sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully exploit the information in the two modalities; it is more desirable to perform cross-modal interaction during the extraction of the visual and linguistic features themselves. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate this mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. At each stage of the PLVE, the visual features are customized with linguistic guidance by Channel-wise Language-guided Interaction Modules (CLIM). Without pre-training on object detection datasets, PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets while running at real-time speed. The source code is available in the supplementary material.
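The abstract does not spell out the CLIM equations, but the general idea of channel-wise language-guided modulation can be sketched as follows. This is a minimal illustration, not the authors' exact design: the function name, the sigmoid gating form, and the single linear projection are all assumptions.

```python
import numpy as np

def channel_wise_language_gating(visual_feat, lang_embed, w, b):
    """Modulate each visual channel with a language-derived gate.

    visual_feat: (C, H, W) feature map from one encoder stage
    lang_embed:  (D,) sentence-level language embedding
    w, b:        (C, D) projection and (C,) bias mapping language to per-channel gates
    """
    # Project the sentence embedding to one scalar per channel, squash to (0, 1).
    gates = 1.0 / (1.0 + np.exp(-(w @ lang_embed + b)))  # shape (C,)
    # Rescale every channel of the visual feature by its language-derived gate.
    return visual_feat * gates[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
out = channel_wise_language_gating(
    rng.standard_normal((C, H, W)),  # toy visual feature map
    rng.standard_normal(D),          # toy sentence embedding
    rng.standard_normal((C, D)),
    np.zeros(C),
)
print(out.shape)  # (8, 4, 4)
```

In a progressive encoder such as PLVE, a module of this kind would be applied at every stage, so that channels irrelevant to the referring sentence are suppressed early rather than filtered out after fusion.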

Source
http://dx.doi.org/10.1109/TIP.2022.3181516
