Current one-stage methods for visual grounding encode the language query as one holistic sentence embedding before fusing it with visual features for target localization. Such a formulation provides insufficient ability to model the query at the word level, and is therefore prone to neglecting words that may not be the most salient in the sentence but are critical for identifying the referred object. In this article, we propose Word2Pix: a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word in the query sentence is given an equal opportunity when attending to visual pixels through multiple stacks of transformer decoder layers. In this way, the decoder can learn to model the language query and fuse it with the visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while retaining the merits of the one-stage paradigm, namely end-to-end training and fast inference. Code is available at https://github.com/azurerain7/Word2Pix.

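To make the word-to-pixel attention idea concrete, the sketch below shows how word embeddings can act as decoder queries that cross-attend to flattened pixel features through stacked transformer decoder layers, with a pooled regression head predicting the referred object's box. This is a minimal, hypothetical PyTorch illustration rather than the released Word2Pix implementation; the class name WordToPixelDecoder, the mean-pooled box head, and all dimensions are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class WordToPixelDecoder(nn.Module):
    """Words act as decoder queries and cross-attend to flattened pixel features.

    Hypothetical sketch: not the authors' released code.
    """

    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Illustrative prediction head: regress one normalized (cx, cy, w, h) box.
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, word_embed, pixel_feat):
        # word_embed: (B, L, d_model)   -- one embedding per query word
        # pixel_feat: (B, H*W, d_model) -- flattened visual features from the encoder
        fused = self.decoder(tgt=word_embed, memory=pixel_feat)
        # Pool over the word tokens and predict the referred object's box.
        return self.box_head(fused.mean(dim=1)).sigmoid()


if __name__ == "__main__":
    B, L, HW, D = 2, 10, 400, 256
    model = WordToPixelDecoder(d_model=D)
    boxes = model(torch.randn(B, L, D), torch.randn(B, HW, D))
    print(boxes.shape)  # torch.Size([2, 4])
```

Each standard TransformerDecoderLayer already combines self-attention among the word tokens with cross-attention to the pixel features, which corresponds at a high level to the fusion the abstract describes; the pooling strategy and box head here are placeholders rather than the paper's prediction design.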
Source: http://dx.doi.org/10.1109/TNNLS.2022.3183827

Publication Analysis

Top Keywords

visual grounding (16); word pixel (8); visual (8); one-stage methods (8); language query (8); visual features (8); features target (8); word2pix (4); word2pix word (4); pixel cross-attention (4)

Similar Publications

Background: Ultrasound lung surface motion measurement is valuable for the evaluation of a variety of diseases. Speckle tracking or Doppler-based techniques are limited by the loss of visualization as a tracked point moves under ribs or is dependent.

Methods: We developed a synthetic lateral phase-based algorithm for tracking lung motion to overcome these limitations.

Immersive exposure to simulated visual hallucinations modulates high-level human cognition.

Conscious Cogn

January 2025

Humane Technology Lab, Catholic University of Sacred Heart, Milan, Italy; Applied Technology for Neuro-Psychology Lab., Istituto Auxologico Italiano IRCCS, Milan, Italy.

Psychedelic drugs offer valuable insights into consciousness, but disentangling their causal effects on perceptual and high-level cognition is nontrivial. Technological advances in virtual reality (VR) and machine learning have enabled the immersive simulation of visual hallucinations. However, comprehensive experimental data on how these simulated hallucinations affect high-level human cognition are lacking.

Ground-based LiDAR technology has been widely applied in various fields for acquiring 3D point cloud data, including spatial coordinates, digital color information, and laser reflectance intensities (I-values). These datasets preserve the digital information of scanned objects, supporting value-added applications. However, raw point cloud data visually represent spatial features but lack attribute information, posing challenges for automated object classification and effective management.

The strong motion records collected in full-scale structures provide the ultimate evidence of how real structures, in situ, respond to earthquakes. This paper presents a novel method for visualization, in three dimensions (3D), of the collective motion recorded by a dense array of sensors in a building. The method is based on one- and two-dimensional biharmonic spline interpolation of the motion recorded by multiple sensors on the same or multiple floors.

Background: The aim of this study was to evaluate visual outcomes and patient satisfaction after bilateral implantation of a new hydrophobic acrylic intraocular lens called Clareon (Alcon) using the mini-monovision technique.

Methods: A single-center, prospective, nonrandomized study was conducted in Tandil (Buenos Aires, Argentina), including patients scheduled for cataract surgery. To achieve mini-monovision, the spherical equivalent was calculated between -0.
