Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features for target localization. This formulation has limited capacity to model the query at the word level, and is therefore prone to neglecting words that may not be the most important for the sentence as a whole but are critical for identifying the referred object. In this article, we propose Word2Pix, a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word from the query sentence is given an equal opportunity to attend to visual pixels through multiple stacked transformer decoder layers. In this way, the decoder learns to model the language query and to fuse language with visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely end-to-end training and fast inference speed. Code is available at https://github.com/azurerain7/Word2Pix.
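The word-to-pixel attention idea described in the abstract can be illustrated with a minimal single-head cross-attention sketch in which each word embedding acts as a query over all pixel features. This is an illustrative assumption, not the authors' implementation: the function names `softmax` and `word_to_pixel_attention`, the toy vectors, and the omission of learned projections, multiple heads, and stacked decoder layers are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_embeddings, pixel_features):
    """Hypothetical single-head cross-attention (no learned projections).

    Every word is treated as its own query, so each word gets its own
    attention distribution over the pixels instead of being merged into
    one sentence embedding first.
    """
    dim = len(pixel_features[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    attended, weights = [], []
    for word in word_embeddings:
        # Similarity between this word and every pixel feature.
        scores = [
            scale * sum(w * p for w, p in zip(word, pixel))
            for pixel in pixel_features
        ]
        attn = softmax(scores)
        # Attention-weighted sum of pixel features for this word.
        context = [
            sum(a * pixel[d] for a, pixel in zip(attn, pixel_features))
            for d in range(dim)
        ]
        weights.append(attn)
        attended.append(context)
    return attended, weights


# Toy data (made up): two "words" and three "pixels" in a 2-D feature space.
words = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = word_to_pixel_attention(words, pixels)
```

In this toy setup, the first word places most of its attention on the first pixel and the second word on the second pixel, which shows how per-word queries let a single critical word select its own image region.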
DOI: http://dx.doi.org/10.1109/TNNLS.2022.3183827
J Clin Ultrasound
January 2025
JD Hamilton Consulting, Brighton, Michigan, USA.
Background: Ultrasound lung surface motion measurement is valuable for the evaluation of a variety of diseases. Speckle tracking or Doppler-based techniques are limited by the loss of visualization as a tracked point moves under ribs or is dependent.
Methods: We developed a synthetic lateral phase-based algorithm for tracking lung motion to overcome these limitations.
Conscious Cogn
January 2025
Humane Technology Lab, Catholic University of Sacred Heart, Milan, Italy; Applied Technology for Neuro-Psychology Lab., Istituto Auxologico Italiano IRCCS, Milan, Italy. Electronic address:
Psychedelic drugs offer valuable insights into consciousness, but disentangling their causal effects on perceptual and high-level cognition is nontrivial. Technological advances in virtual reality (VR) and machine learning have enabled the immersive simulation of visual hallucinations. However, comprehensive experimental data on how these simulated hallucinations affect high-level human cognition is lacking.
Sensors (Basel)
January 2025
Department of Civil Engineering and Engineering Management, National Quemoy University, Kinmen 89250, Taiwan.
Ground-based LiDAR technology has been widely applied in various fields for acquiring 3D point cloud data, including spatial coordinates, digital color information, and laser reflectance intensities (I-values). These datasets preserve the digital information of scanned objects, supporting value-added applications. However, raw point cloud data visually represent spatial features but lack attribute information, posing challenges for automated object classification and effective management.
Sensors (Basel)
January 2025
Yunnan Earthquake Agency, Kunming 650224, China.
The strong motion records collected in full-scale structures provide the ultimate evidence of how real structures, in situ, respond to earthquakes. This paper presents a novel method for visualization, in three dimensions (3D), of the collective motion by a dense array of sensors in a building. The method is based on one- and two-dimensional biharmonic spline interpolation of the motion recorded by multiple sensors on the same or multiple floors.
Life (Basel)
January 2025
Centro Oftalmológico Charles, Buenos Aires C1116, Argentina.
Background: The aim of this study was to evaluate visual outcomes and patient satisfaction after bilateral implantation of a new hydrophobic acrylic intraocular lens called Clareon (Alcon) using the mini-monovision technique.
Methods: A single-center, prospective, nonrandomized study was conducted in Tandil (Buenos Aires, Argentina), including patients scheduled for cataract surgery. To achieve mini-monovision, the spherical equivalent was calculated between -0.