When humans hear the sound of an object, they recall associated visual information and integrate the sound with the recalled visual information to detect the object. In this article, we present a novel sound-based object detector that mimics this process. We design a visual modality recalling (VMR) memory that recalls visual modal information from an audio input (i.e., sound). To this end, we propose a VMR loss and an audio-visual association loss that guide the VMR memory to memorize visual modal information by establishing associations between the audio and visual modalities. We then perform audio-visual integration using the visual modal information recalled through the VMR memory together with the original audio input. In this step, we introduce an integrated feature contrastive loss that allows the integrated feature to be embedded as if it were encoded from both audio and visual modal inputs. This guidance enables our sound-based object detector to perform visual object detection effectively even when only sound is provided. We believe that our work is a cornerstone study that offers a new perspective on conventional object detection studies, which rely solely on the visual modality. Comprehensive experimental results demonstrate the effectiveness of the proposed method with the VMR memory.
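The abstract does not specify the memory architecture or the exact form of the losses, so the following is only a rough illustration of the general idea. It assumes a key-value memory read by softmax attention over audio features, additive fusion of audio and recalled features, and an InfoNCE-style contrastive objective. All of these choices, and every name in the code (`recall_visual`, `info_nce`, the feature sizes), are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recall_visual(audio_feat, keys, values):
    """Read a (hypothetical) key-value memory using audio features as queries.

    audio_feat: (B, D) audio embeddings
    keys:       (M, D) memory keys, assumed to live in the audio space
    values:     (M, D) memory values, assumed to store visual information
    Returns a (B, D) array of recalled visual-like features.
    """
    weights = softmax(audio_feat @ keys.T, axis=-1)  # (B, M) addressing weights
    return weights @ values                          # weighted sum of value slots

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE-style contrastive loss: row i of anchor should match row i of positive."""
    a, p = l2_normalize(anchor), l2_normalize(positive)
    logits = a @ p.T / temperature                   # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy usage with random features (no real audio or images involved).
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))         # audio embeddings
visual = rng.normal(size=(4, 8))        # visual embeddings, available only in training
keys = rng.normal(size=(16, 8))
values = rng.normal(size=(16, 8))

recalled = recall_visual(audio, keys, values)
integrated = audio + recalled           # assumed fusion: simple addition
target = audio + visual                 # feature built from both real modalities
loss = info_nce(integrated, target)     # pulls integrated features toward targets
```

In this sketch the contrastive term plays the role the abstract assigns to the integrated feature contrastive loss: it encourages the audio-only integrated feature to resemble a feature built from both modalities. The VMR loss and the audio-visual association loss are not modeled here.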


Source
http://dx.doi.org/10.1109/TNNLS.2023.3323560

Publication Analysis

Top Keywords

visual modality: 20
vmr memory: 16
object detection: 12
visual modal: 12
visual: 11
object: 8
visual object: 8
modality recalling: 8
sound-based object: 8
object detector: 8

Similar Publications

The preferred period hypothesis posits a slowing down of motor and perceptual rhythmic preferences with age, both reflecting an increase in the common internal oscillation period. This study further investigates the preferred period hypothesis by improving the measurement of perceptual rhythmic preferences through two tasks, tempo adjustment and tempo judgment, conducted in the auditory and visual modalities. The study was conducted with three groups of children (5-6, 8-9, and 11-12 years old) and a group of young adults (21 to 30 years old), all tested at the same time of day.


Introduction: Craniopharyngiomas are challenging benign tumors arising from Rathke's pouch remnants, often requiring multidisciplinary management due to their proximity to critical neurovascular structures. This meta-analysis systematically compares conventional radiation therapy (RT) and stereotactic radiosurgery (RS) in treating residual or recurrent craniopharyngiomas.

Method: A comprehensive literature search identified 44 studies, including 46 reports, meeting inclusion criteria such as progression-free survival (PFS) and post-radiotherapy complications.


Entrapment neuropathies of the lower extremity are often underdiagnosed due to limitations in clinical examination and electrophysiological testing. Advanced imaging techniques, particularly MR neurography and high-resolution ultrasonography (US), have significantly improved the evaluation and diagnosis of these conditions by enabling precise visualization of nerves and their surrounding anatomical structures. This review focuses on the imaging features of compressive neuropathies affecting the lumbosacral plexus and its branches, including the femoral, obturator, sciatic, common peroneal, and tibial nerves.


Previous research has shown that, when multiple similar items are maintained in working memory, recall precision declines. Less is known about how heterogeneous sets of items across different features within and between modalities impact recall precision. In two experiments, we investigated modality (Experiment 1, n = 79) and feature-specific (Experiment 2, n = 154) load effects on working memory performance.


Smart cities deploy various sensors such as microphones and RGB cameras to collect data to improve the safety and comfort of the citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck.

