Rethinking symbolic and visual context in Referring Expression Generation.

Front Artif Intell

Faculty of Linguistics and Literary Studies, Bielefeld University, Bielefeld, Germany.

Published: March 2023

Situational context is crucial for linguistic reference to visible objects, since the same description can refer unambiguously to an object in one context but be ambiguous or misleading in others. This also applies to Referring Expression Generation (REG), where the production of identifying descriptions is always dependent on a given context. Research in REG has long represented visual domains through symbolic information about objects and their properties, in order to determine identifying sets of target features during content determination. In recent years, research in REG has turned to neural modeling and has recast the REG task as an inherently multimodal problem, looking at more natural settings such as generating descriptions for objects in photographs. Characterizing the precise ways in which context influences generation is challenging in both paradigms, as context notoriously lacks precise definition and categorization. In multimodal settings, however, these problems are further exacerbated by the increased complexity and low-level representation of perceptual inputs. The main goal of this article is to provide a systematic review of the types and functions of visual context across various approaches to REG so far, and to argue for integrating and extending the different perspectives on visual context that currently co-exist in research on REG. By analyzing the ways in which symbolic REG integrates context in rule-based approaches, we derive a set of categories of contextual integration, including the distinction between positive and negative semantic forces exerted by context during reference generation. Using this as a framework, we show that existing work in visual REG has so far considered only some of the ways in which visual context can facilitate end-to-end reference generation. Connecting with preceding research in related areas, we highlight, as possible directions for future research, additional ways in which contextual integration can be incorporated into REG and other multimodal generation tasks.
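For readers unfamiliar with content determination in symbolic REG, the following minimal Python sketch illustrates the kind of rule-based attribute selection the abstract refers to, loosely modeled on the classic Incremental Algorithm (Dale and Reiter, 1995). The domain objects, attribute names, and preference order are hypothetical, and the sketch simplifies the original algorithm (for instance, it omits the mandatory inclusion of the head noun's type).

```python
# Minimal sketch of rule-based content determination in symbolic REG,
# loosely following the Incremental Algorithm (Dale & Reiter, 1995).
# The domain, attribute names, and preference order are hypothetical.

def select_attributes(target, distractors, preference_order):
    """Greedily pick attribute-value pairs of the target until all
    distractors in the symbolic context are ruled out."""
    description = {}
    remaining = list(distractors)
    for attr in preference_order:
        value = target.get(attr)
        if value is None:
            continue
        # Keep this attribute only if it rules out at least one distractor.
        survivors = [d for d in remaining if d.get(attr) == value]
        if len(survivors) < len(remaining):
            description[attr] = value
            remaining = survivors
        if not remaining:
            return description  # uniquely identifying in this context
    return None  # no distinguishing description exists in this context

target = {"type": "ball", "colour": "red", "size": "small"}
# The same target object placed in two different symbolic contexts:
context_a = [{"type": "box", "colour": "red", "size": "small"}]
context_b = [{"type": "ball", "colour": "green", "size": "small"},
             {"type": "ball", "colour": "red", "size": "large"}]

print(select_attributes(target, context_a, ["type", "colour", "size"]))
# -> {'type': 'ball'}                      ("the ball" suffices)
print(select_attributes(target, context_b, ["type", "colour", "size"]))
# -> {'colour': 'red', 'size': 'small'}    (more attributes needed)
```

The toy output mirrors the abstract's opening claim: the very same target requires different identifying descriptions depending on which distractors the context supplies.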


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072327
DOI: http://dx.doi.org/10.3389/frai.2023.1067125

Publication Analysis

Top Keywords

visual context: 16
context: 11
referring expression: 8
expression generation: 8
contextual integration: 8
reference generation: 8
reg: 7
visual: 6
generation: 6
rethinking symbolic: 4

Similar Publications

To address the severe occlusion problem and the tiny-scale object problem in the multi-fitting detection task, a Scene Knowledge Integrating Network (SKIN), comprising a scene filter module (SFM) and a scene structure information module (SSIM), is proposed. First, the particularity of the scene in the multi-fitting detection task is analyzed. On this basis, the aggregation of the fittings is defined as the scene, according to professional knowledge of the power field and the habits of operators in identifying fittings.


Towards Context-Rich Automated Biodiversity Assessments: Deriving AI-Powered Insights from Camera Trap Data.

Sensors (Basel)

December 2024

School of Biological and Environmental Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, UK.

Camera traps offer enormous new opportunities in ecological studies, but current automated image analysis methods often lack the contextual richness needed to support impactful conservation outcomes. Integrating vision-language models into these workflows could address this gap by providing enhanced contextual understanding and enabling advanced queries across temporal and spatial dimensions. Here, we present an integrated approach that combines deep learning-based vision and language models to improve ecological reporting using data from camera traps.


Generating accurate and contextually rich captions for images and videos is essential for various applications, from assistive technology to content recommendation. However, challenges such as maintaining temporal coherence in videos, reducing noise in large-scale datasets, and enabling real-time captioning remain significant. We introduce MIRA-CAP (Memory-Integrated Retrieval-Augmented Captioning), a novel framework designed to address these issues through three core innovations: a cross-modal memory bank, adaptive dataset pruning, and a streaming decoder.


Restless legs syndrome (RLS) is a common sensorimotor sleep disorder that affects sleep and quality of life. Much effort has been made to advance RLS pharmacotherapy; however, patients with RLS still report poor long-term symptom control. Comprehensive Mendelian randomization (MR) was performed to search for potential causal genes and drug targets using cis-pQTL and RLS GWAS data.


In coronary artery bypass grafting (CABG) on pump, achieving optimal visualization is critical for surgical precision and safety. The use of blowers to clear the CABG anastomosis poses risks, including the formation of micro-embolic gas bubbles, which can be insidious and increase the risk of cerebral or myocardial complications. This retrospective study compares the effectiveness of irrigation mist and CO2 versus a direct CO2 blower without irrigation in terms of visualization, postoperative fibrillation, and micro-embolic gas activity.

