Image captioning is the task of generating descriptive captions for images. A typical approach uses a Convolutional Neural Network (CNN) as the encoder to extract visual features and a decoder, often based on Recurrent Neural Networks (RNNs), to generate the captions. More recently, self-attention has been widely adopted in encoder-decoder architectures. However, this approach still faces open challenges. One is that the extracted visual features do not fully exploit the information in the image, primarily because they lack semantic concepts. This limits the model's ability to comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model enhanced with a semantic representation of the objects in the image. Our model incorporates this semantic representation into the encoder attention, enriching the visual features with instance-level concepts. We also use a Transformer as the decoder in the language generation module. This design improves the model's ability to generate accurate and diverse captions. We evaluated our model on the MS-COCO dataset and the novel MACE dataset. The results show that our model's caption generation performance is on par with state-of-the-art approaches.
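
The abstract does not specify how the semantic representation enters the encoder attention. The sketch below is an illustration only, not the authors' implementation. It assumes one common pattern: each detected object label (an instance-level concept) is embedded into the same dimension as the region features, the resulting semantic tokens are appended to the visual tokens, and scaled dot-product self-attention runs over the joint sequence so each region can attend to the concepts. The names `semantic_enhanced_encoder` and `concept_table`, the identity Q/K/V projections, the single head, and the residual connection are all hypothetical simplifications.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys and values are the tokens
    themselves, i.e. the learned W_Q, W_K, W_V projections are identity.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        weights = softmax([dot(query, key) * scale for key in tokens])
        # Weighted sum of the value vectors, one output coordinate at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return outputs


def semantic_enhanced_encoder(
    region_feats: List[Vector],
    concept_labels: List[str],
    concept_table: Dict[str, Vector],
) -> List[Vector]:
    """Hypothetical encoder layer that lets region features attend to object concepts."""
    # Embed each detected object label; embeddings share the region feature dimension.
    semantic_tokens = [concept_table[label] for label in concept_labels]
    # Joint sequence: visual tokens first, then semantic (concept) tokens.
    attended = self_attention(region_feats + semantic_tokens)
    # Keep only the visual positions and add a residual connection.
    n = len(region_feats)
    return [[r + a for r, a in zip(region, att)] for region, att in zip(region_feats, attended[:n])]


# Toy example with made-up 4-dimensional features and two detected objects.
concept_table = {"dog": [1.0, 0.0, 0.0, 0.0], "frisbee": [0.0, 1.0, 0.0, 0.0]}
regions = [[0.5, 0.1, 0.2, 0.0], [0.0, 0.4, 0.1, 0.3]]
enhanced = semantic_enhanced_encoder(regions, ["dog", "frisbee"], concept_table)
print(len(enhanced), len(enhanced[0]))  # 2 4
```

Appending concepts as extra tokens is only one reading of "incorporates semantic representation in encoder attention"; adding each concept embedding to its own region feature, or attending to the concepts through a separate cross-attention layer, would be equally plausible designs.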

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10975165
DOI: http://dx.doi.org/10.3390/s24061796