GSE: A global-local storage enhanced video object recognition model.

Yuhong Shi, Hongguang Pan, Ze Jiang, Libin Zhang, Rui Miao, Zheng Wang, Xinyu Lei

Neural Netw

National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi'an, 710054, China; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, 710054, China.

Published: January 2025

The presence of substantial similarities and redundant information within video data limits the performance of video object recognition models. To address this issue, this paper proposes a Global-Local Storage Enhanced video object recognition model (GSE). First, the model incorporates a two-stage dynamic multi-frame aggregation module to aggregate shallow frame features. This module aggregates features in batches from each input video through feature extraction, dynamic multi-frame aggregation, and centralized concatenation, significantly reducing the model's computational burden while retaining key information. Second, a Global-Local Storage (GS) module is constructed to retain and use the information in the frame sequence effectively. This module classifies features using a temporal difference threshold method and filters and retains them through a process of inheritance, storage, and output. By integrating global, local, and key features, the model can accurately capture important temporal features in complex video scenes. Third, a Cascaded Multi-head Attention (CMA) mechanism is designed. Its multi-head cascade structure progressively focuses on object features and explores the correlations between key features and the global and local features, while a differential step attention calculation keeps the computation efficient. Finally, the model structure is optimized, its parameters are tuned, and the GSE model's performance is verified through comprehensive experiments. Experimental results on the ImageNet 2015 and NPS-Drones datasets show that the GSE model achieves the highest mAP of 0.8352 and 0.8617, respectively. Compared with other models, the GSE model achieves a good balance across metrics such as precision, efficiency, and power consumption.
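
The abstract does not specify how the temporal difference threshold is computed, so the following is only a minimal sketch of the general idea, not the GSE implementation. It assumes each frame is already reduced to a feature vector, uses mean absolute difference as the (assumed) distance, and keeps a frame as a "key" frame when it differs from the last kept frame by more than a threshold. The names `mean_abs_diff`, `select_key_frames`, and `threshold` are hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(features: List[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of frames that differ from the last kept frame
    by more than `threshold` (an assumed criterion, not the paper's)."""
    if not features:
        return []
    keys = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(features)):
        if mean_abs_diff(features[i], features[keys[-1]]) > threshold:
            keys.append(i)
    return keys


# Toy per-frame features: frames 1 and 3 are near-duplicates of their predecessors.
frames = [
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [1.0, 1.1],
    [3.0, 3.0],
]
key_indices = select_key_frames(frames, threshold=0.5)
print(key_indices)  # [0, 2, 4]
```

Comparing against the last kept frame rather than the immediately preceding one is a design choice of this sketch: it prevents slow drift across many near-identical frames from going undetected.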

DOI: http://dx.doi.org/10.1016/j.neunet.2024.107109
