ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition.

J Imaging

Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Osaka 599-8531, Japan.

Published: December 2023

Attention-based encoder-decoder scene text recognition (STR) architectures have proven effective at recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The proposed framework, ViTSTR-Transducer, draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and from the recurrent neural network transducer (RNN-T), which was initially introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results.
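The core architectural idea, fusing visual and linguistic features without any query-key cross-attention, can be illustrated with a short sketch. Below is a minimal, hypothetical RNN-T-style joint network in PyTorch; the class name, dimensions, and fusion-by-addition design are illustrative assumptions based on the RNN-T literature, not the authors' released implementation, whose decoder may align ViT features and label states differently.

```python
# Illustrative sketch only: an RNN-T-style joint network that fuses ViT
# encoder features with a label-prediction network by broadcast addition,
# with no cross-attention. All names (JointNetwork, d_model, etc.) are
# hypothetical and not taken from the paper's code.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj_visual = nn.Linear(d_model, d_model)  # projects ViT patch features
        self.proj_lm = nn.Linear(d_model, d_model)      # projects prediction-net states
        self.out = nn.Linear(d_model, vocab_size)       # per (t, u) token logits

    def forward(self, visual: torch.Tensor, lm: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, d_model) ViT encoder outputs, one per patch "time step"
        # lm:     (B, U, d_model) prediction-network outputs, one per emitted token
        # The broadcast sum over the (T, U) grid replaces cross-attention:
        # every visual step is paired with every label state by simple addition.
        fused = torch.tanh(
            self.proj_visual(visual).unsqueeze(2)  # (B, T, 1, d_model)
            + self.proj_lm(lm).unsqueeze(1)        # (B, 1, U, d_model)
        )
        return self.out(fused)                     # (B, T, U, vocab_size)

# Example: 196 ViT patches, a 25-token target, and a 96-symbol vocabulary.
joint = JointNetwork(d_model=384, vocab_size=96)
logits = joint(torch.randn(2, 196, 384), torch.randn(2, 25, 384))
print(logits.shape)  # torch.Size([2, 196, 25, 96])
```

Note that this joiner contains no attention softmax over keys; the alignment between visual steps and label steps emerges from the additive joint lattice, which is the transducer idea the paper borrows from RNN-T.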

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10744502
DOI: http://dx.doi.org/10.3390/jimaging9120276

Publication Analysis

Top Keywords (frequency)

vision transformer (8)
scene text (8)
text recognition (8)
language model (8)
recognition accuracy (8)
accuracy compared (8)
proposed models (8)
recognition (5)
models (5)
vitstr-transducer cross-attention-free (4)
