Scene text recognition, the final step of a scene text reading system, has made impressive progress with deep neural networks. However, existing recognition methods are devoted to handling geometrically regular or irregular scene text and remain limited when faced with arbitrarily oriented scene text. Meanwhile, previous scene text recognizers usually learn single-scale feature representations for characters of various scales, which cannot model effective contexts for different characters. In this paper, we propose a novel scale-adaptive orientation attention network for arbitrary-orientation scene text recognition, which consists of a dynamic log-polar transformer and a sequence recognition network. Specifically, the dynamic log-polar transformer learns the log-polar origin to adaptively convert arbitrary rotations and scales of scene text into shifts in the log-polar space, which helps generate rotation-aware and scale-aware visual representations. Next, the sequence recognition network is an encoder-decoder model that incorporates a novel character-level receptive field attention module to encode more valid contexts for characters of various scales. The whole architecture can be trained end to end, requiring only the word image and its corresponding ground-truth text. Extensive experiments on several public datasets demonstrate the effectiveness and superiority of the proposed method.
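The core idea of the log-polar transformer can be illustrated independently of the paper: resampling an image on a log-polar grid turns rotation about the origin into a circular shift along the angular axis (and uniform rescaling into a shift along the log-radius axis). Below is a minimal NumPy sketch of that property; the function name, grid sizes, and nearest-neighbour sampling are our illustrative assumptions, not the paper's learned, differentiable implementation.

```python
import numpy as np

def log_polar_sample(img, origin, n_rho=32, n_theta=64):
    """Sample a grayscale image on a log-polar grid centred at `origin`.

    In log-polar coordinates, rotating the image about `origin`
    becomes a circular shift along the theta axis, and a uniform
    rescaling becomes a shift along the log-rho axis.
    """
    h, w = img.shape
    cy, cx = origin
    # Largest radius that stays inside the image from this origin.
    max_r = min(cy, cx, h - 1 - cy, w - 1 - cx)
    rhos = np.exp(np.linspace(0.0, np.log(max_r), n_rho))    # log-spaced radii
    thetas = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    # Nearest-neighbour sample positions for every (rho, theta) pair.
    ys = np.round(cy + rhos[:, None] * np.sin(thetas)).astype(int)
    xs = np.round(cx + rhos[:, None] * np.cos(thetas)).astype(int)
    return img[ys, xs]                                       # shape (n_rho, n_theta)
```

For a square image rotated 90° about its centre (`np.rot90`), the log-polar map is, up to nearest-neighbour rounding, `np.roll(lp, -n_theta // 4, axis=1)` of the original map, which is the shift-invariance the recognizer can exploit.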
DOI: http://dx.doi.org/10.1109/TIP.2020.3045602
Neural Netw
December 2024
School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China.
Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlation between different modalities from redundant information according to instructions, and then make sequential decisions on visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing text knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relationship structure between multi-modal data from the perspective of causality and weakens the potential correlation between different modalities through cross-modal causality reasoning.
Sensors (Basel)
December 2024
School of Digital and Intelligent Industry, Inner Mongolia University of Science and Technology, Baotou 014010, China.
Text recognition is a rapidly evolving task with broad practical applications across multiple industries. However, due to arbitrary-shape text arrangements, irregular fonts, and unintended occlusion of characters, it remains challenging. To handle images with arbitrary-shape text arrangements and irregular fonts, we designed the Discriminative Standard Text Font (DSTF) module and the Feature Alignment and Complementary Fusion (FACF) module.
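The abstract does not spell out how FACF combines features. As a generic illustration of complementary fusion between two aligned feature maps (all names, shapes, and the sigmoid-gate design here are our assumptions, not the paper's architecture), an element-wise gate can blend a standard-font feature with the original visual feature:

```python
import numpy as np

def complementary_fusion(f_visual, f_standard, w_gate):
    """Blend two aligned feature matrices with an element-wise gate.

    f_visual, f_standard: (n, d) feature matrices of the same shape.
    w_gate: (d, d) gate weights (learned in a real model; random here).
    Returns a convex combination of the two inputs at every position.
    """
    # Sigmoid gate in (0, 1), conditioned on both inputs.
    gate = 1.0 / (1.0 + np.exp(-(f_visual + f_standard) @ w_gate))
    return gate * f_visual + (1.0 - gate) * f_standard
```

Because the gate lies in (0, 1), every output element is a convex combination of the two input features, so the fused map can lean on whichever source is more reliable per dimension.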
J Clin Med
December 2024
Clinic of Anaesthesiology and Intensive Care, Central Clinical Hospital, Medical University of Lodz, 92-213 Lodz, Poland.
The text discusses the case of a patient who experienced pneumopericardium because of a traumatic incident. It covers the causes, symptoms, and complications of pneumopericardium, including tamponade symptoms, and the imaging modalities used to confirm the diagnosis and assess complications. Various treatment options are presented, emphasizing the importance of ongoing monitoring and damage-control principles.
Neural Netw
December 2024
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China.
Visual question generation involves the generation of meaningful questions about an image. Although we have made significant progress in automatically generating a single high-quality question related to an image, existing methods often ignore the diversity and interpretability of generated questions, which are important for various daily tasks that require clear question sources. In this paper, we propose an explicitly diverse visual question generation model that aims to generate diverse questions based on interpretable question sources.
PeerJ Comput Sci
November 2024
Shi Jia Zhuang University of Applied Technology, Shijiazhuang, Hebei, China.
In the era of continuous development of computer technology, the application of artificial intelligence (AI) and big data is becoming more and more extensive. With the help of powerful computer and network technology, the art of visual communication (VISCOM) has ushered in a new chapter of digitalization and intelligence. How visual art can better achieve interdisciplinary expression between art and technology, and how to express art with newer technology, richer forms, and more appropriate means, has become a new problem in visual art creation.