Urdu text in natural scene images: a new dataset and preliminary text detection.

Hazrat Ali, Khalid Iqbal, Ghulam Mujtaba, Ahmad Fayyaz, Mohammad Farhad Bulbul, Fazal Wahab Karam, Ali Zahir

PeerJ Comput Sci

Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan.

Published: September 2021

Text detection in natural scene images for content analysis is an interesting task. The research community has made substantial progress on English and Mandarin text detection, but Urdu text extraction in natural scene images has not been well addressed. This work makes two contributions. First, a new dataset of Urdu text in natural scene images is introduced, comprising 500 standalone images acquired from real scenes. Second, a text detection pipeline is presented. The channel-enhanced Maximally Stable Extremal Regions (MSER) method is applied to extract candidate Urdu text regions from an image. A two-stage filtering mechanism then eliminates non-text candidates: in the first stage, text and noise are separated based on their geometric properties; in the second stage, a trained support vector machine (SVM) classifier discards the remaining non-text candidate regions. The surviving text candidates are linked into text lines using centroid-based vertical and horizontal distances. The text lines are then analyzed by a separate classifier based on histogram of oriented gradients (HOG) features to remove non-text regions. Extensive experiments on the locally developed dataset evaluate the method, and the results show good performance on the test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language; it provides a freely available dataset for research and a baseline for the task of Urdu text extraction.
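The geometric filtering and centroid-based linking steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Box` type, the function names, and every threshold (`min_area`, `max_aspect`, `max_dx`, `max_dy`) are assumptions chosen for illustration, and the input boxes are assumed to come from some upstream region detector such as MSER. The SVM and HOG classification stages are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one candidate region, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def centroid(self) -> Tuple[float, float]:
        return (self.x + self.w / 2.0, self.y + self.h / 2.0)


def passes_geometry(box: Box, min_area: int = 30, max_aspect: float = 8.0) -> bool:
    """Geometric filter: reject tiny or extremely elongated regions.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if box.w <= 0 or box.h <= 0:
        return False
    aspect = max(box.w / box.h, box.h / box.w)
    return box.w * box.h >= min_area and aspect <= max_aspect


def link_by_centroid(boxes: List[Box], max_dx: float = 40.0,
                     max_dy: float = 10.0) -> List[List[Box]]:
    """Group boxes into lines when their centroids are close in both axes.

    Union-find merges chains of neighbours, so A-B and B-C being close
    puts A, B, and C in the same line even if A and C are far apart.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        cx_i, cy_i = boxes[i].centroid
        for j in range(i + 1, len(boxes)):
            cx_j, cy_j = boxes[j].centroid
            if abs(cx_i - cx_j) <= max_dx and abs(cy_i - cy_j) <= max_dy:
                parent[find(j)] = find(i)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Order boxes within a line by x, and lines from top to bottom.
    lines = [sorted(g, key=lambda b: b.x) for g in groups.values()]
    lines.sort(key=lambda line: min(b.y for b in line))
    return lines
```

A caller would first keep only the boxes for which `passes_geometry` returns true and then pass the survivors to `link_by_centroid`. In a fuller pipeline, a classifier would score the remaining candidates before linking and score the linked lines afterwards.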

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459794	PMC
http://dx.doi.org/10.7717/peerj-cs.717	DOI Listing

Similar Publications

A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis.

Data Brief

December 2024

Department of Information Technology, University of Sindh, Jamshoro, Pakistan.

Mudasar Ahmed Soomro Rafia Naz Memon Asghar Ali Chandio Mehwish Leghari Muhammad Hanif Soomro

Roman Urdu text is widespread on many websites. People mostly prefer to write their social comments or product reviews in Roman Urdu, yet Roman Urdu is regarded as a non-standard language. The main reason is that Roman Urdu has no spelling rules, so people create and post their own spellings; for example, "2mro" is a nonstandard spelling of "tomorrow".

Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language.

PeerJ Comput Sci

January 2024

Department of Computer Science, National Textile University, Faisalabad, Pakistan.

Shahzad Nazir Muhammad Asif Mariam Rehman Shahbaz Ahmad

In text applications, pre-processing is deemed a significant factor in improving the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization transforms raw text into standardized text, while word tokenization splits the text into tokens or words.

An automated approach to identify sarcasm in low-resource language.

PLoS One

December 2024

Department of Computer Science, Al Ain University, Al Ain, UAE.

Shumaila Khan Iqbal Qasim Wahab Khan Aurangzeb Khan Javed Ali Khan

Sarcasm detection has emerged due to its applicability in natural language processing (NLP) but lacks substantial exploration in low-resource languages like Urdu, Arabic, Pashto, and Roman Urdu. Most of the work on identifying sarcasm is in English, and few studies have focused on low-resource languages. This research addresses the gap by exploring the efficacy of diverse machine learning (ML) algorithms in identifying sarcasm in Urdu.

Roman Urdu hate speech detection using hybrid machine learning models and hyperparameter optimization.

Sci Rep

November 2024

Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, Republic of Korea.

Waqar Ashiq Samra Kanwal Adnan Rafique Muhammad Waqas Tahir Khurshaid

With the rapid increase in social media users, cyberbullying and hate speech problems have arisen over the past years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to automatic hate speech detection using different corpora in various languages; however, research on the Urdu language is rather scarce.

IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling.

Data Brief

August 2024

Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.

Noor Mairukh Khan Arnob A Faiyaz Md Mubtasim Fuad Shah Murtaza Rashid Al Masud Baivab Das

The languages of the Indian subcontinent are underrepresented in current NLP literature. To narrow this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. This dataset is sourced from OpenSubtitles.
