Optical character recognition (OCR) is vital for converting printed material into a digital format that can be conveniently used for various purposes. A significant amount of work has been done on OCR for well-resourced languages like English. However, languages like Urdu, although spoken by a large community, face limitations in OCR due to a lack of resources and the complexity and diversity of handwritten scripts. One of the major hindrances to developing OCR for low-resource languages like Urdu is the lack of extensive datasets. Such datasets can, however, be obtained from old handwritten books whose reference text is available online. This study presents a method to leverage this resource and automatically process Urdu handwritten poetry books with corresponding text available online. The images are segmented at the sentence level using automated neighborhood-connected component analysis, followed by manual adjustment. The corresponding Unicode text for each image is obtained by web scraping followed by text similarity analysis. The resulting sample dataset consists purely of handwritten Urdu text images of poetry by Mirza Ghalib and Allama Iqbal, arguably the two most influential poets in Urdu, and comprises 2888 images with Unicode transcriptions.

• The method automates OCR dataset creation by segmenting handwritten text images and scraping corresponding text from the web for alignment.
• Handwritten images are segmented into sentences using a resource-efficient neighborhood-connected component analysis approach.
• Candidate text samples are scraped from the web, and the corresponding labels are aligned with images based on the minimum edit distance between the scraped text and the predictions of an OCR engine.
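
The label-alignment step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each segmented image already has a rough OCR prediction and a list of scraped candidate lines, and it uses plain Levenshtein distance as the edit distance. The function names `edit_distance` and `align_label` and the sample strings are hypothetical; the abstract does not name the OCR engine or the exact distance variant.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def align_label(ocr_prediction: str, candidates: list[str]) -> tuple[str, int]:
    """Pick the scraped candidate with the minimum edit distance to the OCR output."""
    if not candidates:
        raise ValueError("no candidate lines to align against")
    best = min(candidates, key=lambda text: edit_distance(ocr_prediction, text))
    return best, edit_distance(ocr_prediction, best)


# Placeholder strings (hypothetical, not taken from the dataset).
scraped = ["first scraped line", "second scraped line", "third scraped line"]
print(align_label("secnd scrapd line", scraped))  # ('second scraped line', 2)
```

Because the comparison works on Python strings, the same sketch applies to Unicode Urdu text unchanged; in practice one might normalize the text first and reject matches whose distance exceeds a threshold, leaving those images for manual review.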
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11743332 | PMC |
| http://dx.doi.org/10.1016/j.mex.2024.103130 | DOI Listing |