A scarce dataset for ancient Arabic handwritten text recognition.

Data Brief

Department of Computer Science, Islamic University, Madinah 42351, Saudi Arabia.

Published: October 2024

Deep learning-based optical character recognition (OCR), in which deep neural networks are trained to extract the text contained in an image, is an active area of research. Although the field as a whole is advancing quickly, Arabic OCR notably lacks a dataset of ancient manuscripts. Here, we fill this gap by providing both the images and the textual ground truth for a collection of ancient Arabic manuscripts. The dataset was collected from the central library of the Islamic University of Madinah and contains rich text spanning different geographies and centuries. Specifically, it comprises eight ancient books totaling forty pages, each provided as an image together with a transcription prepared by experts. Because such data has not been publicly available, the dataset is valuable to researchers and practitioners for developing, augmenting, validating, and testing deep learning models and for assessing their generalization, in both Arabic OCR and Arabic text correction.
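To illustrate how paired page images and expert transcriptions of this kind are commonly used for testing, the following is a minimal sketch that scores an OCR output against a ground-truth transcription using the character error rate (CER). The choice of CER, the function names, and the NFC normalization step are assumptions made for illustration; the data article does not prescribe an evaluation protocol.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    # Assumption: NFC normalization, so that equivalent encodings of
    # Arabic letters and diacritics are compared consistently.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: the OCR output drops one letter of a four-letter word.
ground_truth = "كتاب"
ocr_output = "كتب"
cer = character_error_rate(ground_truth, ocr_output)
```

In practice the CER would be averaged over all transcribed pages. Whether diacritics are kept or stripped before scoring is a separate design decision that this sketch leaves open.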


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11381460
DOI: http://dx.doi.org/10.1016/j.dib.2024.110813


