Recognition of the script in Serbian documents using frequency occurrence and co-occurrence analysis.

ScientificWorldJournal

Technical Faculty in Bor, University of Belgrade, Vojske Jugoslavije 12, 19210 Bor, Serbia.

Published: June 2014

A document in the Serbian language can be written in either of two scripts: Latin or Cyrillic. Although the two scripts share similar characteristics, some of their statistical measures differ considerably. This paper proposes a method for identifying the script of a document from the occurrence and co-occurrence of script types. First, each letter is assigned a script type according to its position in the baseline area. Then, a frequency analysis of script type occurrence is performed. Because Latin and Cyrillic differ, the occurrence statistics of the modeled letters are substantially dissimilar. Next, the co-occurrence matrix is computed. Analysis of the co-occurrence matrix yields a strong margin that serves as a criterion for distinguishing and recognizing the script. The proposed method is evaluated on a database that includes different types of printed and web documents. The experiments gave encouraging results.
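The pipeline described in the abstract (letter-to-type modeling, occurrence frequencies, co-occurrence matrix) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the four type labels (`B`, `A`, `D`, `F`), the partial `LATIN_TYPES` table, and all function names are assumptions introduced here, since the abstract does not list the actual script types or letter assignments.

```python
from collections import Counter

# Assumed script types by vertical position relative to the baseline area:
# 'B' = base (stays within the x-height), 'A' = ascender,
# 'D' = descender, 'F' = full (both ascender and descender).
TYPES = "BADF"

# Illustrative, incomplete mapping for plain Latin lowercase letters.
# Hypothetical: the paper's own table may differ, and Serbian letters
# with diacritics are left out here.
LATIN_TYPES = {
    **dict.fromkeys("aceimnorsuvwxz", "B"),
    **dict.fromkeys("bdfhklt", "A"),
    **dict.fromkeys("gpqy", "D"),
    "j": "F",
}


def encode(text, mapping):
    """Map each known letter to its script type; skip all other characters.

    Simplification: spaces are skipped too, so adjacency in the encoded
    sequence crosses word boundaries.
    """
    return [mapping[ch] for ch in text.lower() if ch in mapping]


def occurrence(codes):
    """Relative frequency of each script type in the encoded sequence."""
    counts = Counter(codes)
    total = len(codes) or 1  # avoid division by zero on empty input
    return {t: counts[t] / total for t in TYPES}


def cooccurrence(codes):
    """Count matrix of adjacent type pairs (row = left, column = right)."""
    index = {t: i for i, t in enumerate(TYPES)}
    matrix = [[0] * len(TYPES) for _ in TYPES]
    for left, right in zip(codes, codes[1:]):
        matrix[index[left]][index[right]] += 1
    return matrix


codes = encode("dog", LATIN_TYPES)   # d = ascender, o = base, g = descender
freq = occurrence(codes)
pairs = cooccurrence(codes)
```

A Cyrillic table would be built the same way and passed as `mapping`; the script decision would then compare the resulting frequency vectors and co-occurrence matrices against reference statistics for each script, with the decision rule itself left open here.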

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3872444
DOI: http://dx.doi.org/10.1155/2013/896328
