2009 IEEE International Conference on
Systems, Man, and Cybernetics
Abstract
In a multilingual environment where a document may contain text lines in more than one language, the different language regions of the document must be identified so that each region can be fed to the OCR system for its language. In this context, this paper proposes a monothetic algorithmic model to identify and separate text lines in Telugu, Hindi and English from a multilingual document. The proposed method uses the distinct features of the target language and searches for the text lines that possess the anticipated features. Experiments used 1500 text lines for learning and 900 text lines for testing. The reported performance is 98.5%.
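As an illustration only, the idea of testing one distinguishing feature at a time can be sketched as below. The abstract does not list the features the paper uses, so everything here is an assumption: the names `headline_ratio`, `has_headline`, `classify` and `RULES` are hypothetical, the 0.7 threshold is arbitrary, and the single example feature (a long horizontal stroke in the upper half of a line, as commonly seen in printed Devanagari) is a stand-in rather than the authors' method. Rules for Telugu and English are left out because their features are not given.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A text line as a binary image: rows of pixels, 1 = ink, 0 = background.
Line = Sequence[Sequence[int]]
Rule = Tuple[str, Callable[[Line], bool]]


def headline_ratio(line: Line) -> float:
    """Fraction of the line width covered by the densest row in the upper half."""
    if not line or not line[0]:
        return 0.0
    width = len(line[0])
    upper = line[: max(1, len(line) // 2)]
    return max(sum(row) for row in upper) / width


def has_headline(line: Line, threshold: float = 0.7) -> bool:
    """Assumed single-feature test; the threshold is an arbitrary illustration."""
    return headline_ratio(line) >= threshold


# One rule per target language, each checking a single feature (assumption).
RULES: List[Rule] = [("Hindi", has_headline)]


def classify(line: Line, rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Return the first language whose feature test passes, else None."""
    for language, predicate in rules:
        if predicate(line):
            return language
    return None


with_headline = [[1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
without_headline = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]
print(classify(with_headline), classify(without_headline))
```

A line that matches no rule is returned as `None` rather than forced into a class, so unmatched lines can be passed on to the tests for the remaining languages.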