William Clocksin

Handwriting Transcription in Arabic, Syriac and Thai


Certain languages offer new directions for developing techniques for the transcription (transliteration) of handwriting. I focus on Arabic and Syriac because there are thousands of historical documents that scholars wish to have transcribed into machine-readable form. These documents date from the 6th century AD, and are scribe-written or, after 1700, manually typeset. We therefore expect to see some natural variation in the script: more so than in modern typewritten or machine-typeset documents, but less than in free casual handwriting. This limited variation provides a feasible set of problems to solve. The Thai language is also interesting because words are not separated by spaces, and therefore additional context is required for transliteration.

My current system, written in Java, transcribes Syriac with high accuracy from old printed sources in all three scripts (Estrangelo, Serto, and East Syriac). It recognises individual characters and diacritical marks by matching contours using a dynamic programming approach. Match scores are recorded in a trellis, and a knapsack decoder then finds the optimal character sequence from the trellis. A lexicon is not required. I plan to use this system to convert a 4,800-page Syriac document into machine-readable form. The system can also be used for the transcription of Arabic.
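To make the idea of decoding a trellis concrete, here is a minimal sketch in Java. It is an illustration only, not the code of the system described above, and it swaps in a plain dynamic programme over column positions rather than the knapsack decoder itself. It assumes that each trellis entry is a candidate character spanning a range of pixel columns with a match score where higher is better, and that the best reading is the highest-scoring chain of candidates that covers the line with no gaps or overlaps. The names TrellisSketch, Candidate and decode, and the scores in the example, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: all names and the scoring convention are assumptions.
class TrellisSketch {

    /** One trellis entry: a label hypothesised to span columns [start, end). */
    static final class Candidate {
        final int start;
        final int end;
        final String label;
        final double score; // assumed: higher means a better contour match

        Candidate(int start, int end, String label, double score) {
            this.start = start;
            this.end = end;
            this.label = label;
            this.score = score;
        }
    }

    /**
     * Returns the labels of the highest-scoring chain of candidates that
     * covers columns 0..width exactly, or "" if no such chain exists.
     * Columns are assumed to be indexed in reading order.
     */
    static String decode(List<Candidate> trellis, int width) {
        double[] best = new double[width + 1];      // best total score ending at each column
        Candidate[] back = new Candidate[width + 1]; // last candidate on that best chain
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        for (int pos = 1; pos <= width; pos++) {
            for (Candidate c : trellis) {
                // Only candidates that end here and start at a reachable earlier column.
                if (c.end != pos || c.start < 0 || c.start >= pos) continue;
                if (best[c.start] == Double.NEGATIVE_INFINITY) continue;
                double total = best[c.start] + c.score;
                if (total > best[pos]) {
                    best[pos] = total;
                    back[pos] = c;
                }
            }
        }

        if (width <= 0 || back[width] == null) return "";

        // Follow the back-pointers from the end of the line to the start.
        List<String> labels = new ArrayList<>();
        for (int pos = width; pos > 0; pos = back[pos].start) {
            labels.add(0, back[pos].label);
        }
        return String.join(" ", labels);
    }

    public static void main(String[] args) {
        List<Candidate> trellis = List.of(
            new Candidate(0, 4, "alaph", 0.9),
            new Candidate(4, 10, "beth", 0.8),
            new Candidate(0, 6, "gamal", 1.0),
            new Candidate(6, 10, "dalath", 0.5),
            new Candidate(0, 10, "mim", 1.2));
        System.out.println(decode(trellis, 10)); // prints "alaph beth"
    }
}
```

In this example the chain "alaph beth" scores 1.7 in total, which beats "gamal dalath" (1.5) and the single wide candidate "mim" (1.2). Because the search considers every chain of candidates that tiles the line, the sketch, like the system described above, does not consult a lexicon.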

Previous methods I have investigated with collaborators have used Hidden Markov Models, support vector machines, and Bayesian approaches. Collaborators have included Mohammad Khorsheed (Arabic), Prem Fernando (Estrangelo) and Roonroj Nopsuwanchai (Thai).