A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching

Camelia Lemnaru, Andreea Sin-Neamțiu, Mihai-Andrei Vereș, Rodica Potolea

2012

Abstract

Information contained in historical sources is highly important for historians' research; yet extracting it manually from documents written in difficult scripts is often an expensive and time-consuming process. This paper proposes a modular system for transcribing documents written in a challenging script (German Kurrent Schrift). The solution comprises three main stages, Document Processing, Word Processing and Word Selector, chained together in a linear pipeline. The system is currently under development, with several modules in each stage already implemented and evaluated. The main focus so far has been on the character recognition module, for which a hierarchical classifier is proposed. Preliminary evaluations of the character recognition module have yielded an overall character recognition rate of approximately 82% and have identified several groups of confusable characters, for which an additional identification model is currently being investigated. Word composition based on a dictionary matching approach using the Levenshtein distance is also presented.
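As an illustration of the dictionary matching idea mentioned in the abstract, the following minimal Python sketch ranks dictionary entries by their Levenshtein distance to a recognized character string. It is not the authors' implementation: the function names `levenshtein` and `closest_words`, the `max_distance` threshold, and the sample words are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_words(candidate: str, dictionary: list[str], max_distance: int = 2):
    """Return (distance, word) pairs within max_distance, nearest first.

    max_distance is a hypothetical cut-off chosen for this sketch.
    """
    scored = ((levenshtein(candidate, word), word) for word in dictionary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    # Hypothetical recognizer output matched against a toy dictionary.
    print(closest_words("Hans", ["Haus", "Hand", "Schrift"]))
```

In a setting like the one described, the candidate string would come from the character recognition module, and ties between equally distant dictionary words would need an additional criterion that this sketch does not attempt to model.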

Paper Citation


in Harvard Style

Lemnaru C., Sin-Neamțiu A., Vereș M. and Potolea R. (2012). A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 353-357. DOI: 10.5220/0004143003530357

in Bibtex Style

@conference{kdir12,
author={Camelia Lemnaru and Andreea Sin-Neamțiu and Mihai-Andrei Vereș and Rodica Potolea},
title={A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={353-357},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004143003530357},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching
SN - 978-989-8565-29-7
AU - Lemnaru C.
AU - Sin-Neamțiu A.
AU - Vereș M.
AU - Potolea R.
PY - 2012
SP - 353
EP - 357
DO - 10.5220/0004143003530357