FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS

Costin-Gabriel Chiru, Andrei Hanganu, Traian Rebedea, Stefan Trausan-Matu

2010

Abstract

In this paper we present a text recovery method based on probabilistic post-recognition processing of the output of an Optical Character Recognition system. The proposed method attempts to fill in the gaps of missing text that result from the recognition of degraded documents. For this task, a corpus of n-grams up to 5-grams, provided by Google, is used. After presenting the general problem and alternative solutions, we describe several heuristics for applying this corpus to the task. These heuristics have been validated through a set of experiments, which are discussed together with the results obtained.
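
The abstract does not spell out the heuristics, so the following is only a generic illustration of the underlying idea, not the authors' method: given a window of recognized words with one missing word, look up matching n-grams and rank the candidate fillers by frequency. The toy counts, the `fill_gap` helper, and the `_` gap marker are all assumptions made for this sketch; a real system would query the Google n-gram data instead.

```python
from collections import Counter

# Toy n-gram counts standing in for a real corpus (hypothetical data).
NGRAM_COUNTS = Counter({
    ("the", "quick", "brown", "fox", "jumps"): 12,
    ("the", "quick", "brown", "dog", "jumps"): 3,
    ("a", "quick", "brown", "fox", "jumps"): 5,
})


def fill_gap(tokens, counts, gap="_"):
    """Return candidate words for the single gap in `tokens`, most frequent first."""
    gap_index = tokens.index(gap)
    scores = Counter()
    for ngram, count in counts.items():
        if len(ngram) != len(tokens):
            continue
        # An n-gram matches when it agrees with every known (non-gap) token.
        if all(t == gap or t == w for t, w in zip(tokens, ngram)):
            scores[ngram[gap_index]] += count
    return [word for word, _ in scores.most_common()]


candidates = fill_gap(["the", "quick", "brown", "_", "jumps"], NGRAM_COUNTS)
print(candidates)
```

Ranking by raw count is the simplest possible choice; shorter n-grams could serve as a fallback when no 5-gram matches the context.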

Paper Citation


in Harvard Style

Chiru C., Hanganu A., Rebedea T. and Trausan-Matu S. (2010). FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS. In Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT, ISBN 978-989-8425-23-2, pages 438-443. DOI: 10.5220/0002932204380443

in Bibtex Style

@conference{icsoft10,
author={Costin-Gabriel Chiru and Andrei Hanganu and Traian Rebedea and Stefan Trausan-Matu},
title={FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS},
booktitle={Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT},
year={2010},
pages={438-443},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002932204380443},
isbn={978-989-8425-23-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT
TI - FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS
SN - 978-989-8425-23-2
AU - Chiru C.
AU - Hanganu A.
AU - Rebedea T.
AU - Trausan-Matu S.
PY - 2010
SP - 438
EP - 443
DO - 10.5220/0002932204380443