FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS
Costin-Gabriel Chiru, Andrei Hanganu, Traian Rebedea, Stefan Trausan-Matu
2010
Abstract
In this paper we present a text recovery method based on probabilistic post-recognition processing of the output of an Optical Character Recognition system. The proposed method attempts to fill in the gaps of missing text that result from recognizing degraded documents. For this task, a corpus of n-grams of up to 5-grams in length, provided by Google, is used. After presenting the general problem and alternative solutions, we describe several heuristics for applying this corpus to the task. These heuristics have been validated through a set of experiments, which are discussed together with the results obtained.
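The abstract does not spell out the heuristics themselves, so the following is only a minimal sketch of the general idea of n-gram gap filling, not the authors' method. It assumes a single missing word, a tiny made-up count table standing in for the Google corpus, and a simple back-off from 5-grams to shorter n-grams; the names `NGRAM_COUNTS`, `matches`, and `fill_gap` are hypothetical.

```python
from collections import Counter

# Toy counts standing in for the Google n-gram corpus (invented values).
NGRAM_COUNTS = {
    ("the", "quick", "brown", "fox", "jumps"): 120,
    ("the", "quick", "brown", "dog", "jumps"): 15,
    ("quick", "brown", "fox"): 900,
    ("quick", "brown", "dog"): 40,
}


def matches(ngram, pos, left, right):
    """True if ngram, with its word at index pos removed, agrees with the context."""
    l, r = ngram[:pos], ngram[pos + 1:]
    if len(l) > len(left) or len(r) > len(right):
        return False
    # Compare against the words immediately adjacent to the gap.
    return tuple(left[len(left) - len(l):]) == l and tuple(right[:len(r)]) == r


def fill_gap(left, right, counts, max_n=5):
    """Pick the most frequent word for one gap, backing off to shorter n-grams."""
    for n in range(max_n, 1, -1):  # try 5-grams first, then 4-, 3-, 2-grams
        scores = Counter()
        for ngram, count in counts.items():
            if len(ngram) != n:
                continue
            for pos in range(n):  # the gap may sit at any position in the n-gram
                if matches(ngram, pos, left, right):
                    scores[ngram[pos]] += count
        if scores:
            return scores.most_common(1)[0][0]
    return None  # no n-gram in the table fits the context


best = fill_gap(["the", "quick", "brown"], ["jumps"], NGRAM_COUNTS)
```

Preferring the longest matching n-gram is one plausible design choice among many; a real system would also have to handle gaps of unknown length and partially recognized characters, which this sketch ignores.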
Paper Citation
in Harvard Style
Chiru C., Hanganu A., Rebedea T. and Trausan-Matu S. (2010). FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS. In Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT, ISBN 978-989-8425-23-2, pages 438-443. DOI: 10.5220/0002932204380443
in Bibtex Style
@conference{icsoft10,
author={Costin-Gabriel Chiru and Andrei Hanganu and Traian Rebedea and Stefan Trausan-Matu},
title={FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS},
booktitle={Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT},
year={2010},
pages={438-443},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002932204380443},
isbn={978-989-8425-23-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT
TI - FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS
SN - 978-989-8425-23-2
AU - Chiru C.
AU - Hanganu A.
AU - Rebedea T.
AU - Trausan-Matu S.
PY - 2010
SP - 438
EP - 443
DO - 10.5220/0002932204380443