Robust Template Identification of Scanned Documents

Xiaofan Feng, Abdou Youssef, Sithu Sudarsan

2012

Abstract

Identifying low-quality scanned documents is not trivial in real-world settings. Existing research, which mainly focuses on similarity-based approaches, relies on perfect string data from a document. Likewise, studies that use image processing techniques for document identification rely on clean data and large differences among templates. Both approaches fail to maintain accuracy on noisy data or when document templates are too similar to each other. In this paper, a probabilistic approach is proposed to identify the template of a scanned document. The proposed algorithm works on imperfect OCR output and on document collections containing very similar templates. Through experiment and analysis, this novel probabilistic approach is shown to achieve high accuracy on different data sets.

Paper Citation


in Harvard Style

Feng X., Youssef A. and Sudarsan S. (2012). Robust Template Identification of Scanned Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 103-110. DOI: 10.5220/0004144601030110

in Bibtex Style

@conference{kdir12,
author={Xiaofan Feng and Abdou Youssef and Sithu Sudarsan},
title={Robust Template Identification of Scanned Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={103-110},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004144601030110},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Robust Template Identification of Scanned Documents
SN - 978-989-8565-29-7
AU - Feng X.
AU - Youssef A.
AU - Sudarsan S.
PY - 2012
SP - 103
EP - 110
DO - 10.5220/0004144601030110