Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation
Nuno Moniz, Fátima Rodrigues
2012
Abstract
This paper presents an approach for processing the text of PDF documents that have a well-defined layout structure. The approach exploits the font structure of PDF documents using perceptual grouping. It consists of extracting text objects from the content stream of the documents and grouping them according to a set criterion, also making use of geometry-based regions in order to recover the correct reading order. The developed approach processes the PDF documents using logical and structural rules to extract the entities present in them, and returns an optimized XML representation of the PDF document that is suitable for re-use, for example in text categorization. The system was trained and tested with Portuguese legislation PDF documents extracted from the electronic Republic’s Diary. Evaluation shows that the approach achieves good results.
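The grouping idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TextObject` fields, the function names, and the two-column split at a fixed x coordinate are all assumptions made for illustration, and the text objects are assumed to have already been extracted from the content stream by some PDF library.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class TextObject:
    """Hypothetical record for one text object taken from a PDF content stream."""
    text: str
    font: str
    size: float
    x: float
    y: float  # PDF user space: y grows upward from the bottom of the page


def reading_order(objs, column_split):
    """Order objects using two geometric regions (assumed two-column page).

    Left column first, then top to bottom, then left to right.
    """
    return sorted(objs, key=lambda o: (o.x >= column_split, -o.y, o.x))


def group_by_font(objs, column_split):
    """Merge consecutive objects that share font name and size into blocks."""
    ordered = reading_order(objs, column_split)
    blocks = []
    for key, run in groupby(ordered, key=lambda o: (o.font, o.size)):
        blocks.append((key, " ".join(o.text for o in run)))
    return blocks
```

Under these assumptions, a run of bold text followed by regular text would yield two blocks, which a later rule-based step could label, for example, as an article heading and its body before writing the XML output.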
Paper Citation
in Harvard Style
Moniz N. and Rodrigues F. (2012). Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 123-131. DOI: 10.5220/0004103501230131
in Bibtex Style
@conference{kdir12,
author={Nuno Moniz and Fátima Rodrigues},
title={Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={123-131},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004103501230131},
isbn={978-989-8565-29-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation
SN - 978-989-8565-29-7
AU - Moniz N.
AU - Rodrigues F.
PY - 2012
SP - 123
EP - 131
DO - 10.5220/0004103501230131