Rule Management for Information Extraction from Title Pages of Academic Papers
Atsuhiro Takasu, Manabu Ohta
2014
Abstract
This paper discusses the problem of managing rules for page layout analysis and information extraction. We have been developing a system that extracts information from academic papers by exploiting both page layout and textual information. For this purpose, a conditional random field (CRF) analyzer is designed according to the layout of the target pages. Because academic papers use various layouts, we must prepare a set of rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows, rule management becomes a significant problem. For example, when should we make a new set of rules, and how can we acquire them efficiently while receiving new articles? This paper examines two scores that measure the fitness of rules and the applicability of rules learned for another type of layout. We evaluate the scores on bibliographic information extraction from title pages of academic papers and show that they are effective for measuring fitness. We also examine the sampling of training data when learning a new set of rules.
Paper Citation
in Harvard Style
Takasu A. and Ohta M. (2014). Rule Management for Information Extraction from Title Pages of Academic Papers. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 438-444. DOI: 10.5220/0004827204380444
in Bibtex Style
@conference{icpram14,
author={Atsuhiro Takasu and Manabu Ohta},
title={Rule Management for Information Extraction from Title Pages of Academic Papers},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={438-444},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004827204380444},
isbn={978-989-758-018-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Rule Management for Information Extraction from Title Pages of Academic Papers
SN - 978-989-758-018-5
AU - Takasu A.
AU - Ohta M.
PY - 2014
SP - 438
EP - 444
DO - 10.5220/0004827204380444