Rule Management for Information Extraction from Title Pages of Academic Papers

Atsuhiro Takasu, Manabu Ohta

2014

Abstract

This paper discusses the problem of managing rules for page layout analysis and information extraction. We have been developing a system that extracts information from academic papers by exploiting both page layout and textual information. For this purpose, a conditional random field (CRF) analyzer is designed according to the layout of the target pages. Because academic papers use various layouts, we must prepare a set of rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows, rule management becomes a significant problem: for example, when should we create a new set of rules, and how can we acquire them efficiently as new articles arrive? This paper examines two scores that measure the fitness of rules and the applicability of rules learned for another type of layout. We evaluate the scores on bibliographic information extraction from title pages of academic papers and show that they are effective for measuring fitness. We also examine the sampling of training data when learning a new set of rules.

Paper Citation


in Harvard Style

Takasu A. and Ohta M. (2014). Rule Management for Information Extraction from Title Pages of Academic Papers. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 438-444. DOI: 10.5220/0004827204380444

in Bibtex Style

@conference{icpram14,
author={Atsuhiro Takasu and Manabu Ohta},
title={Rule Management for Information Extraction from Title Pages of Academic Papers},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={438-444},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004827204380444},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Rule Management for Information Extraction from Title Pages of Academic Papers
SN - 978-989-758-018-5
AU - Takasu A.
AU - Ohta M.
PY - 2014
SP - 438
EP - 444
DO - 10.5220/0004827204380444