Automatically Extracting Complex Data Structures from the Web
Laura Fontán, Rafael López-García, Manuel Álvarez, Alberto Pan
2012
Abstract
This paper presents a new technique for detecting and extracting lists of structured records from Web pages. Unlike most state-of-the-art systems, our approach can detect nested data structures (sublists), and it incorporates heuristics to remove unwanted content, such as banners and navigation menus, from the data region. The article also describes the experiments we performed to validate the system. In our tests, both precision and recall exceed 90%.
Paper Citation
in Harvard Style
Fontán L., López-García R., Álvarez M. and Pan A. (2012). Automatically Extracting Complex Data Structures from the Web. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 246-251. DOI: 10.5220/0004140802460251
in Bibtex Style
@conference{kdir12,
author={Laura Fontán and Rafael López-García and Manuel Álvarez and Alberto Pan},
title={Automatically Extracting Complex Data Structures from the Web},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={246-251},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004140802460251},
isbn={978-989-8565-29-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Automatically Extracting Complex Data Structures from the Web
SN - 978-989-8565-29-7
AU - Fontán L.
AU - López-García R.
AU - Álvarez M.
AU - Pan A.
PY - 2012
SP - 246
EP - 251
DO - 10.5220/0004140802460251