Automatically Extracting Complex Data Structures from the Web

Laura Fontán, Rafael López-García, Manuel Álvarez, Alberto Pan

2012

Abstract

This paper presents a new technique for detecting and extracting lists of structured records from Web pages. Unlike most state-of-the-art systems, our approach can detect nested data structures (sublists), and it incorporates heuristics to remove unwanted content, such as banners and navigation menus, from the data region. The article also describes the experiments we performed to validate the system. The precision and recall obtained in our tests both exceed 90%.

Paper Citation


in Harvard Style

Fontán L., López-García R., Álvarez M. and Pan A. (2012). Automatically Extracting Complex Data Structures from the Web. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 246-251. DOI: 10.5220/0004140802460251

in Bibtex Style

@conference{kdir12,
author={Laura Fontán and Rafael López-García and Manuel Álvarez and Alberto Pan},
title={Automatically Extracting Complex Data Structures from the Web},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={246--251},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004140802460251},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Automatically Extracting Complex Data Structures from the Web
SN - 978-989-8565-29-7
AU - Fontán L.
AU - López-García R.
AU - Álvarez M.
AU - Pan A.
PY - 2012
SP - 246
EP - 251
DO - 10.5220/0004140802460251