Wrapper Induction by XPath Alignment

Joachim Nielandt, Robin de Mol, Antoon Bronselaer, Guy de Tré

2014

Abstract

Extracting information from large quantities of semi-structured documents is an important topic that is receiving a lot of attention. Methods that accurately define where the data can be found are pivotal in constructing a robust solution that tolerates imperfections and structural changes in the source material. In this paper we investigate a wrapper induction method that revolves around aligning XPath elements (steps), allowing users to generalise upon the training examples they give to the data extraction system. The alignment is based on a modification of the well-known Levenshtein edit distance. Once the training example XPaths have been aligned with each other, they are merged into a path that generalises the examples as precisely as possible, so that it can be used to accurately fetch the required data from the given source material.
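
The align-then-merge idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify their modification of the Levenshtein distance, so the sketch uses the plain, unmodified edit distance over XPath steps. The function names (split_steps, align, merge_step, generalise) and the merge rules (drop a differing positional index, fall back to `*` for differing tags, use `//` for a step present in only one example) are assumptions made for illustration only.

```python
import re


def split_steps(xpath):
    """Split an absolute XPath such as /html/body/div[2] into its steps."""
    return [step for step in xpath.split("/") if step]


def align(a, b):
    """Align two step lists with the plain Levenshtein edit distance.

    Returns a list of (step_a, step_b) pairs; None marks a gap.
    Assumption: unit costs, not the modified costs used in the paper.
    """
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitute
                          d[i - 1][j] + 1,         # step only in a
                          d[i][j - 1] + 1)         # step only in b
    # Backtrack from the bottom-right cell to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def merge_step(x, y):
    """Merge one aligned pair into a single, more general step (hypothetical rules)."""
    if x == y:
        return x
    if x is None or y is None:
        return ""  # empty step: joining with "/" later yields "//"
    tag_x = re.sub(r"\[\d+\]$", "", x)
    tag_y = re.sub(r"\[\d+\]$", "", y)
    # Same tag with a different index: drop the index; otherwise use a wildcard.
    return tag_x if tag_x == tag_y else "*"


def generalise(xpath_a, xpath_b):
    """Align two example XPaths and merge them into one generalised path.

    Limitation of this sketch: a gap at the very end of the path is not handled.
    """
    merged = []
    for x, y in align(split_steps(xpath_a), split_steps(xpath_b)):
        step = merge_step(x, y)
        if step == "" and merged and merged[-1] == "":
            continue  # collapse consecutive gaps into a single "//"
        merged.append(step)
    return "/" + "/".join(merged)


print(generalise("/html/body/table/tr[1]/td[2]", "/html/body/table/tr[3]/td[2]"))
# prints /html/body/table/tr/td[2]
print(generalise("/html/body/div/p", "/html/body/p"))
# prints /html/body//p
```

Two training examples that differ only in the row index are merged into a path that selects the same cell in every row. Note that the `//` fallback generalises more broadly than the examples strictly require; a cost model tuned to XPath steps, as the abstract suggests the authors use, would presumably make finer choices.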


Paper Citation


in Harvard Style

Nielandt J., de Mol R., Bronselaer A. and de Tré G. (2014). Wrapper Induction by XPath Alignment. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 492-500. DOI: 10.5220/0005124504920500

in Bibtex Style

@conference{kdir14,
author={Joachim Nielandt and Robin de Mol and Antoon Bronselaer and Guy de Tré},
title={Wrapper Induction by XPath Alignment},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={492-500},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005124504920500},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Wrapper Induction by XPath Alignment
SN - 978-989-758-048-2
AU - Nielandt J.
AU - de Mol R.
AU - Bronselaer A.
AU - de Tré G.
PY - 2014
SP - 492
EP - 500
DO - 10.5220/0005124504920500