EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES
Jianqiang Li, Yu Zhao
2009
Abstract
This paper proposes a method for identifying object-relevant information that is distributed across multiple web pages in a website. Much research has been reported on page-level web data extraction, which assumes that the input web pages contain the data records of the objects of interest. However, in many website-level data mining tasks, the group of web pages describing an object is sparsely distributed across the website, which makes page-level solutions inapplicable. This paper exploits the hierarchy model that the website builder uses to organize web pages in order to solve the problem of website-level data extraction. A new resource, the Hierarchical Navigation Path (HNP), which can be discovered from the website structure, is introduced for filtering object-relevant web pages. The retained web pages are clustered based on URL and semantic hyperlink analysis, and the entry page and detailed profile pages of each object are then identified. Empirical experiments show the effectiveness of the proposed approach.
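The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of what URL-based grouping of pages could look like, not the authors' algorithm. It assumes that pages describing the same object share a URL path prefix and that the page with the shortest path serves as the entry page. The function names, the `depth` parameter, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_by_url_prefix(urls, depth=2):
    """Group URLs by their first `depth` path segments (hypothetical heuristic)."""
    clusters = defaultdict(list)
    for url in urls:
        # Split the path into non-empty segments, e.g. ["products", "camera-x"].
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = "/".join(segments[:depth])
        clusters[key].append(url)
    return dict(clusters)


def pick_entry_page(cluster):
    """Assume the page with the shortest path is the object's entry page."""
    return min(cluster, key=lambda u: (len(urlparse(u).path), u))


# Made-up URLs for illustration only.
pages = [
    "http://example.com/products/camera-x/",
    "http://example.com/products/camera-x/specs.html",
    "http://example.com/products/camera-x/reviews.html",
    "http://example.com/products/phone-y/",
    "http://example.com/products/phone-y/specs.html",
    "http://example.com/about/contact.html",
]

clusters = cluster_by_url_prefix(pages)
entry_pages = {key: pick_entry_page(group) for key, group in clusters.items()}
```

The sketch uses the URL only. The approach described in the abstract also relies on semantic hyperlink analysis and on the HNP for filtering, neither of which is modeled here.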
Paper Citation
in Harvard Style
Li J. and Zhao Y. (2009). EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 597-604. DOI: 10.5220/0001823705970604
in Bibtex Style
@conference{webist09,
author={Jianqiang Li and Yu Zhao},
title={EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={597-604},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001823705970604},
isbn={978-989-8111-81-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES
SN - 978-989-8111-81-4
AU - Li J.
AU - Zhao Y.
PY - 2009
SP - 597
EP - 604
DO - 10.5220/0001823705970604