EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo
2005
Abstract
Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts that retrieve data from a back-end database and embed them into HTML templates. Based on this observation, several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data from large web sites, but currently their input sample pages have to be collected manually. To scale the data extraction process, this task should be automated as well. We present techniques for automatically gathering structurally similar pages from large web sites. We have developed an algorithm that takes one sample page as input and crawls the site to find pages similar in structure to the given page. The collected pages can then be fed to an automatic wrapper generator to extract data. Experiments conducted on real-life web sites gave us encouraging results.
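As a rough illustration of what "similar in structure" could mean in practice, the sketch below compares two pages by the tag paths that lead to their links. This is an assumption made for illustration only: the abstract does not specify the similarity measure, and the names `LinkPathCollector`, `link_paths`, and `structural_similarity`, as well as the choice of Jaccard similarity, are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Collects the root-to-anchor tag path of every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.paths = set()   # e.g. {"html/body/ul/li/a"}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags in between.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def link_paths(html):
    """Return the set of tag paths leading to links in the given HTML."""
    collector = LinkPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' link-path sets, in [0, 1].

    Hypothetical measure chosen for this sketch, not the paper's definition.
    """
    a, b = link_paths(html_a), link_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, a crawler could keep a page when its similarity to the sample page exceeds some threshold; the threshold and the crawling strategy are left open here because the abstract does not describe them.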
Paper Citation
in Harvard Style
Blanco L., Crescenzi V. and Merialdo P. (2005). EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 247-254. DOI: 10.5220/0001234202470254
in Bibtex Style
@conference{webist05,
author={Lorenzo Blanco and Valter Crescenzi and Paolo Merialdo},
title={EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={247-254},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001234202470254},
isbn={972-8865-20-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP
SN - 972-8865-20-1
AU - Blanco L.
AU - Crescenzi V.
AU - Merialdo P.
PY - 2005
SP - 247
EP - 254
DO - 10.5220/0001234202470254