EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo

2005

Abstract

Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts that retrieve data from a back-end database and embed them into HTML templates. Based on this observation, several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data from large web sites, but currently their input sample pages have to be collected manually. To scale the data extraction process, this task should be automated as well. We present techniques for automatically gathering structurally similar pages from large web sites. We have developed an algorithm that takes one sample page as input and crawls the site to find pages similar in structure to the given page. The collected pages can feed an automatic wrapper generator to extract data. Experiments conducted over real-life web sites gave us encouraging results.
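The idea of starting from one sample page and collecting pages with a similar structure can be illustrated with a minimal sketch. It is not the algorithm of the paper, which the abstract does not detail. It assumes that the structure of a page can be approximated by the set of its root-to-element tag paths, that two pages are compared with Jaccard similarity, and that the site is available as an in-memory mapping from URL to HTML. The names `analyze`, `similarity`, `collect_similar`, and the `threshold` parameter are hypothetical. The sketch also visits every reachable page, so it makes no attempt at the efficiency the title refers to.

```python
from collections import deque
from html.parser import HTMLParser


class _PathCollector(HTMLParser):
    """Collects root-to-element tag paths and outgoing links of one page."""

    # Elements that never have a closing tag, so they are not pushed.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def analyze(html):
    """Return (set of tag paths, list of hrefs) for an HTML string."""
    parser = _PathCollector()
    parser.feed(html)
    return parser.paths, parser.links


def similarity(a, b):
    """Jaccard similarity between two sets of tag paths (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collect_similar(site, sample_url, start_url, threshold=0.8):
    """Breadth-first visit of `site` (dict: URL -> HTML), returning the URLs
    whose structure is at least `threshold`-similar to the sample page."""
    sample_paths, _ = analyze(site[sample_url])
    seen, queue, matches = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        paths, links = analyze(site[url])
        if similarity(paths, sample_paths) >= threshold:
            matches.append(url)
        for link in links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches
```

Calling `collect_similar(site, sample_url, start_url)` returns the matching URLs in visit order; in this sketch the sample page itself is included if it is reachable from the start page. A real crawler would fetch pages over HTTP and resolve relative links, both of which are left out here.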

Paper Citation


in Harvard Style

Blanco L., Crescenzi V. and Merialdo P. (2005). EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 247-254. DOI: 10.5220/0001234202470254

in Bibtex Style

@conference{webist05,
author={Lorenzo Blanco and Valter Crescenzi and Paolo Merialdo},
title={EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={247-254},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001234202470254},
isbn={972-8865-20-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP
SN - 972-8865-20-1
AU - Blanco L.
AU - Crescenzi V.
AU - Merialdo P.
PY - 2005
SP - 247
EP - 254
DO - 10.5220/0001234202470254