WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION

Yuxin Wang, Keizo Oyama

2007

Abstract

We propose a web page classification method that considers page group structure in order to build a high-quality homepage collection. We use a support vector machine (SVM) with textual features obtained from each page and its surrounding pages. The surrounding pages are grouped according to connection type (in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower), and an independent feature subset is generated from each group. The feature subsets are then concatenated to form the feature set of a classifier. Experimental results on the ResJ-01 data set, which was manually created by the authors, and on the WebKB data set show that the proposed features are effective compared with a baseline and some prior work. By tuning the classifiers, we then build a three-way classifier that combines a recall-assured classifier and a precision-assured classifier to accurately select the pages that need manual assessment to assure the required quality. The three-way classifier is also shown to be effective in reducing the amount of manual assessment.
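The two ideas in the abstract, group-wise feature subsets and the three-way decision, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the bag-of-words counting, the "self" group key, the label strings, and the rule for combining the two classifier outputs are all assumptions made for illustration. The SVM training step is omitted.

```python
from collections import Counter


def build_feature_set(page_tokens, surrounding_pages):
    """Build one feature set from a page and its grouped surrounding pages.

    surrounding_pages is a list of (connection, level, tokens) tuples, where
    connection is "in-link", "out-link", or "directory" and level is "same",
    "upper", or "lower". Keying each token by its group keeps the subsets
    independent, which has the same effect as concatenating separate
    feature vectors. (Hypothetical representation, assumed for this sketch.)
    """
    features = Counter()
    for token in page_tokens:
        # Features of the target page itself get their own group.
        features[("self", "self", token)] += 1
    for connection, level, tokens in surrounding_pages:
        for token in tokens:
            features[(connection, level, token)] += 1
    return features


def three_way_label(recall_assured_positive, precision_assured_positive):
    """Combine two binary decisions into a three-way label.

    Assumed rule: accept what the precision-assured classifier accepts,
    reject what the recall-assured classifier rejects, and send everything
    else to manual assessment. If the two disagree in the other direction,
    this sketch trusts the precision-assured acceptance.
    """
    if precision_assured_positive:
        return "assured positive"
    if not recall_assured_positive:
        return "assured negative"
    return "uncertain"
```

In this sketch the same word contributes a separate feature for each group it appears in, so a term seen on an upper-directory in-link page is not confused with the same term on the page itself. Only pages labeled "uncertain" would be passed to a human assessor.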


Paper Citation


in Harvard Style

Wang Y. and Oyama K. (2007). WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-972-8865-78-8, pages 170-175. DOI: 10.5220/0001271701700175

in Bibtex Style

@conference{webist07,
author={Yuxin Wang and Keizo Oyama},
title={WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2007},
pages={170-175},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001271701700175},
isbn={978-972-8865-78-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION
SN - 978-972-8865-78-8
AU - Wang Y.
AU - Oyama K.
PY - 2007
SP - 170
EP - 175
DO - 10.5220/0001271701700175