CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach

Jane E. Mason, Michael Shepherd, Jack Duffy

2009

Abstract

The research reported in this paper is part of a larger project on the automatic classification of Web pages by their genres, using a distance function classification model. In this paper, we investigate the effect of several commonly used data preprocessing steps, explore the use of byte and word n-grams, and test our classification model on three Web page data sets. Our approach is to represent each Web page by a profile that is composed of fixed-length n-grams and their normalized frequencies within the document. Similarly, each of the genres in a data set is represented by a profile that is constructed by combining the n-gram profiles for each exemplar Web page of that genre, forming a centroid profile for each Web page genre. We use a distance function approach to determine the similarity between two profiles, assigning each Web page the label of the genre profile to which its profile is most similar. Our results compare very favorably to those of other researchers.
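The pipeline described in the abstract (n-gram profiles with normalized frequencies, a centroid profile per genre, and nearest-profile labelling) can be illustrated with a minimal sketch. The abstract does not state which distance function is used or how exemplar profiles are combined, so the sketch assumes a squared relative-difference distance and simple averaging of frequencies. All function names, the choice of n = 3, and the toy exemplar texts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List

Profile = Dict[bytes, float]


def ngram_profile(text: str, n: int = 3) -> Profile:
    """Map each byte n-gram of the page to its count divided by the total n-gram count."""
    data = text.encode("utf-8")
    counts: Dict[bytes, int] = defaultdict(int)
    for i in range(len(data) - n + 1):
        counts[data[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def centroid_profile(profiles: List[Profile]) -> Profile:
    """Combine exemplar profiles by averaging frequencies (assumed combination rule)."""
    summed: Dict[bytes, float] = defaultdict(float)
    for profile in profiles:
        for gram, freq in profile.items():
            summed[gram] += freq
    return {gram: value / len(profiles) for gram, value in summed.items()}


def distance(p: Profile, q: Profile) -> float:
    """Sum of squared relative frequency differences (assumed distance function)."""
    total = 0.0
    for gram in p.keys() | q.keys():
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one of the two profiles.
        total += ((a - b) / ((a + b) / 2)) ** 2
    return total


def classify(page_text: str, genre_centroids: Dict[str, Profile], n: int = 3) -> str:
    """Return the genre whose centroid profile is closest to the page's profile."""
    profile = ngram_profile(page_text, n)
    return min(genre_centroids, key=lambda genre: distance(profile, genre_centroids[genre]))


# Toy exemplars (hypothetical data, for illustration only).
exemplars = {
    "faq": ["How do I reset my password?", "What are your support hours?"],
    "shop": ["Add to cart and buy now.", "Buy now and save on shipping."],
}
centroids = {
    genre: centroid_profile([ngram_profile(page) for page in pages])
    for genre, pages in exemplars.items()
}
print(classify("How do I change my password?", centroids))
```

Under this assumed distance, an n-gram that appears in only one of the two profiles always contributes 4 to the sum, so longer profiles are penalized more heavily. A practical system would likely need to control profile length, for instance by truncating to the most frequent n-grams.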


Paper Citation


in Harvard Style

Mason J. E., Shepherd M. and Duffy J. (2009). CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 646-653. DOI: 10.5220/0001837706460653

in Bibtex Style

@conference{webist09,
author={Jane E. Mason and Michael Shepherd and Jack Duffy},
title={CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={646-653},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001837706460653},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach
SN - 978-989-8111-81-4
AU - Mason J. E.
AU - Shepherd M.
AU - Duffy J.
PY - 2009
SP - 646
EP - 653
DO - 10.5220/0001837706460653