Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification

Shuhua Liu, Thomas Forss

2014

Abstract

This research concerns the development of web content detection systems that can automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and the observation that certain categories of web pages, such as those that contain hatred and violence, are much harder to classify accurately, even when both content and structural features are already taken into account. To further improve the performance of detection systems, we bring web sentiment features into the classification models. In addition, we incorporate n-gram representation into our classification approach, based on the assumption that n-grams capture more local context information in text and thus could help to enhance topic similarity analysis. Unlike most studies, which consider only the presence or frequency count of n-grams, we use tf-idf weighted n-grams in building the content classification models. Our results show that unigram-based models, although much simpler, remain valuable and effective in web content classification. Higher-order n-gram based approaches, especially 5-gram based models that combine topic similarity features with sentiment features, bring significant improvement in precision for the Violence category and the two Racism-related web categories.
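To make the idea of tf-idf weighted n-grams for topic similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, word-level n-grams, a smoothed idf formula, and cosine similarity as the similarity measure; the function names `word_ngrams`, `tfidf_vectors` and `cosine` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Split text on whitespace and return its word-level n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Return one {ngram: tf-idf weight} dict per document.

    Assumption: idf is smoothed as log((1 + N) / (1 + df)) + 1, so an n-gram
    that occurs in every document still keeps a non-zero weight.
    """
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter(gram for c in counts for gram in c)
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for gram, freq in c.items():
            idf = math.log((1 + num_docs) / (1 + doc_freq[gram])) + 1.0
            vec[gram] = (freq / total) * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, a page's similarity to a category could be obtained by building vectors for the page and for a text that represents the category, then calling `cosine` on the pair. Such a score could then be placed alongside sentiment features in a classifier's input, which is the kind of combination the abstract describes.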

Paper Citation


in Harvard Style

Liu S. and Forss T. (2014). Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014) ISBN 978-989-758-048-2, pages 530-537. DOI: 10.5220/0005170305300537

in Bibtex Style

@conference{sstm14,
author={Shuhua Liu and Thomas Forss},
title={Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)},
year={2014},
pages={530-537},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005170305300537},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)
TI - Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification
SN - 978-989-758-048-2
AU - Liu S.
AU - Forss T.
PY - 2014
SP - 530
EP - 537
DO - 10.5220/0005170305300537