EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD

Hadi Mohammadzadeh; Thomas Gottron; Franz Schweiggert; Gholamreza Nakhaeizadeh

doi:10.5220/0003665304700475

2011

Abstract

Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Compared with other main content extraction approaches, our method, called DANAg, achieves high performance in terms of both effectiveness and efficiency. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of content and the length of code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments forming the main content. In the third phase, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
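
The four phases can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything below is an assumption made for illustration: segments are taken to be lines of the HTML file, "code" is approximated by the characters inside tags, the smoothing is a plain moving average over content length minus code length, and the boundary is the contiguous run of positive smoothed scores with the largest total. The function names (`segment_lengths`, `smooth`, `main_content_bounds`, `extract_main_content`) and the `radius` parameter are hypothetical and are not taken from the DANAg implementation.

```python
import re

# Assumption: anything between '<' and '>' counts as code, the rest as content.
TAG_RE = re.compile(r"<[^>]*>")


def segment_lengths(html):
    """Phase 1: return (content_length, code_length) for each line."""
    lengths = []
    for line in html.splitlines():
        code = sum(len(tag) for tag in TAG_RE.findall(line))
        content = len(TAG_RE.sub("", line).strip())
        lengths.append((content, code))
    return lengths


def smooth(values, radius=1):
    """Phase 2: moving average over a window of 2 * radius + 1 segments."""
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - radius):min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def main_content_bounds(scores):
    """Phase 3: find the run of positive scores with the largest sum.

    Returns a half-open (start, end) pair of line indices, or None.
    """
    best, best_sum = None, 0.0
    start, run_sum = None, 0.0
    # A trailing 0.0 sentinel closes a run that reaches the last line.
    for i, score in enumerate(list(scores) + [0.0]):
        if score > 0:
            if start is None:
                start, run_sum = i, 0.0
            run_sum += score
        elif start is not None:
            if run_sum > best_sum:
                best, best_sum = (start, i), run_sum
            start = None
    return best


def extract_main_content(html, radius=1):
    """Phase 4: strip tags from the selected region and return its text."""
    lines = html.splitlines()
    diffs = [float(content - code) for content, code in segment_lengths(html)]
    bounds = main_content_bounds(smooth(diffs, radius))
    if bounds is None:
        return ""
    start, end = bounds
    texts = (TAG_RE.sub("", line).strip() for line in lines[start:end])
    return "\n".join(text for text in texts if text)
```

Because the sketch relies only on character counts and not on vocabulary, it is language-independent in the same sense as the abstract describes; a real extractor would use a proper HTML parser in the last phase rather than a regular expression.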

Paper Citation

in Harvard Style

Mohammadzadeh H., Gottron T., Schweiggert F. and Nakhaeizadeh G. (2011). EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 462-467. DOI: 10.5220/0003665304700475

in Bibtex Style

@conference{kdir11,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={462-467},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003665304700475},
isbn={978-989-8425-79-9},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD
SN - 978-989-8425-79-9
AU - Mohammadzadeh H.
AU - Gottron T.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2011
SP - 462
EP - 467
DO - 10.5220/0003665304700475