THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION

Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, Gholamreza Nakhaeizadeh

2012

Abstract

In this paper, we introduce AdDANAg, a language-independent approach to extracting the main content of web documents. The approach combines best-of-breed techniques from recent content extraction approaches to yield better extraction results. It brings together two pre-processing steps, which serve for example to normalize the document presentation and reduce the impact of certain syntactical structures, and four phases for the actual content extraction. We show that AdDANAg performs well in terms of both effectiveness and efficiency and outperforms previous approaches.

Paper Citation


in Harvard Style

Mohammadzadeh H., Gottron T., Schweiggert F. and Nakhaeizadeh G. (2012). THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, pages 677-682. DOI: 10.5220/0003931906770682

in Bibtex Style

@conference{webist12,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={677-682},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003931906770682},
isbn={978-989-8565-08-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION
SN - 978-989-8565-08-2
AU - Mohammadzadeh H.
AU - Gottron T.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2012
SP - 677
EP - 682
DO - 10.5220/0003931906770682