A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents
Moheb Ghorbani, Hadi Mohammadzadeh, Abdolreza Nazemi
2014
Abstract
Most HTML documents on the World Wide Web contain many hyperlinks, both in the main content area and in the additional areas. Because extracting the main content of such hyperlink-rich web documents is rather complicated, this paper addresses three simple, language-independent pre-processing methods that deal with the hyperlinks so that the main content can be identified accurately. To evaluate and compare the presented methods, each of the three is combined with a prominent main content extraction method called DANAg. The results show that one of the methods delivers higher performance in terms of effectiveness than the other two.
Paper Citation
in Harvard Style
Ghorbani M., Mohammadzadeh H. and Nazemi A. (2014). A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 335-339. DOI: 10.5220/0004947503350339
in Bibtex Style
@conference{webist14,
author={Moheb Ghorbani and Hadi Mohammadzadeh and Abdolreza Nazemi},
title={A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={335-339},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004947503350339},
isbn={978-989-758-024-6},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents
SN - 978-989-758-024-6
AU - Ghorbani M.
AU - Mohammadzadeh H.
AU - Nazemi A.
PY - 2014
SP - 335
EP - 339
DO - 10.5220/0004947503350339