USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES

Hadi Mohammadzadeh, Franz Schweiggert, Gholamreza Nakhaeizadeh

2011

Abstract

In this paper, we propose a new and simple approach to extract the main content of web pages written in right-to-left (R2L) languages. One of the most important features of the proposed algorithm is its independence from the DOM tree and HTML tags. In practice, HTML tags are written in English, and the English character set occupies the code-point interval [0, 127]. Most R2L languages, such as Arabic, instead use a well-defined range of the Unicode character set that lies outside this interval. In the first phase of our approach, we use this distinction to separate the R2L characters from the English ones. Then, for each HTML file, we determine the density of R2L characters and the density of non-R2L characters. The part of the HTML file with a high density of R2L characters and a low density of non-R2L characters contains the main content of the web page with high accuracy. The proposed algorithm has been tested, evaluated, and compared with the latest main content extraction approach on 2166 selected web pages.
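
The density idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-line granularity, the `min_ratio` threshold, the function names, and the choice of the Hebrew and Arabic Unicode blocks as the R2L ranges are all assumptions made for illustration. The input is assumed to be an HTML string already decoded from UTF-8.

```python
# Assumed R2L code-point ranges (Unicode blocks); not taken from the paper.
R2L_RANGES = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
]


def is_r2l(ch):
    """True if the character's code point lies in one of the assumed R2L ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in R2L_RANGES)


def line_densities(html):
    """Return one (r2l_count, non_r2l_count) pair per line, ignoring whitespace."""
    pairs = []
    for line in html.splitlines():
        chars = [ch for ch in line if not ch.isspace()]
        r2l = sum(1 for ch in chars if is_r2l(ch))
        pairs.append((r2l, len(chars) - r2l))
    return pairs


def main_content_lines(html, min_ratio=1.0):
    """Indices of lines with at least one R2L character and
    r2l_count >= min_ratio * non_r2l_count (hypothetical threshold)."""
    return [
        i
        for i, (r2l, other) in enumerate(line_densities(html))
        if r2l > 0 and r2l >= min_ratio * other
    ]


if __name__ == "__main__":
    sample = "\n".join([
        '<div class="nav"><a href="/home">Home</a></div>',
        "<p>هذا هو المحتوى الرئيسي للصفحة</p>",
        '<div id="footer">Copyright 2011</div>',
    ])
    for i in main_content_lines(sample):
        print(i, sample.splitlines()[i])
```

On the sample, only the middle line is selected, because its Arabic characters outnumber the surrounding tag characters. The sketch never parses tags or builds a DOM tree, which mirrors the tag independence claimed in the abstract. A real page would likely need a finer-grained or smoothed density measure than a single per-line ratio.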


Paper Citation


in Harvard Style

Mohammadzadeh H., Schweiggert F. and Nakhaeizadeh G. (2011). USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES. In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT, ISBN 978-989-8425-76-8, pages 243-249. DOI: 10.5220/0003508502430249

in Bibtex Style

@conference{icsoft11,
author={Hadi Mohammadzadeh and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES},
booktitle={Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT},
year={2011},
pages={243-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003508502430249},
isbn={978-989-8425-76-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT
TI - USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES
SN - 978-989-8425-76-8
AU - Mohammadzadeh H.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2011
SP - 243
EP - 249
DO - 10.5220/0003508502430249