qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations

Jingwen Wang, Jie Wang

2015

Abstract

We present qRead, a new method for real-time, high-accuracy content extraction from web pages. Earlier approaches to content extraction rely on empirical filtering rules, Document Object Model (DOM) trees, or machine learning models. These methods have met with some success, but they may not satisfy the combined requirements of real-time extraction and high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and machine learning models can add unnecessary complexity. In contrast, qRead uses segment densities and similarities to identify the main content. In particular, qRead first filters out obvious junk content, removes HTML tags, and partitions the remaining text into natural segments. It then identifies the main content by combining the ratio of words to lines in each segment, favoring the highest ratio, with the similarity between the segment and the title. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, substantially improving accuracy over previous approaches while meeting the real-time extraction requirement.
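The pipeline described in the abstract (filter junk, strip tags, partition into segments, score by word density and title similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify these details, so the following are assumptions made only for illustration: the junk filter (script, style, noscript, and comment blocks), the segment boundary (blank lines left after tag removal), the similarity measure (fraction of title words that appear in the segment), and the way the two signals are combined (a product with a hypothetical `weight` parameter). All function names are hypothetical.

```python
import html
import re

# Assumed junk filter: script/style/noscript blocks and HTML comments.
JUNK_BLOCKS = re.compile(r"(?is)<(script|style|noscript)\b.*?</\1\s*>|<!--.*?-->")
TITLE = re.compile(r"(?is)<title[^>]*>(.*?)</title>")
TAG = re.compile(r"(?s)<[^>]+>")
WORD = re.compile(r"\w+")


def tokens(text):
    """Lowercased word tokens (assumes whitespace-delimited languages)."""
    return [w.lower() for w in WORD.findall(text)]


def split_segments(text):
    """Group consecutive non-blank lines; blank lines are assumed boundaries."""
    segments, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def score(segment, title_words, weight=1.0):
    """Words-per-line density, boosted by an assumed title-overlap measure."""
    words = tokens(" ".join(segment))
    density = len(words) / len(segment)
    if title_words:
        similarity = len(title_words & set(words)) / len(title_words)
    else:
        similarity = 0.0
    return density * (1.0 + weight * similarity)


def extract_main_content(page, weight=1.0):
    """Return the text of the highest-scoring segment, or '' if none exists."""
    match = TITLE.search(page)
    title_words = set(tokens(html.unescape(match.group(1)))) if match else set()
    body = JUNK_BLOCKS.sub("", page)   # 1. filter obvious junk
    body = TITLE.sub("", body)         #    keep the title out of the candidates
    text = html.unescape(TAG.sub("", body))  # 2. eliminate HTML tags
    segments = split_segments(text)    # 3. partition into segments
    if not segments:
        return ""
    best = max(segments, key=lambda seg: score(seg, title_words, weight))
    return "\n".join(best)
```

Calling `extract_main_content(page)` on an HTML string returns the lines of the single best-scoring segment. Regular expressions keep the sketch free of DOM construction, in the spirit of the abstract, but they are not a robust HTML parser, and the `\w+` tokenizer would need a proper word segmenter for Chinese text.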

Paper Citation


in Harvard Style

Wang J. and Wang J. (2015). qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 364-371. DOI: 10.5220/0005613603640371

in Bibtex Style

@conference{kdir15,
author={Jingwen Wang and Jie Wang},
title={qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={364-371},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005613603640371},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations
SN - 978-989-758-158-8
AU - Wang J.
AU - Wang J.
PY - 2015
SP - 364
EP - 371
DO - 10.5220/0005613603640371