Architecture for a Garbage-less and Fresh Content Search Engine

Víctor M. Prieto, Manuel Álvarez, Rafael López García, Fidel Cacheda

2012

Abstract

This paper presents the architecture of a Web search engine that integrates solutions for several current problems: Web Spam detection, Soft-404 detection, content update and resource use. To this end, the system incorporates a Web Spam detection module based on techniques presented in previous works, whose success has been assessed on well-known public datasets. For Soft-404 pages, we propose new techniques that improve on those described in the state of the art. Finally, a last module allows the search engine to detect when a page has changed by taking user interaction into account. The tests we have performed allow us to conclude that, with the proposed architecture, it is possible to achieve important improvements in the efficacy and efficiency of crawling systems, which in turn has a direct effect on the content provided to users.


Paper Citation


in Harvard Style

Prieto V. M., Álvarez M., López García R. and Cacheda F. (2012). Architecture for a Garbage-less and Fresh Content Search Engine. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 378-381. DOI: 10.5220/0004167203780381

in Bibtex Style

@conference{kdir12,
author={Víctor M. Prieto and Manuel Álvarez and Rafael López García and Fidel Cacheda},
title={Architecture for a Garbage-less and Fresh Content Search Engine},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={378-381},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004167203780381},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Architecture for a Garbage-less and Fresh Content Search Engine
SN - 978-989-8565-29-7
AU - Prieto V. M.
AU - Álvarez M.
AU - López García R.
AU - Cacheda F.
PY - 2012
SP - 378
EP - 381
DO - 10.5220/0004167203780381