String Searching in Referentially Compressed Genomes

Sebastian Wandelt, Ulf Leser

2012

Abstract

Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.

Download


Paper Citation


in Harvard Style

Wandelt S. and Leser U. (2012). String Searching in Referentially Compressed Genomes . In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 95-102. DOI: 10.5220/0004143400950102

in Bibtex Style

@conference{kdir12,
author={Sebastian Wandelt and Ulf Leser},
title={String Searching in Referentially Compressed Genomes},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={95-102},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004143400950102},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - String Searching in Referentially Compressed Genomes
SN - 978-989-8565-29-7
AU - Wandelt S.
AU - Leser U.
PY - 2012
SP - 95
EP - 102
DO - 10.5220/0004143400950102