Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers

Nic Herndon, Doina Caragea

2016

Abstract

Next-generation sequencing (NGS) technologies have made it affordable to sequence any organism, opening the door to assembling and annotating new genomes, even for non-model organisms. One option for annotating a genome is to assemble RNA-Seq reads into a transcriptome and align the transcriptome to the genome assembly to identify the protein-encoding genes. However, this approach has two problems. RNA-Seq is error prone, so the gene models generated with this technique need to be validated. In addition, this method can only capture the genes expressed at the time of sequencing. Machine learning can help address both problems by generating ab initio gene models that provide supporting evidence for the models generated with RNA-Seq and that predict additional genes not expressed during sequencing. However, machine learning algorithms need large amounts of labeled data to learn accurate classifiers, and newly sequenced, non-model organisms have insufficient labeled data. This can be addressed by leveraging the abundant labeled data from a related model organism (the source domain) and using it in conjunction with the little labeled data from the organism of interest (the target domain) to train a classifier in a domain adaptation setting. The method we propose uses this approach and generates accurate classification on the task of splice site prediction, a difficult and essential step in gene prediction. The method is simple: it combines source and target labeled data, with different weights, into one dataset, and then trains a supervised classifier on the combined dataset. Despite its simplicity it is surprisingly accurate, with highest areas under the precision-recall curve between 53.33% and 83.57%. Out of the domain adaptation classifiers evaluated (SVM, naïve Bayes, and logistic regression), this method produced the best results in 12 out of the 16 cases studied.
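The weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a position-specific naïve Bayes model over fixed-length nucleotide windows, and the function names (`train_weighted_nb`, `predict_proba`), the weight values, the pseudocount, and the toy sequences are all hypothetical choices made for illustration. Each labeled sequence contributes its domain weight, rather than a count of 1, to the class and per-position nucleotide statistics.

```python
import math

ALPHABET = "ACGT"


def train_weighted_nb(source, target, source_weight=0.2, target_weight=1.0,
                      pseudocount=1.0):
    """Train a position-specific naive Bayes model on weighted, pooled data.

    source, target: lists of (sequence, label) pairs, label in {0, 1};
    all sequences are assumed to have the same length.
    The weights and pseudocount are illustrative assumptions.
    """
    # Pool both domains into one dataset, tagging each instance with its weight.
    pooled = [(s, y, source_weight) for s, y in source]
    pooled += [(s, y, target_weight) for s, y in target]

    length = len(pooled[0][0])
    class_mass = {0: 0.0, 1: 0.0}
    counts = {y: [dict.fromkeys(ALPHABET, 0.0) for _ in range(length)]
              for y in (0, 1)}

    # Accumulate weighted counts instead of unit counts.
    for seq, y, w in pooled:
        class_mass[y] += w
        for i, ch in enumerate(seq):
            if ch in counts[y][i]:  # skip ambiguous symbols such as N
                counts[y][i][ch] += w

    total = class_mass[0] + class_mass[1]
    model = {"log_prior": {}, "log_lik": {}}
    for y in (0, 1):
        model["log_prior"][y] = math.log(
            (class_mass[y] + pseudocount) / (total + 2 * pseudocount))
        denom = class_mass[y] + len(ALPHABET) * pseudocount
        model["log_lik"][y] = [
            {ch: math.log((pos[ch] + pseudocount) / denom) for ch in ALPHABET}
            for pos in counts[y]
        ]
    return model


def predict_proba(model, seq):
    """Return the posterior probability that seq is a splice site (label 1)."""
    scores = {}
    for y in (0, 1):
        score = model["log_prior"][y]
        for i, ch in enumerate(seq):
            # Unknown symbols contribute nothing to either class.
            score += model["log_lik"][y][i].get(ch, 0.0)
        scores[y] = score
    top = max(scores.values())
    exp = {y: math.exp(s - top) for y, s in scores.items()}
    return exp[1] / (exp[0] + exp[1])


# Toy data (hypothetical): positives carry an AG dinucleotide at positions 2-3.
source = [("TTAGGT", 1), ("CCAGGA", 1), ("TTCCGT", 0), ("GGTTAC", 0)]
target = [("TCAGGC", 1), ("ACGTTA", 0)]

model = train_weighted_nb(source, target)
p_site = predict_proba(model, "CTAGGT")
p_non_site = predict_proba(model, "GGTTAC")
```

In this sketch a small `source_weight` lets the abundant source data shape the estimates without overwhelming the few target instances; in practice the weights would be tuned on held-out target data.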

Paper Citation


in Harvard Style

Herndon N. and Caragea D. (2016). Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016), ISBN 978-989-758-170-0, pages 245-252. DOI: 10.5220/0005710502450252

in Bibtex Style

@conference{bioinformatics16,
author={Nic Herndon and Doina Caragea},
title={Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers},
booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)},
year={2016},
pages={245-252},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005710502450252},
isbn={978-989-758-170-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)
TI - Ab initio Splice Site Prediction with Simple Domain Adaptation Classifiers
SN - 978-989-758-170-0
AU - Herndon N.
AU - Caragea D.
PY - 2016
SP - 245
EP - 252
DO - 10.5220/0005710502450252