Towards Unsupervised Word Error Correction in Textual Big Data

Joao Paulo Carvalho, Sérgio Curto

2014

Abstract

Large unedited technical text databases may contain information that Natural Language Processing (NLP) tools cannot properly extract because of the many word errors they contain. A good example is the MIMIC II database, where medical text reports are a direct representation of experts’ views on real-time observable data. Such reports contain valuable information that could improve predictive medical decision-making models based on physiological data, but they have so far not been used for that purpose. In this paper we propose a fuzzy-based semi-automatic method that specifically addresses the large number of word errors in such databases, allowing the direct application of NLP techniques, such as Bag of Words, to the textual data.


Paper Citation


in Harvard Style

Carvalho J. and Curto S. (2014). Towards Unsupervised Word Error Correction in Textual Big Data. In Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014) ISBN 978-989-758-053-6, pages 181-186. DOI: 10.5220/0005140401810186

in Bibtex Style

@conference{fcta14,
author={Joao Paulo Carvalho and Sérgio Curto},
title={Towards Unsupervised Word Error Correction in Textual Big Data},
booktitle={Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014)},
year={2014},
pages={181-186},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005140401810186},
isbn={978-989-758-053-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014)
TI - Towards Unsupervised Word Error Correction in Textual Big Data
SN - 978-989-758-053-6
AU - Carvalho J.
AU - Curto S.
PY - 2014
SP - 181
EP - 186
DO - 10.5220/0005140401810186