Assessing the Impact of Stemming Algorithms Applied to Judicial Jurisprudence - An Experimental Analysis
Robert A. N. de Oliveira, Methanias Colaço Júnior
2017
Abstract
Stemming algorithms are commonly used during the textual preprocessing phase to reduce data dimensionality. However, the efficacy of this reduction varies with the domain to which it is applied. Hence, this work presents an experimental analysis of the dimensionality reduction obtained by stemming a real base of judicial jurisprudence formed by four subsets of documents. With such a document base, it is necessary to adopt techniques that increase the efficiency of storing and searching this information; otherwise, both computing resources and access to justice are lost, as stakeholders may not find the document they need to plead their rights. The results show that, depending on the algorithm and the collection, there may be a reduction of up to 52% of the terms in the documents. Furthermore, we found a strong correlation between the reduction percentage and the quantity of unique terms in the original document. Among the stemming algorithms analyzed, RSLP was the most effective in terms of dimensionality reduction in all four collections studied, and it excelled when applied to judgments of the Appeals Court.
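To make the measured quantity concrete, the following minimal sketch computes the percentage reduction in unique terms after stemming a text. It is an illustration only: `toy_stem`, `TOY_SUFFIXES`, and `unique_term_reduction` are hypothetical names, the suffix list is a toy stand-in rather than the RSLP rule set, and the sample sentence is invented rather than taken from the jurisprudence collections.

```python
import re

# Hypothetical toy suffix list for illustration; this is NOT the RSLP rule set.
TOY_SUFFIXES = ("mente", "es", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def unique_term_reduction(text: str) -> float:
    """Return the percentage drop in the number of unique terms after stemming."""
    original = set(re.findall(r"\w+", text.lower()))
    if not original:
        return 0.0
    stemmed = {toy_stem(term) for term in original}
    return 100.0 * (len(original) - len(stemmed)) / len(original)


# Invented sample: three singular/plural pairs collapse to three stems.
sample = "recurso recursos julgado julgados processo processos"
print(f"{unique_term_reduction(sample):.1f}%")  # prints "50.0%"
```

Replacing `toy_stem` with a real Portuguese stemmer and applying the function per document would give one reduction percentage per document, which could then be compared against each document's original count of unique terms.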
Paper Citation
in Harvard Style
N. de Oliveira R. and Colaço Júnior M. (2017). Assessing the Impact of Stemming Algorithms Applied to Judicial Jurisprudence - An Experimental Analysis. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 99-105. DOI: 10.5220/0006317100990105
in Bibtex Style
@conference{iceis17,
author={Robert A. N. de Oliveira and Methanias Colaço Júnior},
title={Assessing the Impact of Stemming Algorithms Applied to Judicial Jurisprudence - An Experimental Analysis},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={99-105},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006317100990105},
isbn={978-989-758-247-9},
}
in EndNote Style
TY  - CONF 
JO  - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI  - Assessing the Impact of Stemming Algorithms Applied to Judicial Jurisprudence - An Experimental Analysis
SN  - 978-989-758-247-9
AU  - N. de Oliveira R. 
AU  - Colaço Júnior M. 
PY  - 2017
SP  - 99
EP  - 105
DO  - 10.5220/0006317100990105