
The messages collected have characteristics that differ from those of formal conversations. Grammar rules and punctuation are rarely respected, words are often misspelled, and slang and abbreviations are used frequently. This lack of pattern makes preprocessing more difficult, since the same word can appear in several forms. For example, a user may repeat a specific letter to imitate spoken language, as in "hiii" (the correct word is "hi"). Words may also be typed without their accents, either by mistake or on purpose to save time. Another possible scenario is one in which the user writes an abbreviation instead of the word itself.
There are other problems, such as punctuation. Punctuation marks such as semicolons, exclamation marks, question marks, periods and commas are rarely used, and even when they are used, there is no guarantee that they are used correctly. A further complication is that these characters also appear in emoticons, which are graphical representations of feelings, such as ":-)" (happy), ":@" (angry) and ":-(" (unhappy).
All these particularities are challenges for the classification process. The solution adopted in this paper was to identify the main problems and to create a sequence of methods to standardise the sentences. Some of these methods were proposed in related works and others were created by the authors of this work. Methods such as lowercasing, stripping punctuation and stemming were also used by (Morris, 2013), who proposed replacing emoticons and proper names, although this paper differs in the word chosen for the replacement. The methods to remove accents and stopwords were also used by (Leite, 2015).
The preprocessing techniques applied, as described in Figure 1, are listed below; a code sketch of the full pipeline follows the list:
 
1. Lowercasing: every character is converted to lowercase.
2. Link removal.
3. Identification of laughter and its replacement with a single word that represents it. In Brazilian Portuguese, laughter can be written in many ways, such as "hahaha", "kkkkk" and "huehue". Once one of these variations is identified, it is replaced by a single word that represents laughter, so that the different forms of laughter are reduced to one form that the algorithm can recognise more easily.
4. Exclusion of sequences of repeated characters. The user may mistype and enter several identical characters in sequence, or may do so on purpose to mimic a way of speaking. In both cases, the repeated characters are collapsed into one.
5. Replacement of emoticons with a word that represents the corresponding feeling, since it is easier for the classifier to work only with words. For example, ":-)" is replaced by "happy".
6. Removal of accents, punctuation marks and numbers.
7. Replacement of nicknames, abbreviations and pejorative words with an appropriate word. Some nicknames and abbreviations are well known and widely used in chats, so they are replaced by the word they represent. In this way, whether the user writes the abbreviation or the full word, both are mapped to the same word at the end of this step. Pejorative words with several variations were also replaced by a single word.
8. Stopword removal. Stopwords are words that are not relevant for the classification process, such as articles and prepositions.
9. Proper name removal. Proper names are also not relevant for this analysis.
10. Removal of one-character words. Due to a typing error, the user may have typed an isolated character, which has no relevance for the classification.
11. Stemming. In Brazilian Portuguese, suffixes are added to the end of a word to generate another word, called a "derived word". For example, consider the word "Cachorro" (dog in Portuguese). We can also have the words "Cachorrinho" (small dog), "Cachorrão" (big dog) and "Cachorra" (female dog). All these words have essentially the same meaning, and for the classification we are only interested in the "root" of the word. The words "Cachorro", "Cachorrinho", "Cachorrão" and "Cachorra" are all transformed into "Cachorr" (the root) at the end of this step.
3.3  The Algorithm 
The Naive Bayes algorithm is well known in machine learning. It uses Bayes' theorem to calculate the probability of an attribute belonging to a particular class. It is called "naive" because it assumes that the attributes are independent, a naive premise. In text document classification, the attributes to be classified are the words in the document.
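Under this independence assumption, classifying a document reduces to comparing, for each class, the class prior multiplied by the per-word likelihoods. In standard notation (ours, not taken from the paper), for a document with words w_1, ..., w_n:

\hat{c} = \operatorname*{arg\,max}_{c} P(c \mid w_1, \ldots, w_n)
        = \operatorname*{arg\,max}_{c} P(c) \prod_{i=1}^{n} P(w_i \mid c)

The second equality holds because the evidence P(w_1, ..., w_n) is the same for every class and can be dropped from the comparison.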
The Naive Bayes classifier has two different models. In the binary model, the document is represented by a vector of binary attributes indicating which words occur and do not occur in the document. The number of times the word