PCA-BASED DATA MINING PROBABILISTIC AND FUZZY
APPROACHES WITH APPLICATIONS IN PATTERN
RECOGNITION
Luminita State
Dept. of Computer Science, University of Pitesti, Caderea Bastiliei #45, Bucuresti – 1, Romania
Catalina Cocianu
Dept. of Computer Science, Academy of Economic Studies, Calea Dorobantilor #15-17, Bucuresti –1, Romania
Panayiotis Vlamos
Ionian University, Corfu, Greece
Viorica Stefanescu
Dept. of Mathematics, Academy of Economic Studies, Calea Dorobantilor #15-17, Bucuresti –1, Romania
Keywords: Principal component analysis, fuzzy clustering, supervised learning, cluster analysis, pattern recognition, data mining.
Abstract: The aim of the paper is to develop a new learning-by-examples PCA-based algorithm for extracting skeleton information from data, to assure both good recognition performance and generalization capabilities. Here the generalization capabilities are viewed twofold: on one hand, to identify the right class for new samples coming from one of the classes taken into account and, on the other hand, to identify the samples coming from a new class. The classes are represented in the measurement/feature space by continuous repartitions, that is, the model is given by the family of density functions $\{f_h, h \in H\}$, where H stands for the finite set of hypotheses (classes). The basis of the learning process is represented by samples of possibly different sizes coming from the considered classes. The skeleton of each class is given by the principal components obtained for the corresponding sample.
1 PRINCIPAL COMPONENTS
The starting point for PCA is an n-dimensional random vector X, for which a sample X(1), X(2), ..., X(T) is available. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first-order and second-order statistics are known or can be estimated from the sample. Also, no generative model is assumed for the vector X.
In the PCA transform, the vector X is first centered by subtracting its mean, X = X - E(X). In practice, the mean of the n-dimensional vector X is estimated from the available sample. In the following, we assume that the vector X is centered. Next, X is linearly transformed to another vector Y with m elements, m < n, so that the redundancy induced by the correlations is removed. The transform consists in obtaining a rotated orthogonal coordinate system such that the elements of X in the new coordinates become uncorrelated. At the same time, the variances of the projections of X on the new coordinate axes are maximized, so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on (Hyvarinen, 2001).
In mathematical terms, consider a linear combination of the elements $x_1, x_2, \ldots, x_n$ of the vector X,

$y_1 = \sum_{i=1}^{n} w_{1i} x_i = W_1^T X$.

We look for a weight vector $W_1$ maximizing the PCA criterion,

$E(y_1^2) = E\left[(W_1^T X)^2\right] = W_1^T S W_1$, $\|W_1\| = 1$,   (1)

where the matrix S is the covariance matrix of X, given for the zero-mean vector X by the correlation matrix.

It is well known from basic linear algebra that the solution of the PCA problem is given in terms of the unit-length eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_n$ of the matrix S. The ordering of the eigenvectors is such that the corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ satisfy $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$. The solution maximizing $E(y_1^2) = W_1^T S W_1$ is given by $W_1 = \varphi_1$, and the first principal component of X is $y_1 = \varphi_1^T X$. The PCA criterion can be generalized to m principal components, $1 \leq m \leq n$. Let $y_m = \sum_{i=1}^{n} w_{mi} x_i = W_m^T X$ be the m-th principal component, with $W_m$ the corresponding unit-norm weight vector. The solution maximizing $E(y_m^2) = W_m^T S W_m$ under the constraints $E(y_m y_k) = 0$, $1 \leq k < m$, is given by $W_m = \varphi_m$ (Hyvarinen, 2001), and the m-th principal component of X is $y_m = \varphi_m^T X$.
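For illustration purposes only (a sketch in NumPy, not part of the original paper; the function and variable names are ours), the eigenstructure of the sample covariance matrix can be estimated as follows:

    import numpy as np

    def principal_components(X):
        """Estimate eigenvalues/eigenvectors of the sample covariance.

        X: (T, n) array, one observation per row.
        Returns eigenvalues in decreasing order and the matching
        unit-length eigenvectors as columns.
        """
        Xc = X - X.mean(axis=0)            # center the data
        S = np.cov(Xc, rowvar=False)       # sample covariance matrix
        lam, Phi = np.linalg.eigh(S)       # eigh: S is symmetric
        order = np.argsort(lam)[::-1]      # decreasing eigenvalues
        return lam[order], Phi[:, order]

    # Projecting on the first m axes gives the principal components:
    # Y = Xc @ Phi[:, :m]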
2 CLUSTER ANALYSIS
Cluster analysis is a data processing technique aiming to identify the natural grouping trends existing in a data collection, producing a set of overlapping clusters, the elements belonging to the same cluster sharing similar features. So far, relatively few methods have been proposed for testing the existence or inexistence of a natural grouping tendency in a data collection, most of them being based on arguments coming from mathematical statistics and heuristic graphical techniques (Panayirci and Dubes, 1983; Smith and Jain, 1984; Jain and Dubes, 1988; Tukey, 1977; Everitt, 1978).
The data are represented by p-dimensional vectors, $X = (x_1, \ldots, x_p)^T$, whose components are the values of specified attributes, and the classification is performed against a certain given label set. The classification of a data collection $\aleph = \{X_1, \ldots, X_n\} \subset \mathbb{R}^p$ corresponds to a labelling strategy of the objects of $\aleph$.

In the fuzzy approaches, the clusters are represented as fuzzy sets $u_i : \aleph \to [0, 1]$, $1 \leq i \leq c$, where $u_{ik} = u_i(X_k)$ is the membership degree of $X_k$ to the i-th cluster, $1 \leq i \leq c$, $1 \leq k \leq n$. A c-fuzzy partition is represented by the matrix $U = [u_{ik}] \in M_{c \times n}$. The number of labels c has to be selected in advance; the problem of finding the optimal c is usually referred to as cluster validation.
The main types of label vectors are crisp ($N_c$), fuzzy ($N_p$), and possibilistic ($N_{pos}$), defined as follows,

$N_c = \left\{ y = (y_1, \ldots, y_c)^T \mid y_i \in \{0, 1\}, \sum_{i=1}^{c} y_i = 1 \right\} = \{e_1, e_2, \ldots, e_c\}$,   (2)

where $(e_i)_j = \delta_{ij} = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases}$

$N_p = \left\{ y = (y_1, \ldots, y_c)^T \mid y_i \in [0, 1], \sum_{i=1}^{c} y_i = 1 \right\}$,   (3)

$N_{pos} = \left\{ y = (y_1, \ldots, y_c)^T \mid y_i \in [0, 1], \exists j,\ y_j > 0 \right\}$.   (4)

Obviously, $N_c \subset N_p \subset N_{pos}$. If we denote by $U = [U_1, \ldots, U_n] = [u_{ik}]$ a partition of $\aleph$, then, according to the types of label vectors, we get the c-partition types $M_{pos}$, $M_p$ and $M_c$,

$M_{pos} = \left\{ U \in M_{c \times n} \mid U = [U_1, \ldots, U_n],\ \forall k,\ U_k \in N_{pos},\ \forall i,\ \sum_{k=1}^{n} u_{ik} > 0 \right\}$   (5)

$M_p = \left\{ U \in M_{pos} \mid \forall k,\ U_k \in N_p \right\}$   (6)

$M_c = \left\{ U \in M_p \mid \forall k,\ U_k \in N_c \right\}$   (7)

Note that $M_c \subset M_p \subset M_{pos}$.
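As a small illustrative sketch (ours, not part of the paper), the three label-vector types (2)-(4) can be checked mechanically; since $N_c \subset N_p \subset N_{pos}$, the most specific type is reported:

    import numpy as np

    def label_type(y, tol=1e-9):
        """Classify a label vector y as 'crisp', 'fuzzy' or 'possibilistic'.

        Crisp vectors are 0/1 with entries summing to 1, fuzzy vectors lie
        in [0,1] and sum to 1, possibilistic vectors lie in [0,1] with at
        least one positive entry, cf. (2)-(4).
        """
        y = np.asarray(y, dtype=float)
        if np.any(y < -tol) or np.any(y > 1 + tol) or not np.any(y > tol):
            return None                      # not a valid label vector
        binary = np.all((np.abs(y) < tol) | (np.abs(y - 1) < tol))
        if abs(y.sum() - 1) < tol:
            return 'crisp' if binary else 'fuzzy'
        return 'possibilistic'

    # label_type([0, 1, 0]) -> 'crisp'; label_type([0.3, 0.7, 0]) -> 'fuzzy'
    # label_type([0.9, 0.8, 0.1]) -> 'possibilistic'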
3 C-MEANS MODEL
The most popular classification methods are the c-means algorithms. The variational problem corresponding to the c-means model is given by

$\min_{U, V} \left\{ J_m(U, V; w) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m D_{ik}^2 + \sum_{i=1}^{c} w_i \sum_{k=1}^{n} (1 - u_{ik})^m \right\}$   (8)

where $U \in M_c / M_p / M_{pos}$, $V = (v_1, \ldots, v_c) \in M_{c \times p}$, $v_i$ is the centroid of the i-th cluster, $w = (w_1, \ldots, w_c)^T$ is the penalties vector corresponding to the cluster system, $m \geq 1$ is the fuzzyfication degree, and $D_{ik}^2 = \|x_k - v_i\|^2$.

Let us denote by $(\hat{U}, \hat{V})$ a solution of (8). Then:

1. The crisp model: $(U, V) \in M_c \times M_{c \times p}$, $w_i = 0$, $1 \leq i \leq c$,

$\hat{u}_{ik} = \begin{cases} 1, & D_{ik} \leq D_{jk},\ \forall j \neq i \\ 0, & \text{otherwise} \end{cases}$   (9)

$\hat{v}_i = \frac{\sum_{k=1}^{n} \hat{u}_{ik} x_k}{\sum_{k=1}^{n} \hat{u}_{ik}}$, $1 \leq i \leq c$, $1 \leq k \leq n$.   (10)

2. The fuzzy model: $(U, V) \in M_p \times M_{c \times p}$, $m > 1$, $w_i = 0$, $1 \leq i \leq c$,

$\hat{u}_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{D_{ik}}{D_{jk}} \right)^{\frac{2}{m-1}} \right]^{-1}$   (11)

$\hat{v}_i = \frac{\sum_{k=1}^{n} \hat{u}_{ik}^m x_k}{\sum_{k=1}^{n} \hat{u}_{ik}^m}$, $1 \leq i \leq c$, $1 \leq k \leq n$.   (12)

3. The possibilistic model: $(U, V) \in M_{pos} \times M_{c \times p}$, $w_i > 0$, $\forall i$,

$\hat{u}_{ik} = \left[ 1 + \left( \frac{D_{ik}^2}{w_i} \right)^{\frac{1}{m-1}} \right]^{-1}$, with $\hat{v}_i$ computed as in (12), $1 \leq i \leq c$, $1 \leq k \leq n$.   (13)

The general scheme of a cluster procedure is:

    t ← 0; initialize V^(0)
    repeat
        t ← t + 1
        U^(t) ← F(V^(t-1))
        V^(t) ← G(U^(t))
    until t = T or ||V^(t) - V^(t-1)|| ≤ ε
    (U, V) ← (U^(t), V^(t))

where c is the given number of clusters, T is an upper limit on the number of iterations, m is the weight parameter, $1 \leq m < \infty$, ε gives the terminal condition, w is the system of weights, $w_i > 0$, $\forall i$, $V^{(0)} = (v_1^{(0)}, \ldots, v_c^{(0)}) \in M_{c \times p}$ is the initial system of centroids, and F, G are the updating functions.
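For illustration, a minimal NumPy sketch of the fuzzy model, alternating the updates (11) and (12); the function name, argument layout and the random initialization are our own assumptions, not part of the paper:

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, T=100, eps=1e-6, rng=None):
        """Alternating optimization of (8) with w_i = 0 (fuzzy model).

        X: (n, p) data, c: number of clusters, m > 1: fuzzyfication degree.
        Returns the membership matrix U (c, n) and the centroids V (c, p).
        """
        rng = np.random.default_rng(rng)
        n, p = X.shape
        V = X[rng.choice(n, size=c, replace=False)]   # initial centroids
        for _ in range(T):
            D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # (c, n)
            D2 = np.maximum(D2, 1e-12)             # guard against D = 0
            U = 1.0 / (D2 ** (1.0 / (m - 1)))      # update (11), unnormalized
            U /= U.sum(axis=0, keepdims=True)      # columns sum to 1
            Um = U ** m
            V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)     # update (12)
            if np.linalg.norm(V_new - V) <= eps:   # terminal condition
                V = V_new
                break
            V = V_new
        return U, V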
4 PCA-BASED ALGORITHM FOR
EXTRACTING SKELETON
INFORMATION
In the following, a new learning-by-examples PCA-based algorithm for extracting skeleton information from data, assuring both good recognition performance and generalization capabilities, is developed. Here the generalization capabilities are viewed twofold: on one hand, to identify the right class for new samples coming from one of the classes taken into account and, on the other hand, to identify the samples coming from a new class. The classes are represented in the measurement/feature space by continuous repartitions, that is, the model is given by the family of density functions $\{f_h, h \in H\}$, where H stands for the finite set of hypotheses (classes).
The basis of the learning process is represented by samples of possibly different sizes coming from the considered classes. The skeleton of each class is given by the principal components obtained for the corresponding sample. The recognition algorithm identifies the class whose skeleton is the "nearest" to the tested example, where the closeness degree is expressed in terms of the amount of disturbance determined by the decision of allotting it to the corresponding class. The model is presented as follows. Let $X_1, X_2, \ldots, X_N$ be a series of n-dimensional vectors coming from a certain class C. The sample covariance matrix is
$\Sigma_N = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \mu_N)(X_i - \mu_N)^T$,   (14)

where $\mu_N = \frac{1}{N} \sum_{i=1}^{N} X_i$.
We denote by $\lambda_1^N \geq \lambda_2^N \geq \ldots \geq \lambda_n^N$ the eigenvalues and by $\psi_1^N, \ldots, \psi_n^N$ the corresponding orthonormal eigenvectors of $\Sigma_N$.
If $X_{N+1}$ is a new sample, then, for the series $X_1, X_2, \ldots, X_N, X_{N+1}$, we get

$\Sigma_{N+1} = \frac{N-1}{N}\,\Sigma_N + \frac{1}{N+1}\,(X_{N+1} - \mu_N)(X_{N+1} - \mu_N)^T$   (15)
Lemma. In case the eigenvalues of $\Sigma_N$ are pairwise distinct, the following first order approximations hold,

$\lambda_i^{N+1} = \lambda_i^N + \Delta\lambda_i^N = (\psi_i^N)^T \Sigma_{N+1}\, \psi_i^N$   (16)

$\psi_i^{N+1} = \psi_i^N + \sum_{\substack{j=1 \\ j \neq i}}^{n} \frac{(\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N}{\lambda_i^N - \lambda_j^N}\, \psi_j^N$   (17)
Proof. Using perturbation theory, we get $\Sigma_{N+1} = \Sigma_N + \Delta\Sigma_N$, $\psi_i^{N+1} = \psi_i^N + \Delta\psi_i^N$ and $\lambda_i^{N+1} = \lambda_i^N + \Delta\lambda_i^N$, $1 \leq i \leq n$. Then,

$\Delta\Sigma_N = \frac{1}{N+1}(X_{N+1} - \mu_N)(X_{N+1} - \mu_N)^T - \frac{1}{N}\,\Sigma_N$   (18)

$(\Sigma_N + \Delta\Sigma_N)(\psi_i^N + \Delta\psi_i^N) = (\lambda_i^N + \Delta\lambda_i^N)(\psi_i^N + \Delta\psi_i^N)$   (19)

Using first order approximations, from (19) we get,

$\Sigma_N \psi_i^N + \Sigma_N \Delta\psi_i^N + \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N \psi_i^N + \lambda_i^N \Delta\psi_i^N + \Delta\lambda_i^N \psi_i^N$   (20)

hence,

$(\psi_i^N)^T \Sigma_N \Delta\psi_i^N + (\psi_i^N)^T \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N (\psi_i^N)^T \Delta\psi_i^N + \Delta\lambda_i^N \|\psi_i^N\|^2$   (21)

Using $(\psi_i^N)^T \Sigma_N = \lambda_i^N (\psi_i^N)^T$, we obtain,

$\lambda_i^N (\psi_i^N)^T \Delta\psi_i^N + (\psi_i^N)^T \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N (\psi_i^N)^T \Delta\psi_i^N + \Delta\lambda_i^N$   (22)

hence $\Delta\lambda_i^N = (\psi_i^N)^T \Delta\Sigma_N\, \psi_i^N$, that is,

$\lambda_i^{N+1} = \lambda_i^N + \Delta\lambda_i^N = (\psi_i^N)^T \Sigma_{N+1}\, \psi_i^N$   (23)

The first order approximations of the orthonormal eigenvectors of $\Sigma_{N+1}$ can be derived using the expansion of each vector $\Delta\psi_i^N$ in the basis represented by the orthonormal eigenvectors of $\Sigma_N$,

$\Delta\psi_i^N = \sum_{j=1}^{n} b_{j,i}\, \psi_j^N$,   (24)

where

$b_{j,i} = (\psi_j^N)^T \Delta\psi_i^N$.   (25)

Using the orthonormality, we get,

$1 = \|\psi_i^{N+1}\|^2 = (\psi_i^N + \Delta\psi_i^N)^T(\psi_i^N + \Delta\psi_i^N) \approx 1 + 2 (\psi_i^N)^T \Delta\psi_i^N$,   (26)

that is,

$b_{i,i} = (\psi_i^N)^T \Delta\psi_i^N = 0$.   (27)

Using (19), the approximation

$\Sigma_N \Delta\psi_i^N + \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N \Delta\psi_i^N + \Delta\lambda_i^N \psi_i^N$   (28)

holds for each $1 \leq i \leq n$.

For $1 \leq j \neq i \leq n$, from (28) we obtain the following equations,

$(\psi_j^N)^T \Sigma_N \Delta\psi_i^N + (\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N (\psi_j^N)^T \Delta\psi_i^N + \Delta\lambda_i^N (\psi_j^N)^T \psi_i^N$   (29)

$(\psi_j^N)^T \Sigma_N \Delta\psi_i^N + (\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N (\psi_j^N)^T \Delta\psi_i^N$   (30)

$\lambda_j^N (\psi_j^N)^T \Delta\psi_i^N + (\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N \approx \lambda_i^N (\psi_j^N)^T \Delta\psi_i^N$   (31)

From (31) we get,

$(\lambda_i^N - \lambda_j^N)(\psi_j^N)^T \Delta\psi_i^N = (\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N$   (32)

$b_{j,i} = (\psi_j^N)^T \Delta\psi_i^N = \frac{(\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N}{\lambda_i^N - \lambda_j^N}$   (33)

Consequently, the first order approximations of the eigenvectors of $\Sigma_{N+1}$ are,

$\psi_i^{N+1} = \psi_i^N + \Delta\psi_i^N = \psi_i^N + \sum_{\substack{j=1 \\ j \neq i}}^{n} \frac{(\psi_j^N)^T \Delta\Sigma_N\, \psi_i^N}{\lambda_i^N - \lambda_j^N}\, \psi_j^N$   (34)
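For illustration only, the updates (16), (18) and (34) can be coded directly; the function below is our own sketch (the name and array layout are assumptions):

    import numpy as np

    def update_eigenstructure(Sigma, mu, lam, Psi, x_new, N):
        """First order update of the eigenpairs when x_new joins the sample.

        Sigma: (n, n) covariance of the first N samples, mu: their mean,
        lam: (n,) eigenvalues, Psi: (n, n) eigenvectors as columns.
        """
        d = x_new - mu
        dSigma = np.outer(d, d) / (N + 1) - Sigma / N               # eq. (18)
        lam_new = lam + np.einsum('ji,jk,ki->i', Psi, dSigma, Psi)  # eq. (16)

        Psi_new = Psi.copy()
        n = len(lam)
        for i in range(n):
            for j in range(n):
                if j != i:
                    coef = Psi[:, j] @ dSigma @ Psi[:, i]
                    Psi_new[:, i] += coef / (lam[i] - lam[j]) * Psi[:, j]  # eq. (34)
        mu_new = (N * mu + x_new) / (N + 1)
        return lam_new, Psi_new, mu_new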
The skeleton of C is represented by the set of estimated principal components $\psi_1^N, \ldots, \psi_n^N$. When the example $X_{N+1}$ is included in C, the new skeleton is $\psi_1^{N+1}, \ldots, \psi_n^{N+1}$. The skeleton disturbance induced by the decision that $X_{N+1}$ has to be allotted to C is measured by

$D = \frac{1}{n} \sum_{k=1}^{n} d\!\left(\psi_k^N, \psi_k^{N+1}\right)$   (35)
The crisp classification procedure identifies for each example the closest cluster in terms of the measure (35). Let $H = \{C_1, C_2, \ldots, C_M\}$. In order to protect against misclassifications of samples coming from new classes not belonging to H, a threshold T > 0 is imposed, that is, the example $X_{N+1}$ is allotted to one of the classes $C_j$ for which

$D = \frac{1}{n}\sum_{k=1}^{n} d\!\left(\psi_{j,k}^N, \psi_{j,k}^{N+1}\right) = \min_{1 \leq p \leq M} \frac{1}{n}\sum_{k=1}^{n} d\!\left(\psi_{p,k}^N, \psi_{p,k}^{N+1}\right)$   (36)

and D < T, where the skeleton of $C_j$ is $\psi_{j,1}^N, \ldots, \psi_{j,n}^N$.

The classification of samples for which the resulting value of D is larger than T is postponed, and the samples are kept in a new possible class CR. The reclassification of the elements of CR is then performed, followed by the decision to either reconfigure the class system or to add CR as a new class in H.
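The crisp rule (35)-(36) can be sketched as follows (our own illustration, reusing update_eigenstructure from the previous sketch; the paper does not specify the distance d between principal axes, so a sign-invariant angular distance is assumed here):

    import numpy as np

    def classify_crisp(x_new, classes, T):
        """Allot x_new to the class whose skeleton it disturbs least.

        classes: list of dicts with keys 'Sigma', 'mu', 'lam', 'Psi', 'N'.
        Returns (class index, D); index -1 means the example is kept in CR.
        """
        def d(a, b):                       # assumed distance between axes:
            return 1.0 - abs(a @ b)        # 1 - |cos angle|, sign-invariant

        best_j, best_D = -1, np.inf
        for j, c in enumerate(classes):
            _, Psi_new, _ = update_eigenstructure(
                c['Sigma'], c['mu'], c['lam'], c['Psi'], x_new, c['N'])
            D = np.mean([d(c['Psi'][:, k], Psi_new[:, k])
                         for k in range(c['Psi'].shape[1])])   # eq. (35)
            if D < best_D:
                best_j, best_D = j, D
        return (best_j, best_D) if best_D < T else (-1, best_D)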
In case of fuzzy classification, the value of the membership degree of $X_{N+1}$ to each cluster of $HT = H \cup \{CR\}$ is computed as follows. Let

$d(X_{N+1}, C_i) = \frac{1}{n}\sum_{k=1}^{n} d\!\left(\psi_{i,k}^N, \psi_{i,k}^{N+1}\right)$, $1 \leq i \leq M$,

and

$d(X_{N+1}, CR) = \begin{cases} \frac{1}{n}\sum_{k=1}^{n} d\!\left(\psi_{R,k}^N, \psi_{R,k}^{N+1}\right), & \text{if } CR \neq \emptyset \\ 1, & \text{otherwise} \end{cases}$

where the skeleton of CR is represented by the set of estimated principal components $\psi_{R,1}^N, \ldots, \psi_{R,n}^N$. Then

$\mu_{C_i}(X_{N+1}) = \frac{1}{d(X_{N+1}, C_i)\, S}$, $1 \leq i \leq M$,   (37)

$\mu_{CR}(X_{N+1}) = \begin{cases} \frac{1}{d(X_{N+1}, CR)\, S}, & \text{if } d(X_{N+1}, CR) < 1 \\ 0, & \text{otherwise} \end{cases}$   (38)

where

$S = \begin{cases} \sum_{C \in HT} \frac{1}{d(X_{N+1}, C)}, & \text{if } d(X_{N+1}, CR) < 1 \\ \sum_{C \in H} \frac{1}{d(X_{N+1}, C)}, & \text{if } d(X_{N+1}, CR) = 1 \end{cases}$   (39)
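Given the per-class disturbances, the normalized memberships (37)-(39) reduce to a few lines (again a sketch; the names are ours):

    import numpy as np

    def fuzzy_memberships(dists, d_CR):
        """dists: disturbances d(X, C_i) for the M known classes; d_CR for CR.

        Returns memberships for C_1..C_M and CR, normalized as in (37)-(39);
        the memberships sum to 1 over HT.
        """
        dists = np.asarray(dists, dtype=float)
        if d_CR < 1.0:                     # CR is non-empty and competitive
            S = (1.0 / dists).sum() + 1.0 / d_CR
            return 1.0 / (dists * S), 1.0 / (d_CR * S)
        S = (1.0 / dists).sum()            # CR excluded from normalization
        return 1.0 / (dists * S), 0.0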
5 EXPERIMENTAL RESULTS
AND CONCLUDING REMARKS
Several tests were performed on simulated data, and they pointed out the very successful performance of the proposed classification strategy.
A series of tests was performed on 4-dimensional simulated data coming from 5 classes, each of them having 50 examples. The classes are represented by normal repartitions $C_i \sim N(\mu_i, \Sigma_i)$, $1 \leq i \leq 5$, where

μ_1 = [10 11 2 -12]
Σ_1 =
    1.8369 0.8010 1.5350 0.6460
    0.8010 3.0600 3.1000 1.4720
    1.5350 3.1000 3.7500 2.0100
    0.6460 1.4720 2.0100 3.5944

μ_2 = [12 -5 8 13]
Σ_2 =
    2.3961 1.0050 0.6720 0.3745
    1.0050 2.4586 0.9305 0.5230
    0.6720 0.9305 1.7766 0.7755
    0.3745 0.5230 0.7755 1.5566

μ_3 = [-10 0 9 11]
Σ_3 =
    2.3144 0.3250 0.4600 0.6000
    0.3250 3.0725 0.0120 0.0200
    0.4600 0.0120 2.5144 1.3740
    0.6000 0.0200 1.3740 1.7300

μ_4 = [-3 14 3 -11.5]
Σ_4 =
    4.1054 0.4250 0.6029 0.9805
    0.4250 2.3724 0.9920 0.3814
    0.6029 0.9920 2.4038 0.5825
    0.9805 0.3814 0.5825 2.3618

μ_5 = [-7 -10.5 -14 11.5]
Σ_5 =
    2.9441 0.3454 0.0210 0.6540
    0.3454 1.6301 0.0240 0.6330
    0.0210 0.0240 2.4436 0.2860
    0.6540 0.6330 0.2860 1.8017
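Samples such as those of class C1 can be generated along these lines (a sketch; the paper does not specify the generator or the seed):

    import numpy as np

    rng = np.random.default_rng(0)
    mu1 = np.array([10.0, 11.0, 2.0, -12.0])
    Sigma1 = np.array([[1.8369, 0.8010, 1.5350, 0.6460],
                       [0.8010, 3.0600, 3.1000, 1.4720],
                       [1.5350, 3.1000, 3.7500, 2.0100],
                       [0.6460, 1.4720, 2.0100, 3.5944]])
    # 50 training examples from class C1; the other classes are analogous.
    X1 = rng.multivariate_normal(mu1, Sigma1, size=50)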
The classification criterion is: allot $X_{N+1}$ to $C_j$ if

$D_j = \frac{1}{m_j} \sum_{k=1}^{m_j} d\!\left(\psi_{j,k}, \psi_{j,k}^{N+1}\right) = \min_{1 \leq l \leq t} \frac{1}{m_l} \sum_{k=1}^{m_l} d\!\left(\psi_{l,k}, \psi_{l,k}^{N+1}\right)$   (40)
In order to evaluate the generalization capacities,
100 new examples were generated for each
distribution. The results are presented in Table 1.
Table 1: Results on new simulated examples.

    Class                                          C1      C2      C3        C4        C5
    Correctly classified examples                 100     100      96        99       100
    Misclassified examples                          0       0       4         1         0
                                                                (to C2)   (to C1)
    Mean value of D (correct classifications)    0.08    0.05    0.75      0.21      0.14
    Maximum value of D (correct classifications) 0.41    0.19    1.85      0.55      0.53
The evaluation of the generalization capacities in the case of examples coming from new classes was performed on 1000 samples generated from $N(\mu, \Sigma)$, where μ = [0 11 -9 -9.5] and

Σ =
    5.2261 1.6410 2.5561 1.3680
    1.6410 6.0422 1.8390 1.7925
    2.5561 1.8390 6.8986 3.1080
    1.3680 1.7925 3.1080 8.2725

The admissibility criterion for allotting a sample to a certain class is given by the maximum value of D corresponding to correct classifications. The results showed that about 975 examples were classified in CR, that is, the algorithm managed to detect the intruding examples.
REFERENCES
Al Sultan, K.S., Selim, S.Z., 1993. Global Algorithm for Fuzzy Clustering Problem, Patt. Recogn., 26, 1357-1361.
Bensaid, A., Hall, L.O., Bezdek, J.C., Clarke, L.P., 1996. Partially supervised clustering for image segmentation, Patt. Recog., 29(5), 859-871.
Bezdek, J.C., Keller, J., Krisnapuram, R., Pal, N.K., 2005. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Springer Verlag.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press.
Bensaid, H., Bezdek, J.C., Clarke, L.P., Silbiger, M.L., Arrington, J.A., Murtagh, R.F., 1996. Validity-Guided (Re)clustering with applications to image segmentation, IEEE Trans. Fuzzy Systems, 4, 112-123.
Clark, M., Hall, L.O., Goldgof, D.B., Clarke, L.P., Velthuizen, R.P., Silbiger, M.S., 1994. MRI Segmentation using Fuzzy Clustering Techniques, IEEE Engineering in Medicine and Biology.
Everitt, B.S., 1978. Graphical Techniques for Multivariate Data, North Holland, NY.
Gath, J., Geva, A.B., 1989. Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Machine Intell., 11, 773-781.
Hyvarinen, A., Karhunen, J., Oja, E., 2001. Independent Component Analysis, John Wiley & Sons.
Huang, C., Shi, Y., 2002. Towards Efficient Fuzzy Information Processing, Physica-Verlag, Heidelberg.
Jain, A.K., Dubes, R., 1988. Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ.
Jin, Y., 2003. Advanced Fuzzy Systems Design and Applications, Physica-Verlag, Heidelberg.
Krishnapuram, R., Keller, J.M., 1993. A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst., 1(2).
Li, C., Goldgof, D.B., Hall, L.O., 1993. Automatic segmentation and tissue labelling of MR brain images, IEEE Transactions on Medical Imaging, 12(4), 1033-1048.
Pal, N.R., Bezdek, J.C., 1995. On Cluster Validity for the Fuzzy c-Means Model, IEEE Trans. on Fuzzy Syst., 3(3).
Smith, S.P., Jain, A.K., 1984. Testing for uniformity in multidimensional data, IEEE Trans. Patt. Anal. and Machine Intell., 6(1), 73-81.
Wu, K-L., Yang, M-S., 2005. A cluster validity index for fuzzy clustering, Patt. Recog. Lett., 26, 1275-1291.
Zahid, N., Abouelala, O., Limouri, M., Essaid, A., 1999. Unsupervised fuzzy clustering, Patt. Recog. Lett., 20, 1.