MERGING OF DATA KNOWLEDGE IN BAYESIAN ESTIMATION
Jan Kracík
Institute of Information Theory and Automation
P.O. Box 18, 182 08 Praha 8, Czech Republic

Miroslav Kárný
Institute of Information Theory and Automation
P.O. Box 18, 182 08 Praha 8, Czech Republic
Keywords:
Bayesian estimation, prior information, multiple-participant decision making.
Abstract:
Efficient multiple-participant decision making relies on cooperation of the participants. Partially, it is reached by sharing knowledge. A specific but important case of this type is addressed here. Essentially, a participant passes to its partner a distribution on common data, and the partner uses it for correcting its Bayesian parameter estimate.
1 INTRODUCTION
Decision making (DM) is the ultimate purpose of any cognitive system serving at various scales and domains: international, state or local-community levels; particular technical, medical and societal organizations; individual human beings, etc. Attempts to optimize centrally the overall performance of a collection of mutually interacting participants soon reach the communication and evaluation complexity barriers. The use of distributed DM methodologies is then the only viable way towards the desirable efficiency. Existing solutions overcome the complexity barrier by exploiting the specificity of their application domains. Their transfer to different domains is, however, expensive in skilled manpower. None of them is able to serve as a common domain-independent pattern, and thus the real need for an applicable theory of distributed DM persists.
Careful inspection of DM (Savage, 1954; Berger, 1985) identifies the Bayesian theory as a prime candidate. A practical consequence, relevant to this paper, is that different subjects of distributed DM (participants) share probabilistic information when cooperating. Existing approaches to the combination of low-dimensional pdfs suffer from a significant ambiguity, e.g. (Jiroušek, 2003; Meneguzzo and Vecchiato, 2004). Furthermore, these approaches can hardly be integrated into the Bayesian framework. This motivated the research whose part is presented in this paper. A solution of a partial but important task is presented: the use of probabilistically described knowledge of data, provided by another participant, for improving Bayesian parameter estimation.
2 PROBLEM FORMULATION
A participant estimates an unknown finite-dimensional parameter $\Theta$ determining the parameterized model $m(\Psi_t, \Theta) \equiv f(y_t|\psi_t, \Theta) \equiv f(y_t|u_t, d(t-1), \Theta)$, where $f(\cdot|\cdot)$ is a conditional probability density function (pdf). In it, the modelled system output $y_t$ depends on a system input $u_t$ and the past data history $d(t-1) \equiv (d_0, d_1, \ldots, d_{t-1})$, $d_\tau = (y_\tau, u_\tau)$, via a finite-dimensional regression vector $\psi_t$ only. The data vector $\Psi_t$ is the coupling of the modelled output $y_t$ and of the corresponding regression vector $\psi_t$. Prior information, labelled $d_0$, is attached to the observed sequence $d_1, \ldots, d_{t-1}$.
The participant estimates $\Theta$ in the Bayesian way, i.e., evaluates the posterior pdf
$$f(\Theta|d(t)) \propto f(\Theta) \prod_{\tau=1}^{t} m(\Psi_\tau, \Theta). \quad (1)$$
The symbol $\propto$ expresses equality up to the normalizing, data-dependent proportionality factor. The prior pdf $f(\Theta) \equiv f(\Theta|d_0)$ is related to the posterior pdf by the above version of the Bayes rule iff the parameter $\Theta$ is unknown to the input generator, i.e., $f(u_t|d(t-1), \Theta) = f(u_t|d(t-1))$ (Peterka, 1981).
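The following minimal Python sketch (not part of the original paper) illustrates the Bayes rule (1) numerically: a posterior over a discretized parameter grid is built by accumulating log-likelihoods. The first-order autoregression, the noise level and the grid are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
theta_true, r = 0.6, 0.1           # assumed "true" coefficient and noise variance
y = [0.0]
for _ in range(200):               # simulate data d(t); the input u_t is omitted
    y.append(theta_true * y[-1] + rng.normal(scale=np.sqrt(r)))
y = np.array(y)

grid = np.linspace(-1, 1, 401)     # discretized values of Theta
log_post = np.zeros_like(grid)     # flat prior f(Theta) on the grid
for t in range(1, len(y)):         # add ln m(Psi_t, Theta) for each data vector
    log_post += -0.5 * (y[t] - grid * y[t - 1]) ** 2 / r
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])        # normalize the posterior pdf
print("posterior mean of Theta:", (grid * post).sum() * (grid[1] - grid[0]))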
Another participant is assumed to deal with physically the same data $d(t)$ (possibly different realizations), to generate their joint pdf $f(d(t)) = \prod_{\tau=1}^{t} f(d_\tau|d(\tau-1))$ and to evaluate the marginal pdfs $M(\Psi_\tau)$ of the data vectors. For simplicity of presentation, we assume that this function is time invariant. The pdfs $f(d_t|d(t-1))$ can be, for instance, output predictors obtained via Bayesian estimation and prediction of a model which differs from $m(\Psi_t, \Theta)$. This participant provides its knowledge of $M(\Psi_t)$ to the former one. Another possibility is to interpret $M(\Psi_t)$ as additional information provided by an expert. The question arises how this information can be used for correcting the posterior pdf of $\Theta$. An answer to this question is the problem addressed within the paper.
3 SUFFICIENT STATISTIC FOR ANY PARAMETERIZED MODEL
The Bayesian parameter estimation is described by the Bayes rule (1). It can be rewritten as follows:
$$f(\Theta|d(t)) \propto f(\Theta) \exp\left[\sum_{\tau=1}^{t} \ln(m(\Psi_\tau, \Theta))\right] = f(\Theta) \exp\left[\int \sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) \ln(m(\Psi, \Theta))\, d\Psi\right]. \quad (2)$$
The expression $\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau)$, determined by the Dirac delta function, can be interpreted as a $t$-multiple of the "empirical" pdf on the set $\Psi^*$ of possible data vectors $\Psi$. A formally clean version is obtained by the correct interpretation of $\int \delta(\Psi - \Psi_\tau)\, g(\Psi)\, d\Psi$ as the linear functional assigning to a function $g(\Psi)$ its value at $\Psi_\tau$. The quotation marks at the term empirical distribution stress that, contrary to the traditional assumptions, the involved data vectors are statistically dependent.
The presented form of the posterior pdf has an important consequence: the number of data records together with the empirical pdf of data vectors form a sufficient statistic for the estimation of any parameterized model that deals with the data vectors $\{\Psi_t\}$. Furthermore, updating the posterior pdf $f(\Theta|d(t))$ by other data records, say $d_{t+1}, \ldots, d_{\bar{t}}$, is equivalent to adding the sufficient statistic corresponding to $d_{t+1}, \ldots, d_{\bar{t}}$ to the statistic $\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau)$.
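As a check of this observation, the following sketch verifies, for a discrete-valued data vector, that the log-likelihood computed record by record coincides with the one computed from the count table, i.e., from the $t$-multiple of the empirical pdf. The model table and the data are arbitrary illustrative assumptions.

from collections import Counter
import math

data_vectors = [(0, 1), (1, 1), (0, 1), (0, 0), (1, 1)]       # Psi_1 .. Psi_t
model = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}  # m(Psi, Theta) for one fixed Theta (assumed)

direct = sum(math.log(model[psi]) for psi in data_vectors)    # record-by-record sum
counts = Counter(data_vectors)                                # t * empirical pdf
via_statistic = sum(n * math.log(model[psi]) for psi, n in counts.items())
assert math.isclose(direct, via_statistic)                    # the same log-likelihood
print("log-likelihood:", direct)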
4 MERGING DATA-BASED KNOWLEDGE
The observations made in the previous section determine the way to incorporate the knowledge expressed by $M(\Psi)$ into the parametric estimation connected with the model $m(\Psi, \Theta)$. Taking the information $M(\Psi)$ as a pdf of, say, $\nu$ virtual observations, the sufficient statistic for the posterior pdf $f(\Theta|d(t), M, \nu)$, based on both real and virtual observations, is determined by $t + \nu$ data records with the pdf
$$\frac{1}{t+\nu} \sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) + \frac{\nu}{t+\nu}\, M(\Psi)$$
in the place of the empirical pdf. Note that the idea of virtual data is quite common, e.g. (Kárný et al., 2001). For instance, Bayesian estimation with a conjugate prior pdf is often interpreted as estimation with additional virtual data (determining the original prior) and a uniform prior pdf.
Contrary to $M(\Psi)$, the weight $\nu$ assigned to the information $M(\Psi)$ is not supposed to be given. Generally, it is subjectively assigned by the participant making the parametric estimation, and expresses the weight it gives to the participant serving as an information source.
Used in this way, we get the parameter estimate that respects both knowledge sources:
$$f(\Theta|d(t), M, \nu) \propto f(\Theta) \exp\left\{\int \left[\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) + \nu M(\Psi)\right] \ln(m(\Psi, \Theta))\, d\Psi\right\} \quad (3)$$
$$\propto f(\Theta|d(t)) \exp\left\{\nu \int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi\right\}.$$
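For a discrete data vector, the correction factor in (3) is directly computable. The sketch below, in which the candidate parameter tables, $M$ and $\nu$ are all illustrative assumptions, reweights a posterior over two candidate values of $\Theta$ by $\exp\{\nu \sum_\Psi M(\Psi) \ln m(\Psi, \Theta)\}$.

import math

M = {(0,): 0.7, (1,): 0.3}                       # partner's pdf of Psi (assumed)
candidates = {                                   # m(Psi, Theta) for two Theta values (assumed)
    "Theta_a": {(0,): 0.5, (1,): 0.5},
    "Theta_b": {(0,): 0.8, (1,): 0.2},
}
nu = 10.0                                        # weight given to the partner

prior = {name: 0.5 for name in candidates}       # f(Theta | d(t)) before merging
post = {name: prior[name]
              * math.exp(nu * sum(M[psi] * math.log(m[psi]) for psi in M))
        for name, m in candidates.items()}       # correction factor of (3)
z = sum(post.values())
print({name: p / z for name, p in post.items()}) # Theta_b, closer to M, gains weight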
Remarks
1. In the proposed method, the information $M(\Psi)$ is processed "data-like" in the following sense. Suppose that $M(\Psi)$ is an empirical density from $\nu$ data records, i.e., $M(\Psi) = \frac{1}{\nu} \sum_{\tau=1}^{\nu} \delta(\Psi - \Psi_\tau)$, and the data vectors $\Psi_1, \ldots, \Psi_\nu$ arise from a sequence of data $d(\nu)$. Then, $f(\Theta|M, \nu) = f(\Theta|d(\nu))$.
2. An intuitive way to use the information $M(\Psi)$ as $\nu$ data records is to generate $\nu$ random samples from $M(\Psi)$ and evaluate the posterior pdf with these samples. For sufficiently large $\nu$, such a posterior pdf is expected to be close to the posterior $f(\Theta|M, \nu)$, as the empirical distribution converges to the real one. However, for small $\nu$, the posterior pdf based on the random samples strongly depends on their realization, while $f(\Theta|M, \nu)$ is not influenced by any randomness.
3. The "merging" weights are controlled by the optional scalar $\nu > 0$.
4. It is worth stressing that the function $M(\Psi_t)$ is to be a joint pdf of the output $y_t$ and the regression vector $\psi_t$, similarly as in the case of independent $\Psi$s.
5 EXAMPLES IN EXPONENTIAL FAMILY
Let us consider a parameterized model in the exponential family (Barndorff-Nielsen, 1978)
$$m(\Psi, \Theta) = A(\Theta) \exp\langle B(\Psi), C(\Theta)\rangle, \quad (4)$$
where $A$, $B$, $C$ are known functions of the respective arguments: $A(\Theta) \geq 0$ is scalar, $B$, $C$ are vectorial functions of compatible dimensions, and $\langle B(\Psi), C(\Theta)\rangle$ is a functional linear in the first argument.
Let us suppose that the function $M(\Psi)$ well defines the expectation
$$V \equiv \int M(\Psi) B(\Psi)\, d\Psi. \quad (5)$$
Then, the factor modifying the prior pdf has the conjugated form
$$g(\Theta, \nu, V) \equiv A(\Theta)^{\nu} \exp\langle \nu V, C(\Theta)\rangle. \quad (6)$$
If the prior pdf is also chosen as a conjugated one,
$$f(\Theta) = \frac{g(\Theta, \bar{\nu}, \bar{V})}{I(\bar{\nu}, \bar{V})}, \quad I(\nu, V) = \int g(\Theta, \nu, V)\, d\Theta, \quad (7)$$
then the posterior pdfs have the same fixed functional form, given by $g(\Theta, \nu_t, V_t)$ with the statistics $\nu_t$, $V_t$ evolving as follows:
$$\nu_t = \nu_{t-1} + 1, \quad V_t = V_{t-1} + B(\Psi_t), \quad (8)$$
$$\nu_0 = \bar{\nu} + \nu, \quad V_0 = \bar{V} + \nu V.$$
Thus, the externally supplied pdf $M(\Psi)$ adds $\nu$ and $\nu V$ to the initial values of the statistics selected by the participant that runs the parameter estimation.
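In code, the merging therefore amounts to one extra initialization step before the standard conjugate recursion. A minimal sketch follows; the statistic map $B$ and all numbers below are placeholders, not quantities from the paper.

import numpy as np

def B(psi):                        # assumed sufficient-statistic map of the model (4)
    return np.array([psi, psi ** 2])

nu_bar, V_bar = 1.0, np.zeros(2)   # prior statistics chosen by the estimating participant
nu, V = 5.0, np.array([0.2, 1.1])  # partner's weight and moment V = int M(Psi) B(Psi) dPsi

nu_t, V_t = nu_bar + nu, V_bar + nu * V     # merged initial conditions of (8)
for psi in [0.4, -0.1, 0.9]:                # recursive update with observed data
    nu_t += 1.0
    V_t += B(psi)
print(nu_t, V_t)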
If the DM task allows us to wait for collecting the statistics $\bar{V}_t = \sum_{\tau=1}^{t} B(\Psi_\tau) + \bar{V}$ and $\bar{\nu}_t = t + \bar{\nu}$ for some realization of data vectors, it is possible to select the optimal weight $\nu^o$ by maximizing the corresponding posterior likelihood function:
$$\nu^o = \operatorname*{argmax}_{\nu} \frac{I(\nu + \bar{\nu}_t,\ \bar{V}_t + \nu V)}{I(\nu,\ \nu V)}. \quad (9)$$
If we cannot wait, several competitive values of $\nu$ have to be chosen and the corresponding posterior likelihoods compared in recursive mode.
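A possible realization of this comparison is sketched below for a Bernoulli-type model, where the normalization $I(\nu, V)$ reduces to a (multivariate) Beta function of the count vector; the collected counts and the pdf $M$ are illustrative assumptions.

from math import lgamma

def log_I(V):                        # ln of int prod_i Theta_i^(V_i - 1) dTheta
    return sum(lgamma(v) for v in V) - lgamma(sum(V))

V_bar_t = [40.0, 12.0]               # collected counts, prior included (assumed)
M = [0.75, 0.25]                     # partner's pdf of the binary data vector (assumed)

best = max([0.1, 1.0, 10.0, 100.0],  # candidate weights nu, compared as in (9)
           key=lambda nu: log_I([v + nu * m for v, m in zip(V_bar_t, M)])
                          - log_I([nu * m for m in M]))
print("selected weight:", best)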
The normal ARX model is the most prominent example of a dynamic model in the exponential family. It is described by the parameterized model
$$m(\Psi_t, \Theta) \equiv N_{y_t}(\theta'\psi_t, r) \quad (10)$$
$$= \underbrace{\frac{1}{\sqrt{2\pi r}}}_{A(\Theta)} \exp\Big\{ \operatorname{tr}\Big( \underbrace{\Psi_t \Psi_t'}_{B(\Psi_t)}\ \underbrace{\Big({-\frac{1}{2r}}\,[-1, \theta']'[-1, \theta']\Big)}_{C(\Theta)} \Big)\Big\},$$
where the trace term realizes the functional $\langle\cdot,\cdot\rangle$, $N_y(\mu, \rho)$ is the normal pdf with mean $\mu$ and variance $\rho$, the regression coefficients $\theta$ and the variance $r$ form the unknown parameter $\Theta$, $\operatorname{tr}(N)$ is the trace of a matrix $N$, and $'$ denotes transposition.
The marked correspondence with the exponential family shows that the moments needed in connection with $M(\Psi)$ are the non-central second moments of the data vector $\Psi$:
$$V = \int M(\Psi)\, \Psi\Psi'\, d\Psi. \quad (11)$$
The updating (8), describing completely the posterior pdfs in the conjugate Gauss-inverse-Wishart form, can be shown to be algebraically equivalent to the recursive least-squares algorithm (Peterka, 1981). The information from the second participant simply modifies its initial conditions. Their careful choice is known to influence substantially the transient behavior of the algorithm. Often, it is vital, especially in a closed decision-making (control) loop.
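The sketch below illustrates this for the ARX case: the moment matrix (11) is approximated by sample moments of draws representing $M$ (an assumption; any way of evaluating the integral would do) and used as the initial condition of the least-squares statistic.

import numpy as np

rng = np.random.default_rng(1)
# draws standing in for M(Psi); Psi = [y, psi']' with a single regressor (assumed)
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=1000)
V = samples.T @ samples / len(samples)   # sample version of the moment matrix (11)
nu = 20.0

V_t = nu * V                             # merged initial condition of the statistic
for Psi in [np.array([0.7, 1.0]), np.array([-0.2, -0.5])]:  # observed data vectors
    V_t += np.outer(Psi, Psi)            # updating (8) with B(Psi) = Psi Psi'

theta_hat = np.linalg.solve(V_t[1:, 1:], V_t[1:, 0])  # least-squares point estimate
print("theta estimate:", theta_hat)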
The controlled Markov chain is another example of a model describing dynamic systems well. It models discrete-valued outputs that depend on a discrete-valued regression vector by the table
$$f(y_t|u_t, d(t-1), \Theta) = m(\Psi_t, \Theta) \equiv \Theta_{y_t|\psi_t} = \exp\Big( \sum_{\Psi \in \Psi^*} \underbrace{\delta(\Psi - \Psi_t)}_{B_\Psi(\Psi_t)}\ \underbrace{\ln(\Theta_{y|\psi})}_{C_\Psi(\Theta)} \Big), \quad (12)$$
where the sum realizes the functional $\langle\cdot,\cdot\rangle$ and the entries $\Theta_{y|\psi}$ form the unknown parameter $\Theta$. The parameter belongs to a subset (determined possibly by some additional information) of the convex set
$$\Theta^* \equiv \Big\{ \Theta_{y|\psi} :\ \Theta_{y|\psi} \geq 0,\ \sum_{y \in y^*} \Theta_{y|\psi} = 1 \Big\}.$$
The externally supplied model $M(\Psi)$ simply assigns probabilities to the various possible values $\Psi \in \Psi^*$, and the factor modifying the prior pdf has the form
$$\exp\Big( \nu \sum_{\Psi \in \Psi^*} M(\Psi) \ln(\Theta_{y|\psi}) \Big) = \prod_{\Psi \in \Psi^*} \Theta_{y|\psi}^{\nu M(\Psi)}. \quad (13)$$
This expression is proportional to the conjugate Dirichlet pdf determined by the table $\nu M(\Psi)$, which
can be interpreted as the number of occurrences of the data vector $\Psi$. Choosing the prior pdf $f(\Theta)$ in the Dirichlet form $\prod_{\Psi \in \Psi^*} \Theta_{y|\psi}^{\bar{V}_{y|\psi} - 1}$, the externally supplied information increases it to the initial value $V_0 = \bar{V} + \nu M$. The posterior pdf is also a Dirichlet one, given by the occurrence table $V_t$. It evolves starting from the initial value $V_0$. The updating by the observed data, $V_t = V_{t-1} + B(\Psi_t)$, adds the number of occurrences of the values $\Psi_\tau = \Psi$, $\tau \leq t$, to the $\Psi$th entry of the table $V_0$.
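A sketch of these occurrence-table manipulations follows; the tables, $\nu$ and the data are illustrative assumptions.

import numpy as np

V_bar = np.ones((2, 2))                  # prior occurrence table; rows psi, columns y
M = np.array([[0.4, 0.1],                # partner's pdf M(Psi) over Psi* (sums to 1; assumed)
              [0.2, 0.3]])
nu = 8.0
V = V_bar + nu * M                       # initial value V_0 = V_bar + nu*M

for psi, y in [(0, 1), (1, 1), (0, 0)]:  # observed data vectors Psi_t = (y_t, psi_t)
    V[psi, y] += 1.0                     # updating V_t = V_{t-1} + B(Psi_t)

Theta_hat = V / V.sum(axis=1, keepdims=True)  # normalized table as a point estimate
print(Theta_hat)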
Again, the importance of the prior knowledge can hardly be over-stressed: the estimation of controlled Markov chains is formally extremely simple, but the dimension of the occurrence table $V$ grows exponentially with the cardinality of the set $\Psi^*$. Consequently, there is a lack of data in the majority of practical cases and, moreover, their information content is, as a rule, insufficient.
6 CONCLUSIONS
The presented result is simple yet a quite powerful and practical tool. Considering a parameterized model $m(\Psi, \Theta)$ from the exponential family and a conjugate prior pdf, the posterior pdf $f(\Theta|M, \nu)$ remains in the conjugate form, as it does in "proper" Bayesian estimation. The evaluation of $\int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi$ often reduces to the evaluation of moments of $\Psi$. Moreover, a simulation model of a quite different nature than the estimated one can be used for estimating $\int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi$. In this case, the use of $M(\Psi)$ is often reduced to the evaluation of sample moments of $\Psi$.
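For illustration, a sketch of this sample-moment evaluation follows; the simulator standing in for $M$ and the normal ARX model are our assumptions.

import numpy as np

rng = np.random.default_rng(2)

def simulate_Psi(n):                        # a stand-in for the partner's simulation model
    psi = rng.normal(size=n)
    y = 0.5 * psi + rng.normal(scale=0.3, size=n)
    return np.column_stack([y, psi])

def log_m(Psi, theta, r=0.09):              # ln m(Psi, Theta) of an assumed normal ARX model
    y, psi = Psi[:, 0], Psi[:, 1]
    return -0.5 * np.log(2 * np.pi * r) - 0.5 * (y - theta * psi) ** 2 / r

Psi = simulate_Psi(5000)
for theta in [0.3, 0.5, 0.7]:               # sample-moment estimate of the integral
    print(theta, log_m(Psi, theta).mean())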
More complex models, namely probabilistic mixtures, can be estimated by the proposed method using, e.g., a slightly modified quasi-Bayes algorithm (Kárný et al., 2005).
Physically motivated models, black-box models of a much higher order than the estimated one, relationships described by production rules stimulated by real data in the past, etc. may serve as data-vector sources. In this way, model simplification and translation between various knowledge domains are addressed in a justified, purposeful and simple way.
The choice of the weight $\nu$ is an algorithmically open problem. Its solution is, however, predictable: Bayesian hypothesis testing and the real data observed by the participant should provide a flexible universal solution.
Similarly, the case when information about only some entries of $\Psi_t$ is offered remains unsolved. It is expected that extending $M$ to the set $\Psi^*$ by a very flat marginal on the "non-reported" entries of $\Psi$ will solve this problem.
Even with these open problems pending, the proposed "technology" offers a straightforward way to pass uncertain knowledge from one participant to another and thus to combine very different knowledge sources.
ACKNOWLEDGEMENTS
This research has been partially supported by GAČR grant 102/03/0049, AVČR project BADDYR 1ET100750401, MŠMT grant 1M6798555601 DAR, and AVČR project 1ET100750404.
REFERENCES
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Jiroušek, R. (2003). On experimental system for multidimensional model development MUDIN. Neural Network World, (5):513–520.
Kárný, M., Böhm, J., Guy, T., Jirsa, L., Nagy, I., Nedoma, P., and Tesař, L. (2005). Optimized Bayesian Dynamic Advising: Theory and Algorithms. Springer, London. To appear.
Kárný, M., Khailova, N., Nedoma, P., and Böhm, J. (2001). Quantification of prior information revised. International Journal of Adaptive Control and Signal Processing, 15(1):65–84.
Meneguzzo, D. and Vecchiato, W. (2004). Copula sensitivity in collateralized debt obligations and basket default swaps. Journal of Futures Markets, 24(1):37–70.
Peterka, V. (1981). Bayesian system identification. In Eykhoff, P., editor, Trends and Progress in System Identification, pages 239–304. Pergamon Press, Oxford.
Savage, L. (1954). Foundations of Statistics. Wiley, New York.