MERGING OF DATA KNOWLEDGE IN BAYESIAN ESTIMATION
Jan Kracík
Institute of Information Theory and Automation
P.O. Box 18, 182 08 Praha 8, Czech Republic

Miroslav Kárný
Institute of Information Theory and Automation
P.O. Box 18, 182 08 Praha 8, Czech Republic
Keywords:
Bayesian estimation, prior information, multiple-participant decision making.
Abstract:
Efficient multiple-participant decision making relies on cooperation of the participants. Partially, it is reached by sharing knowledge. A specific but important case of this type is addressed here. Essentially, a participant passes to its partner a distribution on common data, and the partner uses it for correcting its Bayesian parameter estimate.
1 INTRODUCTION
Decision making (DM) is the ultimate purpose of any cognitive system serving at various scales and domains: international, state or local-community levels; particular technical, medical and societal organizations; individual human beings, etc. Attempts to optimize centrally the overall performance of a collection of mutually interacting participants soon reach the communication and evaluation complexity barriers. The use of distributed DM methodologies is then the only viable way towards the desirable efficiency. Existing solutions overcome the complexity barrier by exploiting the specificity of their application domains. Their transfer to different domains is, however, expensive in skilled manpower. None of them is able to serve as a common domain-independent pattern, and thus the real need for an applicable theory of distributed DM persists.
Careful inspection of DM (Savage, 1954; Berger, 1985) identifies the Bayesian theory as a prime candidate. A practical consequence, relevant to this paper, is that different subjects of distributed DM (participants) share probabilistic information when cooperating. Existing approaches to the combination of low-dimensional pdfs suffer from a significant ambiguity, e.g. (Jiroušek, 2003; Meneguzzo and Vecchiato, 2004). Furthermore, these approaches can hardly be integrated into the Bayesian framework. This motivated the research whose part is presented in this paper. A solution of a partial but important task is presented: the use of probabilistically described knowledge of data, provided by another participant, for improving Bayesian parameter estimation.
2 PROBLEM FORMULATION
A participant estimates an unknown finite-dimensional parameter $\Theta$ determining the parameterized model $m(\Psi_t, \Theta) \equiv f(y_t|\psi_t, \Theta) \equiv f(y_t|u_t, d(t-1), \Theta)$, where $f(\cdot|\cdot)$ is a conditional probability density function (pdf). In it, the modelled system output $y_t$ depends on a system input $u_t$ and the past data history $d(t-1) \equiv (d_0, d_1, \ldots, d_{t-1})$, $d_\tau = (y_\tau, u_\tau)$, via a finite-dimensional regression vector $\psi_t$ only. The data vector $\Psi_t$ is the coupling of the modelled output $y_t$ and of the corresponding regression vector $\psi_t$. Prior information, labelled $d_0$, is attached to the observed sequence $d_1, \ldots, d_{t-1}$.
The participant estimates $\Theta$ in the Bayesian way, i.e., evaluates the posterior pdf
$$f(\Theta|d(t)) \propto f(\Theta) \prod_{\tau=1}^{t} m(\Psi_\tau, \Theta). \quad (1)$$
The symbol $\propto$ expresses equality up to the normalizing, data-dependent proportionality factor. The prior pdf $f(\Theta) \equiv f(\Theta|d_0)$ is related to the posterior pdf by the above version of the Bayes rule iff the parameter $\Theta$ is unknown to the input generator, i.e., $f(u_t|d(t-1), \Theta) = f(u_t|d(t-1))$ (Peterka, 1981).
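The following minimal Python sketch (not part of the original paper) illustrates the Bayes rule (1) numerically: a posterior over a discretized parameter grid is built by accumulating log-likelihoods. The first-order autoregression, the noise level and the grid are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
theta_true, r = 0.6, 0.1           # assumed "true" coefficient and noise variance
y = [0.0]
for _ in range(200):               # simulate data d(t); the input u_t is omitted
    y.append(theta_true * y[-1] + rng.normal(scale=np.sqrt(r)))
y = np.array(y)

grid = np.linspace(-1, 1, 401)     # discretized values of Theta
log_post = np.zeros_like(grid)     # flat prior f(Theta) on the grid
for t in range(1, len(y)):         # add ln m(Psi_t, Theta) for each data vector
    log_post += -0.5 * (y[t] - grid * y[t - 1]) ** 2 / r
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])        # normalize the posterior pdf
print("posterior mean of Theta:", (grid * post).sum() * (grid[1] - grid[0]))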
Another participant is assumed to deal with physically the same data $d(t)$ (possibly different realizations), to generate their joint pdf $f(d(t)) = \prod_{\tau=1}^{t} f(d_\tau|d(\tau-1))$ and to evaluate the marginal pdfs $M(\Psi_\tau)$ of the data vectors. For simplicity of presentation, we assume that this function is time invariant. The pdfs $f(d_t|d(t-1))$ can be, for instance, output predictors obtained via Bayesian estimation and prediction of a model which differs from $m(\Psi_t, \Theta)$. This participant provides its knowledge of $M(\Psi_t)$ to the former one. Another possibility is to interpret $M(\Psi_t)$ as additional information provided by an expert. The question arises how this information can be used for correcting the posterior pdf of $\Theta$. An answer to this question is the problem addressed within the paper.
3 SUFFICIENT STATISTIC FOR ANY PARAMETERIZED MODEL
The Bayesian parameter estimation is described by the Bayes rule (1). It can be rewritten as follows:
$$f(\Theta|d(t)) \propto f(\Theta) \exp\left[\sum_{\tau=1}^{t} \ln(m(\Psi_\tau, \Theta))\right] = f(\Theta) \exp\left[\int \sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) \ln(m(\Psi, \Theta))\, d\Psi\right]. \quad (2)$$
The expression $\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau)$, determined by the Dirac delta function, can be interpreted as a $t$-multiple of the "empirical" pdf on the set $\Psi^*$ of possible data vectors $\Psi$. A formally clean version is obtained by the correct interpretation of $\int \delta(\Psi - \Psi_\tau)\, g(\Psi)\, d\Psi$ as the linear functional assigning to a function $g(\Psi)$ its value at $\Psi_\tau$. The quotation marks at the term empirical distribution stress that, contrary to the traditional assumptions, the involved data vectors are statistically dependent.
The presented form of the posterior pdf has an important consequence: the number of data records together with the empirical pdf of data vectors form a sufficient statistic for the estimation of any parameterized model that deals with the data vectors $\{\Psi_t\}$. Furthermore, updating the posterior pdf $f(\Theta|d(t))$ by other data records, say $d_{t+1}, \ldots, d_{\bar{t}}$, is equivalent to adding the sufficient statistic corresponding to $d_{t+1}, \ldots, d_{\bar{t}}$ to the statistic $\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau)$.
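As a check of this observation, the following sketch verifies, for a discrete-valued data vector, that the log-likelihood computed record by record coincides with the one computed from the count table, i.e., from the $t$-multiple of the empirical pdf. The model table and the data are arbitrary illustrative assumptions.

from collections import Counter
import math

data_vectors = [(0, 1), (1, 1), (0, 1), (0, 0), (1, 1)]       # Psi_1 .. Psi_t
model = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}  # m(Psi, Theta) for one fixed Theta (assumed)

direct = sum(math.log(model[psi]) for psi in data_vectors)    # record-by-record sum
counts = Counter(data_vectors)                                # t * empirical pdf
via_statistic = sum(n * math.log(model[psi]) for psi, n in counts.items())
assert math.isclose(direct, via_statistic)                    # the same log-likelihood
print("log-likelihood:", direct)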
4 MERGING DATA-BASED KNOWLEDGE
The observations made in the previous section determine the way to incorporate the knowledge expressed by $M(\Psi)$ into the parametric estimation connected with the model $m(\Psi, \Theta)$. Taking the information $M(\Psi)$ as a pdf of, say, $\nu$ virtual observations, the sufficient statistic for the posterior pdf $f(\Theta|d(t), M, \nu)$, based on both real and virtual observations, is determined by $t + \nu$ data records with the pdf
$$\frac{1}{t+\nu} \sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) + \frac{\nu}{t+\nu}\, M(\Psi)$$
in the place of the empirical pdf. Note that the idea of virtual data is quite common, e.g. (Kárný et al., 2001). For instance, Bayesian estimation with a conjugate prior pdf is often interpreted as estimation with additional virtual data (determining the original prior) and a uniform prior pdf.
Contrary to $M(\Psi)$, the weight $\nu$ assigned to the information $M(\Psi)$ is not supposed to be given. Generally, it is subjectively assigned by the participant making the parametric estimation, and expresses the weight it gives to the participant serving as an information source.
Used in this way, we get the parameter estimate that respects both knowledge sources:
$$f(\Theta|d(t), M, \nu) \propto f(\Theta) \exp\left\{\int \left[\sum_{\tau=1}^{t} \delta(\Psi - \Psi_\tau) + \nu M(\Psi)\right] \ln(m(\Psi, \Theta))\, d\Psi\right\} \quad (3)$$
$$\propto f(\Theta|d(t)) \exp\left\{\nu \int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi\right\}.$$
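For a discrete data vector, the correction factor in (3) is directly computable. The sketch below, in which the candidate parameter tables, $M$ and $\nu$ are all illustrative assumptions, reweights a posterior over two candidate values of $\Theta$ by $\exp\{\nu \sum_\Psi M(\Psi) \ln m(\Psi, \Theta)\}$.

import math

M = {(0,): 0.7, (1,): 0.3}                       # partner's pdf of Psi (assumed)
candidates = {                                   # m(Psi, Theta) for two Theta values (assumed)
    "Theta_a": {(0,): 0.5, (1,): 0.5},
    "Theta_b": {(0,): 0.8, (1,): 0.2},
}
nu = 10.0                                        # weight given to the partner

prior = {name: 0.5 for name in candidates}       # f(Theta | d(t)) before merging
post = {name: prior[name]
              * math.exp(nu * sum(M[psi] * math.log(m[psi]) for psi in M))
        for name, m in candidates.items()}       # correction factor of (3)
z = sum(post.values())
print({name: p / z for name, p in post.items()}) # Theta_b, closer to M, gains weight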
Remarks
1. In the proposed method, the information $M(\Psi)$ is processed "data-like" in the following sense. Suppose that $M(\Psi)$ is an empirical density from $\nu$ data records, i.e., $M(\Psi) = \frac{1}{\nu} \sum_{\tau=1}^{\nu} \delta(\Psi - \Psi_\tau)$, and the data vectors $\Psi_1, \ldots, \Psi_\nu$ arise from a sequence of data $d(\nu)$. Then, $f(\Theta|M, \nu) = f(\Theta|d(\nu))$.
2. An intuitive way to use the information $M(\Psi)$ as $\nu$ data records is to generate $\nu$ random samples from $M(\Psi)$ and evaluate the posterior pdf with these samples. For sufficiently large $\nu$, such a posterior pdf is expected to be close to the posterior $f(\Theta|M, \nu)$, as the empirical distribution converges to the real one. However, for small $\nu$, the posterior pdf based on the random samples strongly depends on their realization, while $f(\Theta|M, \nu)$ is not influenced by any randomness.
3. The "merging" weights are controlled by the optional scalar $\nu > 0$.
4. It is worth stressing that the function $M(\Psi_t)$ is to be a joint pdf of the output $y_t$ and the regression vector $\psi_t$, similarly as in the case of independent $\Psi$s.
5 EXAMPLES IN EXPONENTIAL FAMILY
Let us consider a parameterized model in the exponential family (Barndorff-Nielsen, 1978)
$$m(\Psi, \Theta) = A(\Theta) \exp\langle B(\Psi), C(\Theta)\rangle, \quad (4)$$
where $A$, $B$, $C$ are known functions of the respective arguments: $A(\Theta) \geq 0$ is scalar, $B$, $C$ are vectorial functions of compatible dimensions, and $\langle B(\Psi), C(\Theta)\rangle$ is a functional linear in the first argument.
Let us suppose that the function $M(\Psi)$ well defines the expectation
$$V \equiv \int M(\Psi) B(\Psi)\, d\Psi. \quad (5)$$
Then, the factor modifying the prior pdf has the conjugated form
$$g(\Theta, \nu, V) \equiv A(\Theta)^{\nu} \exp\langle \nu V, C(\Theta)\rangle. \quad (6)$$
If the prior pdf is also chosen as a conjugated one,
$$f(\Theta) = \frac{g(\Theta, \bar{\nu}, \bar{V})}{I(\bar{\nu}, \bar{V})}, \quad I(\nu, V) = \int g(\Theta, \nu, V)\, d\Theta, \quad (7)$$
then the posterior pdfs have the same fixed functional form, given by $g(\Theta, \nu_t, V_t)$ with the statistics $\nu_t$, $V_t$ evolving as follows:
$$\nu_t = \nu_{t-1} + 1, \quad V_t = V_{t-1} + B(\Psi_t), \quad (8)$$
$$\nu_0 = \bar{\nu} + \nu, \quad V_0 = \bar{V} + \nu V.$$
Thus, the externally supplied pdf $M(\Psi)$ adds $\nu$ and $\nu V$ to the initial values of the statistics selected by the participant that runs the parameter estimation.
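In code, the merging therefore amounts to one extra initialization step before the standard conjugate recursion. A minimal sketch follows; the statistic map $B$ and all numbers below are placeholders, not quantities from the paper.

import numpy as np

def B(psi):                        # assumed sufficient-statistic map of the model (4)
    return np.array([psi, psi ** 2])

nu_bar, V_bar = 1.0, np.zeros(2)   # prior statistics chosen by the estimating participant
nu, V = 5.0, np.array([0.2, 1.1])  # partner's weight and moment V = int M(Psi) B(Psi) dPsi

nu_t, V_t = nu_bar + nu, V_bar + nu * V     # merged initial conditions of (8)
for psi in [0.4, -0.1, 0.9]:                # recursive update with observed data
    nu_t += 1.0
    V_t += B(psi)
print(nu_t, V_t)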
If the DM task allows us to wait for collecting the statistics $\bar{V}_t = \sum_{\tau=1}^{t} B(\Psi_\tau) + \bar{V}$ and $\bar{\nu}_t = t + \bar{\nu}$ for some realization of data vectors, it is possible to select the optimal weight $\nu^o$ by maximizing the corresponding posterior likelihood function:
$$\nu^o = \operatorname*{argmax}_{\nu} \frac{I(\nu + \bar{\nu}_t,\ \bar{V}_t + \nu V)}{I(\nu,\ \nu V)}. \quad (9)$$
If we cannot wait, several competitive values of $\nu$ have to be chosen and the corresponding posterior likelihoods compared in recursive mode.
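A possible realization of this comparison is sketched below for a Bernoulli-type model, where the normalization $I(\nu, V)$ reduces to a (multivariate) Beta function of the count vector; the collected counts and the pdf $M$ are illustrative assumptions.

from math import lgamma

def log_I(V):                        # ln of int prod_i Theta_i^(V_i - 1) dTheta
    return sum(lgamma(v) for v in V) - lgamma(sum(V))

V_bar_t = [40.0, 12.0]               # collected counts, prior included (assumed)
M = [0.75, 0.25]                     # partner's pdf of the binary data vector (assumed)

best = max([0.1, 1.0, 10.0, 100.0],  # candidate weights nu, compared as in (9)
           key=lambda nu: log_I([v + nu * m for v, m in zip(V_bar_t, M)])
                          - log_I([nu * m for m in M]))
print("selected weight:", best)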
The normal ARX model is the most prominent example of a dynamic model in the exponential family. It is described by the parameterized model
$$m(\Psi_t, \Theta) \equiv N_{y_t}(\theta'\psi_t, r) \quad (10)$$
$$= \underbrace{\frac{1}{\sqrt{2\pi r}}}_{A(\Theta)} \exp\Big\{ \operatorname{tr}\Big( \underbrace{\Psi_t \Psi_t'}_{B(\Psi_t)}\ \underbrace{\Big({-\frac{1}{2r}}\,[-1, \theta']'[-1, \theta']\Big)}_{C(\Theta)} \Big)\Big\},$$
where the trace term realizes the functional $\langle\cdot,\cdot\rangle$, $N_y(\mu, \rho)$ is the normal pdf with mean $\mu$ and variance $\rho$, the regression coefficients $\theta$ and the variance $r$ form the unknown parameter $\Theta$, $\operatorname{tr}(N)$ is the trace of a matrix $N$, and $'$ denotes transposition.
The marked correspondence with the exponential family shows that the moments needed in connection with $M(\Psi)$ are the non-central second moments of the data vector $\Psi$:
$$V = \int M(\Psi)\, \Psi\Psi'\, d\Psi. \quad (11)$$
The updating (8), describing completely the posterior pdfs in the conjugate Gauss-inverse-Wishart form, can be shown to be algebraically equivalent to the recursive least-squares algorithm (Peterka, 1981). The information from the second participant simply modifies its initial conditions. Their careful choice is known to influence substantially the transient behavior of the algorithm. Often, it is vital, especially in a closed decision-making (control) loop.
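The sketch below illustrates this for the ARX case: the moment matrix (11) is approximated by sample moments of draws representing $M$ (an assumption; any way of evaluating the integral would do) and used as the initial condition of the least-squares statistic.

import numpy as np

rng = np.random.default_rng(1)
# draws standing in for M(Psi); Psi = [y, psi']' with a single regressor (assumed)
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=1000)
V = samples.T @ samples / len(samples)   # sample version of the moment matrix (11)
nu = 20.0

V_t = nu * V                             # merged initial condition of the statistic
for Psi in [np.array([0.7, 1.0]), np.array([-0.2, -0.5])]:  # observed data vectors
    V_t += np.outer(Psi, Psi)            # updating (8) with B(Psi) = Psi Psi'

theta_hat = np.linalg.solve(V_t[1:, 1:], V_t[1:, 0])  # least-squares point estimate
print("theta estimate:", theta_hat)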
The controlled Markov chain is another example of a model describing dynamic systems well. It models discrete-valued outputs that depend on a discrete-valued regression vector by the table
$$f(y_t|u_t, d(t-1), \Theta) = m(\Psi_t, \Theta) \equiv \Theta_{y_t|\psi_t} = \exp\Big( \sum_{\Psi \in \Psi^*} \underbrace{\delta(\Psi - \Psi_t)}_{B_\Psi(\Psi_t)}\ \underbrace{\ln(\Theta_{y|\psi})}_{C_\Psi(\Theta)} \Big), \quad (12)$$
where the sum realizes the functional $\langle\cdot,\cdot\rangle$ and the entries $\Theta_{y|\psi}$ form the unknown parameter $\Theta$. The parameter belongs to a subset (determined possibly by some additional information) of the convex set
$$\Theta^* \equiv \Big\{ \Theta_{y|\psi} :\ \Theta_{y|\psi} \geq 0,\ \sum_{y \in y^*} \Theta_{y|\psi} = 1 \Big\}.$$
The externally supplied model $M(\Psi)$ simply assigns probabilities to the various possible values $\Psi \in \Psi^*$, and the factor modifying the prior pdf has the form
$$\exp\Big( \nu \sum_{\Psi \in \Psi^*} M(\Psi) \ln(\Theta_{y|\psi}) \Big) = \prod_{\Psi \in \Psi^*} \Theta_{y|\psi}^{\nu M(\Psi)}. \quad (13)$$
This expression is proportional to the conjugate Dirichlet pdf determined by the table $\nu M(\Psi)$, which
can be interpreted as the number of occurrences of the data vector $\Psi$. Choosing the prior pdf $f(\Theta)$ in the Dirichlet form $\prod_{\Psi \in \Psi^*} \Theta_{y|\psi}^{\bar{V}_{y|\psi} - 1}$, the externally supplied information increases it to the initial value $V_0 = \bar{V} + \nu M$. The posterior pdf is also a Dirichlet one, given by the occurrence table $V_t$. It evolves starting from the initial value $V_0$. The updating by the observed data, $V_t = V_{t-1} + B(\Psi_t)$, adds the number of occurrences of the values $\Psi_\tau = \Psi$, $\tau \leq t$, to the $\Psi$th entry of the table $V_0$.
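A sketch of these occurrence-table manipulations follows; the tables, $\nu$ and the data are illustrative assumptions.

import numpy as np

V_bar = np.ones((2, 2))                  # prior occurrence table; rows psi, columns y
M = np.array([[0.4, 0.1],                # partner's pdf M(Psi) over Psi* (sums to 1; assumed)
              [0.2, 0.3]])
nu = 8.0
V = V_bar + nu * M                       # initial value V_0 = V_bar + nu*M

for psi, y in [(0, 1), (1, 1), (0, 0)]:  # observed data vectors Psi_t = (y_t, psi_t)
    V[psi, y] += 1.0                     # updating V_t = V_{t-1} + B(Psi_t)

Theta_hat = V / V.sum(axis=1, keepdims=True)  # normalized table as a point estimate
print(Theta_hat)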
Again, the importance of the prior knowledge can hardly be over-stressed: the estimation of controlled Markov chains is formally extremely simple, but the dimension of the occurrence table $V$ grows exponentially with the cardinality of the set $\Psi^*$. Consequently, there is a lack of data in the majority of practical cases and, moreover, their information content is, as a rule, insufficient.
6 CONCLUSIONS
The presented result is simple yet a quite powerful and practical tool. Considering a parameterized model $m(\Psi, \Theta)$ from the exponential family and a conjugate prior pdf, the posterior pdf $f(\Theta|M, \nu)$ remains in the conjugate form, as it does in "proper" Bayesian estimation. The evaluation of $\int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi$ often reduces to the evaluation of moments of $\Psi$. Moreover, a simulation model of a quite different nature than the estimated one can be used for estimating $\int M(\Psi) \ln(m(\Psi, \Theta))\, d\Psi$. In this case, the use of $M(\Psi)$ is often reduced to the evaluation of sample moments of $\Psi$.
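For illustration, a sketch of this sample-moment evaluation follows; the simulator standing in for $M$ and the normal ARX model are our assumptions.

import numpy as np

rng = np.random.default_rng(2)

def simulate_Psi(n):                        # a stand-in for the partner's simulation model
    psi = rng.normal(size=n)
    y = 0.5 * psi + rng.normal(scale=0.3, size=n)
    return np.column_stack([y, psi])

def log_m(Psi, theta, r=0.09):              # ln m(Psi, Theta) of an assumed normal ARX model
    y, psi = Psi[:, 0], Psi[:, 1]
    return -0.5 * np.log(2 * np.pi * r) - 0.5 * (y - theta * psi) ** 2 / r

Psi = simulate_Psi(5000)
for theta in [0.3, 0.5, 0.7]:               # sample-moment estimate of the integral
    print(theta, log_m(Psi, theta).mean())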
More complex models, namely probabilistic mixtures, can be estimated by the proposed method using, e.g., a slightly modified quasi-Bayes algorithm (Kárný et al., 2005).
Physically motivated models, black-box models of a much higher order than the estimated one, relationships described by production rules stimulated by real data in the past, etc. may serve as data-vector sources. In this way, model simplification and translation between various knowledge domains are addressed in a justified, purposeful and simple way.
The choice of the weight $\nu$ is an algorithmically open problem. Its solution is, however, predictable: Bayesian hypothesis testing and the real data observed by the participant should provide a flexible universal solution.
Similarly, the case when information about only some entries of $\Psi_t$ is offered remains unsolved. It is expected that extending $M$ to the set $\Psi^*$ by a very flat marginal on the "non-reported" entries of $\Psi$ will solve this problem.
Even with these open problems pending, the proposed "technology" offers a straightforward way to pass uncertain knowledge from one participant to another and thus to combine very different knowledge sources.
ACKNOWLEDGEMENTS
This research has been partially supported by GAČR grant 102/03/0049, AVČR project BADDYR 1ET100750401, MŠMT grant 1M6798555601 DAR, and AVČR project 1ET100750404.
REFERENCES
Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Jiroušek, R. (2003). On experimental system for multidimensional model development MUDIN. Neural Network World, (5):513–520.
Kárný, M., Böhm, J., Guy, T., Jirsa, L., Nagy, I., Nedoma, P., and Tesař, L. (2005). Optimized Bayesian Dynamic Advising: Theory and Algorithms. Springer, London. To appear.
Kárný, M., Khailova, N., Nedoma, P., and Böhm, J. (2001). Quantification of prior information revised. International Journal of Adaptive Control and Signal Processing, 15(1):65–84.
Meneguzzo, D. and Vecchiato, W. (2004). Copula sensitivity in collateralized debt obligations and basket default swaps. Journal of Futures Markets, 24(1):37–70.
Peterka, V. (1981). Bayesian system identification. In Eykhoff, P., editor, Trends and Progress in System Identification, pages 239–304. Pergamon Press, Oxford.
Savage, L. (1954). Foundations of Statistics. Wiley, New York.