A NEURAL NETWORK APPROACH TO BAYESIAN BACKGROUND

MODELING FOR VIDEO OBJECT SEGMENTATION

Dubravko Culibrk

∗

, Oge Marques

, Daniel Socek

†

Department of Computer Science and Engineering, Florida Atlantic University

Boca Raton FL 33431, USA

Hari Kalva

, Borko Furht

‡

Department of Computer Science and Engineering, Florida Atlantic University

Boca Raton FL 33431, USA

Keywords:

Video processing, Object segmentation, Background modeling, Bayesian modeling, Neural Networks.

Abstract:

Object segmentation from a video stream is an essential task in video processing and forms the foundation

of scene understanding, object-based video encoding (e.g. MPEG4), and various surveillance and 2D-to-

pseudo-3D conversion applications. The task is difﬁcult and exacerbated by the advances in video capture

and storage. Increased resolution of the sequences requires development of new, more efﬁcient algorithms for

object detection and segmentation. The paper presents a novel neural network based approach to background

modeling for motion based object segmentation in video sequences. The proposed approach is designed to

enable efﬁcient, highly-parallelized hardware implementation. Such a system would be able to achieve real

time segmentation of high-resolution sequences.

1 INTRODUCTION

Object detection and segmentation from a video

stream are essential tasks in video processing and

form the foundation of scene understanding, object-

based video encoding (e.g. MPEG4), various surveil-

lance applications, as well as the emerging research

into 2D-to-pseudo-3D video conversion. Increased

complexity of the sequences requires development of

new, more efﬁcient algorithms for object detection

and segmentation.

Commonly used approach to extract foreground

objects from the image sequence is through back-

ground suppression (Haritaoglu et al., 2000)(Stauf-

fer and Grimson, 2000), when the video is grabbed

from a stationary camera. However, the task be-

comes difﬁcult when the background contains shad-

ows and moving objects, and undergoes illumina-

tion changes. A number of proposed techniques are

able to achieve real-time processing of comparatively

small video formats(e.g. 120x160 pixels, CIF res-

olution) and, usually, at somewhat reduced frame

rates. It is unlikely, however, that the existent ob-

ject detection approaches will be able to efﬁciently

cope with the increase in the resolution of video se-

quences. The development of a parallelized object

detection approach, which would allow for efﬁcient

hardware implementation and object detection in real-

time for high-complexity video sequences (in terms of

the frame size as well as background changes), is the

focus of this paper.

The proposed solution employs a feed-forward

neural network to achieve background subtraction. To

this end, a new neural network structure is designed,

serving both as an adaptive Bayesian model of the

background in a video sequence and an algorithm for

background subtraction and foreground object detec-

tion and segmentation. Neural networks posses intrin-

sic parallelism which can be exploited in a suitable

hardware implementation to achieve fast segmenta-

tion of foreground objects.

The rest of the paper is organized as follows: Sec-

tion 2 provides a survey of related published work.

Section 3 describes the main aspects of the proposed

approach. Section 4 is dedicated to the presentation of

simulation results. Section 5 contains the conclusions

and some directions for future work.

2 RELATED WORK

Some of the early object segmentation methods deal-

ing with the instances of non-stationary background

were based on smoothing the color of a background

pixel over time using different ﬁltering techniques

such as Kalman ﬁlters (Karmann and von Brandt,

474

Culibrk D., Marques O., Socek D., Kalva H. and Furht B. (2006).

A NEURAL NETWORK APPROACH TO BAYESIAN BACKGROUND MODELING FOR VIDEO OBJECT SEGMENTATION.

In Proceedings of the First International Conference on Computer Vision Theory and Applications, pages 474-479

DOI: 10.5220/0001374604740479

 SciTePress

1990), or Gabor ﬁlters (Jain et al., 1997) to create a

reference background frame. It is used to segment the

foreground objects by subtracting the reference frame

from the current frame of the input sequence. How-

ever, these methods are not particularly effective for

sequences with high-frequency background changes.

Slightly better results were reported for techniques

that rely on a Gaussian-based statistical model whose

parameters are recursively updated in order to fol-

low gradual background changes within the video se-

quence(Boult et al., 1999). More recently, this model

was signiﬁcantly improved by employing a Mixture

of Gaussians (MoG), where the values of the pixels

from background objects are described by multiple

Gaussian distributions(Stauffer and Grimson, 2000).

This model was considered promising since it showed

good foreground object segmentation results for many

outdoor sequences. However, weaker results were

reported (Li et al., 2004) for video sequences con-

taining non-periodical background changes (e.g. due

to waves and water surface illumination, cloud shad-

ows, and similar phenomena). These models are para-

metric in the sense that they incorporate underlying

assumptions about the probability density functions

(PDFs) they are trying to estimate.

In 2003, Li et al. proposed a method for fore-

ground object detection employing a Bayes decision

framework (Li et al., 2004). The method has shown

promising experimental object segmentation results

for sequences containing complex variations and non-

periodical movements in the background. In addi-

tion to the generic nature of the algorithm where

no a priori assumptions about the scene are neces-

sary, the authors claim that their algorithm can han-

dle a throughput of about 15 fps for CIF video res-

olution. The approach is speciﬁc in the fact that it

uses a statistical model of for the changes between

the current frame and the reference background image

maintained by applying an Inﬁnite Impulse Response

(IIR) ﬁlter to the sequence. A Bayesian classiﬁer is

then used to classify the changes, detected through

frame differencing between the current frame and the

reference frame, as pertinent to background objects

or foreground objects. The statistical model is non-

parametric since it does not impose any speciﬁc shape

to the PDFs learned. The model is general in terms of

features extracted from the sequence and they experi-

mented with the use of different features. The results

of these experiments are reported in (Li et al., 2004).

Recently the approach of Li et al. has been adopted

and extended to create a part of a surveillance sys-

tem intended for maritime environments (Socek et al.,

2005).

While the use of Bayesian models as bases for

background subtraction is not new, it has been limited

by the fact that they are general in the sense that they

impose no constraints on the shape of the estimated

probability density function. This typically makes

them more computationally expensive than most of

their more restrictive counterparts (e.g.(Boult et al.,

1999)(Stauffer and Grimson, 2000)). However, mov-

ing away from the particle estimator systems used

typically to estimate probability density functions in

the Bayesian models (Li et al., 2004) to neural net-

works, it is possible to make them suitable for parallel

execution and increase their effectiveness.

Classical Probabilistic Neural Network

(PNN)(Specht, 1990) architecture has been used

by researchers to improve the object segmen-

tation(Doulamis et al., 2003) and perform the

classiﬁcation of segmented objects (Azimi-Sadjadi

et al., 2001). In both solutions the neural network is a

supervised learning classiﬁer guided by a a different

supervisor classiﬁer algorithm.

In (Doulamis et al., 2003) authors present an unsu-

pervised video object (VO) segmentation and tracking

algorithm based on an adaptive neural-network archi-

tecture. Object tracking is handled as a classiﬁcation

problem and implemented through an adaptive net-

work classiﬁer, which, however, relies on the results

of the initial video object segmentation module to ad-

just itself to the variations of the sequence. The neural

network is but a part of the background segmentation

algorithm. Hence, the whole system does not posses

inherent parallelism of the PNN. As such, the system

is not suitable to serve as basis of an efﬁcient hard-

ware implementation.

An approach employing a PNN classiﬁer in a time

varying environment is proposed in (Azimi-Sadjadi

et al., 2001). A PNN was used to classify clouds

based on their spectral and temperature features in the

visible and infrared GOES 8 (Geostationary Opera-

tional Environmental Satellite) imagery data. A tem-

poral updating approach for the PNN was developed

to increase the classiﬁcation accuracy by accounting

for the temporal changes in the data. The network it-

self is a supervised leaner and is updated every time

a new frame is processed. As in the approach of

(Doulamis et al., 2003), the PNN is a submodule of

the system, and its parallelism can only partially be

exploited in a hardware implementation.

3 BACKGROUND MODELING

NEURAL NETWORK (BNN)

The proposed background modeling and subtraction

approach relies on a novel adaptive neural network.

The architecture employs an adapted General Regres-

sion Neural Network (GRNN) (Specht, 1991) com-

ponent, to serve as an estimator of the PDF of cer-

tain features belonging to background. GRNNs, typi-

cally used as Bayesian classiﬁers, are supervised clas-

A NEURAL NETWORK APPROACH TO BAYESIAN BACKGROUND MODELING FOR VIDEO OBJECT

SEGMENTATION

475

Figure 1: Structure of Background Modeling Neural Network.

siﬁers, requiring a training set. However, in the do-

main of background modeling it was possible to ex-

tend them to form new neural network architecture

which is an unsupervised learner. This Background

Modeling Neural Network (BNN) is suitable to serve

both as a statistical model of the background at each

pixel position in the video sequences and highly par-

allelized background subtraction algorithm. The de-

sign of BNN relies on a basic background modeling

idea: feature values corresponding to background ob-

ject will occur most of the time, i.e. more often than

those pertinent to the foreground.

Three tasks, typical for probabilistic background

modeling (Stauffer and Grimson, 2000)(Li et al.,

2004), which BNN should perform have been iden-

tiﬁed:

1. Storing the values of the features and learning the

probability with which each value corresponds to

background / foreground.

2. Determining the state in which new feature values

should be introduced into the model (i.e. when the

statistics already learned are insufﬁcient to make a

decision).

3. Determining which stored feature value should be

replaced with the new values.

The two latter requirements are consequences of

the fact that real systems are limited in terms of the

number of feature values that can be stored to achieve

efﬁcient performance.

The structure of BNN, shown in Figure 1, has three

distinct subnets. The classiﬁcation subnet is a GRNN

(Specht, 1991). It is a central part of BNN concerned

with approximating the PDF of pixel feature values

belonging to background/foreground. The GRNN is

a neural network implementation of a Parzen estima-

tor (Parzen, 1962). This class of PDF estimators as-

ymptotically approaches the underlying parent den-

sity, provided that it is smooth and continuous.

The classiﬁcation subnet contains three layers of

neurons. Input neurons of this network simply map

the inputs of the network, which are the values of the

features for a speciﬁc pixel. The output of the pattern

neurons is a nonlinear function of Euclidean distance

between the input of the network and the stored pat-

tern for that speciﬁc neuron. The nonlinear function

used is as proposed by Parzen. The only parameter

of this subnet is a so-called smoothing parameter (σ)

used to determine the shape of the nonlinear function.

The structure of a pattern neuron is shown in Figure

2. The output of the summation units of the classiﬁca-

...

-(W-X)(W-X)/2s

(W-X)(W-X)

Figure 2: Pattern neuron of GRNN.

tion subnet is the sum of their inputs. The subnet has

two summation neurons: one to calculate the prob-

ability of pixel values belonging to background and

the other for calculating the probability of belonging

VISAPP 2006 - IMAGE ANALYSIS

476

to foreground.

The classiﬁcation subnet requires no training to

store the patterns (feature values) representative of

background. This is accomplished simply by setting

the weights of the connections between the input and

pattern neurons to the value of the features of the pat-

tern to be stored. The classiﬁcation subnet diverges

from GRNN in the way the weights between the pat-

tern and summation neurons are determined. These

values are used to store the conﬁdence with which a

pattern belongs to the background/foreground. The

weights of these connections are updated with each

new value of a pixel at a certain position received (i.e.

with each frame), according to the following recursive

equations:

t+1

=(1− β) ∗ W

(1)

t+1

=(1− β) ∗ W

+ β (2)

when the maximum response is that of the i-th neuron,

and

t+1

=(1− β) ∗ W

+ β (3)

t+1

=(1− β) ∗ W

(4)

if the maximum response is not that of the j-th neuron,

where:

• W

- value of the weight between the i-th pattern

neuron and the background summation neuron at

time t,

• W

- value of the weight between the i-th pattern

neuron and the foreground summation neuron at

time t,

• β - learning rate.

Equations 1-4 express the notion that whenever an

instance pertinent to a pattern neuron is encountered,

the probability that that pattern neuron is activated by

a feature value vector belonging to the background is

increased. Naturally, if that is the case, the proba-

bility that the pattern neuron is excited by a pattern

belonging to foreground is decreased. Vice versa, the

more seldom a feature vector value corresponding to a

pattern neuron is encountered the more likely it is that

the patterns represented by it belong to foreground ob-

jects. By adjusting the learning rates, it is possible to

control the speed of the learning process.

The output of the classiﬁcation subnet indicates

whether the output of the background summation neu-

ron is higher than that of the foreground summation

neuron, i.e. that it is more probable that the input fea-

ture value is due to a background object rather than a

foreground object.

The activation and replacement subnets are

Winner-Take-All (WTA) neural networks. A WTA

network is a parallel and fast way to determine mini-

mum or the maximum of a set of values, consistent

with the task of doing so within a neural-network

based solution. In particular, these subnets are ex-

tensions of one-layer feedforward MAXNET (1LF-

MAXNET) proposed in (Kwan, 1992).

The activation subnet performs a dual function: it

determines which of the neurons of the network has

maximum activation (output) and whether that value

exceeds a threshold provided as a parameter to the al-

gorithm. If it does not, the BNN is considered inactive

and replacement of a pattern neuron’s weights with

the values of the current input vector is required. If

this is the case, the feature is considered to belong to

a foreground object.

The ﬁrst layer of this network has the structure of a

1LF-MAXNET network and a single neuron is used

to indicate whether the network is active. The output

of the neurons of the ﬁrst layer of the network can be

expressed in the form of the following equation:

= X



i=1

{F (X

− X

|i = j)} (5)

where:

F (z)=



1, if z ≥ 0;

0, if z<0;

(6)

The output of the ﬁrst layer of the activation subnet

will differ from 0 only for the neurons with maxi-

mum activation and will be equal to the maximum ac-

tivation. In Figure 1 these outputs are indicated with

,,Z

. A single neuron in the second layer of the

activation subnet is concerned with detecting whether

the BNN is active or not and its function can be ex-

pressed in the form of the following equations:

NA = F (



i=1

− θ) (7)

where F is given by Equation 6 and θ is the activation

threshold, which is provided to the network as a para-

meter. Finally, the replacement subnet in Figure 1 can

be viewed as a separate neural net with the unit input.

However, it is inextricably related to the classiﬁcation

subnet since each of the replacement subnet ﬁrst-layer

neurons is connected with the input via synapses that

have the same weight as the two output synapses be-

tween the pattern and summation neurons of the clas-

siﬁcation subnet. Each pattern neuron has a corre-

sponding neuron in the replacement net. The func-

tion of the replacement net is to determine the pattern

neuron that minimizes the criterion for replacement,

expressed by the following equation:

replacement

criterion = W

+ |W

− W

| (8)

The criterion is a mathematical expression of the idea

that those patterns that are least likely to belong to the

background and those that provide least conﬁdence

to make the decision should be eliminated from the

model.

A NEURAL NETWORK APPROACH TO BAYESIAN BACKGROUND MODELING FOR VIDEO OBJECT

SEGMENTATION

477

(a) (b) (c)

Figure 3: Results obtained for representative frames: (a) frame from the original sequence, (b) segmentation results obtained

for the frame shown, (c) segmentation result with morphological enhancement.

The neurons of the ﬁrst layer calculate the negated

value of the replacement criterion for the pattern neu-

ron they correspond to. The second layer is a 1LF-

MAXNET that yields non-zero output corresponding

to the pattern neuron to be replaced.

To form a complete background-subtraction solu-

tion a single instance of a BNN is used to model the

features at each pixel of the image.

4 EXPERIMENTS AND RESULTS

The approach is intended to serve as basis for the de-

sign of a hardware component, which would be able

to exploit its highly parallel nature. However, to eval-

uate the approach, a simulation application, which

can be run on a typical PC, has been developed. While

this simulation is sequential in its execution and can-

not provide a valid estimate of the speed of the target

hardware system, it can demonstrate the segmentation

ability of the system.

The approach has been tested on a number of mar-

itime environment sequences. All of the sequences

used in the experiments are currently available at the

following link: http://sasi.cse.fau.edu/ccst/sequences.

Primary sequence used for testing is that of a port in-

let, taken by a static camera. The sequence consists

of 18230 frames of 720x480 pixels, corresponding to

a bit more than 10 minutes of recording at 30 frames

per second. It contains a large number of diverse ves-

sels, in terms both of color and size, moving at differ-

ent rates, in different directions and at different dis-

tance from the camera. The sequence is also complex

in terms of background changes related to the water-

surface.

A representative frame from the sequence as well

as the result of segmentation are given in Figure 3.

Grey pixels correspond to the foreground. Figure 3(c)

shows the same segmentation result when a morpho-

logical open and then morphological close operation

are applied. The support region for the operations is a

two pixel wide square.

A detail of the ﬁrst frame of the original sequence

(shown in ﬁgure 3(a)) containing several small objects

as well as the segmentation result for that part of the

frame with and without morphological operations ap-

plied is shown enlarged in Figure 4. Light grey pixels

are classiﬁed as foreground due to the BNNs recog-

nizing that these are new values not yet stored, while

the dark grey ones are stored but classiﬁed as fore-

ground based on the learned PDFs.

The neural networks used in the experiments are

fairly simple. The simulation application implements

BNNs containing 20 processing, two summation and

one output neuron per pixel in the classiﬁcation sub-

net. The activation and replacement subnet attribute

for additional 20, i.e. 41 processing units respectively,

bringing up the total of neurons used per pixel to 84.

The input neurons of the classiﬁcation shown in Fig-

ure 1 just map the input to the output and need not be

implemented as such.

The learning rate (β) of the networks was set to

0.005 and the smoothing parameter (σ) for the clas-

siﬁcation subnet used was set to 10. The activation

threshold (θ) of the activation subnet was set to 0.95.

The performance of the simulation application al-

lows for efﬁcient experimenting. It is capable of

processing a single frame of size 720×480 in 2.25

seconds on average, which translates to 8 frames of

160×120 pixels per second or 2.2 frames per second

(fps) for images sized 320×240 pixels, on a 3.0 GHz

Pentium IV based system.

5 CONCLUSION AND FURTHER

RESEARCH

Object segmentation is a fundamental task in several

important domains of video processing. The com-

plexity of captured and stored video material is on

the rise and current motion based segmentation algo-

rithms are not capable of handling high-resolution se-

quences in real time. The possibility of resolving this

problem through a highly parallelized approach is the

VISAPP 2006 - IMAGE ANALYSIS

478

(a) (b) (c)

Figure 4: Details of: (a) a frame of the original sequence, (b) segmentation result and (c) segmentation result enhanced using

morphological operations.

focus of the research presented.

The research resulted in a new motion based ob-

ject segmentation and background modeling algo-

rithm, proposed here. It is parallelized at sub-pixel

level. The basis of the approach is employment of

a novel neural network architecture designed speciﬁ-

cally to serve as a model of background in video se-

quences and a Bayesian classiﬁer to be used for ob-

ject segmentation. The new Background Modeling

Neural Network is an unsupervised classiﬁer, differ-

ing from the approaches published before. The pro-

posed model is independent of the features used and

general since it does not impose restrictions in terms

of the probability density functions estimated.

A PC based system has been developed to evalu-

ate the algorithm using a complex maritime sequence.

The results obtained through these experiments are il-

lustrated in the paper via representative frames.

Full exploitation of the algorithm’s parallelism can

be achieved only if the system is implemented in hard-

ware, allowing for highly-parallelized execution.

Future work will proceed in several directions: Use

of features different than RGB values will be explored

to evaluate the impact of the choice of features on the

performance of the system. Methods to enhance the

segmentation, other than morphological transforma-

tions, will be explored (e.g. single frame color-based

segmentation, depth cues from stereo sequences). Fi-

nally, development of a FPGA based system which

would achieve real time segmentation of HDTV and

QuadHDTV sequences will be explored.

REFERENCES

Azimi-Sadjadi, M. R., Gao, W., Haar, T. H. V., and Reinke,

D. (2001). Temporal updating scheme for probabilis-

tic neural network with application to satellite cloud

classiﬁcation-further results. In IEEE Trans. Neural

Networks, vol. 12, pp. 1196-1203.

Boult, T., Micheals, R., X.Gao, Lewis, P., Power, C., Yin,

W., and Erkan, A. (1999). Frame-rate omnidirec-

tional surveillance and tracking of camouﬂaged and

occluded targets. In Proc. of IEEE Workshop on Vi-

sual Surveillance, pp. 48-55.

Doulamis, A., Doulamis, N., Ntalianis, K., and Kollias,

S. (2003). An efﬁcient fully unsupervised video ob-

ject segmentation scheme using an adaptive neural-

network classiﬁer architecture. In IEEE Trans. On

Neural Networks, vol. 14, pp. 616-630.

Haritaoglu, I., Harwood, D., and Davis, L. (2000). W4

real-time surveillance of people and their activities.

In IEEE Trans. Pattern Analysis and Machine Intel-

ligence, vol. 22, pp. 809-830.

Jain, A., Ratha, N., and Lakshmanan, S. (1997). Object

detection using gabor ﬁlters. In Journal of Pattern

Recognition, vol. 30, pp. 295-309.

Karmann, K. P. and von Brandt, A. (1990). Moving ob-

ject recognition using an adaptive background mem-

ory. In Timevarying Image Processing and Moving

Object Recognition, 2, pp. 297-307. Elsevier Publish-

ers B.V.

Kwan, H. K. (1992). One-layer feedforward neural network

fast maximum/minimum determination. In Electron-

ics Letters, pp. 1583-1585.

Li, L., Huang, W., Gu, I., and Tian, Q. (2004). Statistical

modeling of complex backgrounds for foreground ob-

ject detection. In IEEE Trans. Image Processing, vol.

13, pp. 1459-1472.

Parzen, E. (1962). On estimation of a probability density

function and mode. In Ann. Math. Stat., Vol. 33, pp.

1065-1076.

Socek, D., Culibrk, D., Marques, O., Kalva, H., and Furht,

B. (2005). A hybrid color-based foreground object de-

tection method for automated marine surveillance. In

Proc. of the Advanced Concepts for Intelligent Vision

Systems Conference (ACIVS 2005).

Specht, D. F. (1990). Probabilistic neural networks. In

Neural Networks, vol. 3, pp. 109-118.

Specht, D. F. (1991). A general regression neural network.

In IEEE Trans. Neural Networks, pp. 568-576.

Stauffer, C. and Grimson, W. (2000). Learning patterns of

activity using real-time tracking. In IEEE Trans. Pat-

tern Analysis and Machine Intelligence, vol. 22, pp.

747-757.

A NEURAL NETWORK APPROACH TO BAYESIAN BACKGROUND MODELING FOR VIDEO OBJECT

SEGMENTATION

479