the control vector $\tau_a$. In other words, we would like to estimate the vector $b_t$ (equation 2) at time $t$ given all the observed data until time $t$, denoted $y_{1:t} \equiv \{y_1, \ldots, y_t\}$. In a tracking context, the model parameters associated with the current frame will be handed over to the next frame.
For each input frame $y_t$, the observation is simply the warped texture patch (the shape-free patch) associated with the geometric parameters $b_t$. We use the hat symbol for the tracked parameters and textures. For a given frame $t$, $\hat{b}_t$ represents the computed geometric parameters and $\hat{x}_t$ the corresponding shape-free patch, that is,

$$\hat{x}_t = x(\hat{b}_t) = W(y_t, \hat{b}_t) \tag{4}$$
The estimation of $\hat{b}_t$ from the sequence of images will be presented in the next section.
The appearance model associated with the shape-free facial patch at time $t$, $A_t$, is time-varying in that it models the appearances present in all observations $\hat{x}$ up to time $(t-1)$. We assume that the appearance model $A_t$ obeys a Gaussian with center $\mu$ and standard deviation $\sigma$ (variance $\sigma^2$). Notice that $\mu$ and $\sigma$ are vectors composed of $d$ components/pixels ($d$ is the size of $x$) that are assumed to be independent of each other. In summary, the observation likelihood at time $t$ is written as

$$p(y_t \mid b_t) = p(x_t \mid b_t) = \prod_{i=1}^{d} N(x_i; \mu_i, \sigma_i) \tag{5}$$
where $N(x; \mu_i, \sigma_i)$ is the normal density:

$$N(x; \mu_i, \sigma_i) = (2\pi\sigma_i^2)^{-1/2} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right] \tag{6}$$
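As an illustration of equations (5) and (6), the following minimal Python sketch evaluates the observation likelihood in the log domain, which avoids the numerical underflow that a product of $d$ small densities would cause. The function name `log_likelihood` and the three-pixel toy values are assumptions made for this example; they are not taken from the original implementation.

```python
import math


def log_likelihood(x, mu, sigma):
    """Log of equation (5): sum over pixels of log N(x_i; mu_i, sigma_i)."""
    if not (len(x) == len(mu) == len(sigma)):
        raise ValueError("x, mu and sigma must all have length d")
    total = 0.0
    for xi, mi, si in zip(x, mu, sigma):
        z = (xi - mi) / si  # normalized residual of pixel i
        # log of equation (6) for one pixel
        total += -0.5 * math.log(2.0 * math.pi * si * si) - 0.5 * z * z
    return total


# Toy 3-pixel patch (made-up values, for illustration only).
mu = [0.5, 0.2, 0.8]
sigma = [0.1, 0.1, 0.2]
ll_at_mean = log_likelihood(mu, mu, sigma)
ll_off_mean = log_likelihood([0.6, 0.2, 0.8], mu, sigma)
```

Moving one pixel one standard deviation away from its mean lowers the log-likelihood by 0.5, so `ll_at_mean` exceeds `ll_off_mean` by that amount.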
We assume that $A_t$ summarizes the past observations under an exponential envelope, that is, the past observations are exponentially forgotten with respect to the current texture. When the appearance is tracked for the current input image, i.e., the texture $\hat{x}_t$ is available, we can compute the updated appearance and use it to track in the next frame.
It can be shown that the appearance model parameters, i.e., $\mu$ and $\sigma$, can be updated using the following equations (see (Jepson et al., 2003) for more details on Online Appearance Models):

$$\mu_{t+1} = (1 - \alpha)\,\mu_t + \alpha\,\hat{x}_t \tag{7}$$

$$\sigma^2_{t+1} = (1 - \alpha)\,\sigma^2_t + \alpha\,(\hat{x}_t - \mu_t)^2 \tag{8}$$
In the above equations, all $\mu$'s and $\sigma^2$'s are vectorized and the operation is element-wise. This technique, also called recursive filtering, is simple and time-efficient, and is therefore suitable for real-time applications. The appearance parameters reflect the most recent observations within a roughly $L = 1/\alpha$ window with exponential decay.
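To see where the $L = 1/\alpha$ window comes from, equation (7) can be unrolled over the last $n$ frames. The derivation below is a short worked step added for clarity; it assumes a constant $\alpha$ over those frames.

```latex
\mu_{t+1}
  = \alpha \sum_{k=0}^{n-1} (1-\alpha)^{k}\, \hat{x}_{t-k}
  \;+\; (1-\alpha)^{n}\, \mu_{t-n+1}
```

The patch observed $k$ frames ago therefore carries the weight $\alpha (1-\alpha)^k$. For small $\alpha$, $(1-\alpha)^{1/\alpha} \approx e^{-1} \approx 0.37$, so a patch that is $L = 1/\alpha$ frames old contributes about 37% of the weight of the newest patch, and older patches fade geometrically.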
Note that $\mu$ is initialized with the first patch $\hat{x}_0$. In order to get stable values for the variances, equation (8) is not used with a constant $\alpha$ until the number of frames reaches a given value (e.g., the first 40 frames). For these frames, the classical variance is used, that is, equation (8) is used with $\alpha$ being set to $1/t$.
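The update rules (7) and (8), together with the warm-up period described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the original C code: the class name `OnlineAppearanceModel`, the default `alpha=0.05`, the zero initial variances, and the choice to apply the warm-up rate $1/n$ (with $n$ the number of patches seen so far) to both the mean and the variance are all hypothetical.

```python
class OnlineAppearanceModel:
    """Per-pixel Gaussian appearance with exponential forgetting (eqs. 7, 8)."""

    def __init__(self, first_patch, alpha=0.05, warmup_frames=40):
        self.mu = list(first_patch)          # mu is initialized with the first patch
        self.var = [0.0] * len(first_patch)  # assumption: variances start at zero
        self.alpha = alpha                   # assumption: illustrative forgetting rate
        self.warmup_frames = warmup_frames   # e.g. the first 40 frames
        self.frames_seen = 1                 # the first patch counts as one frame

    def update(self, patch):
        """Fold one tracked shape-free patch into the model, element-wise."""
        if len(patch) != len(self.mu):
            raise ValueError("patch must have the same size d as the model")
        self.frames_seen += 1
        if self.frames_seen <= self.warmup_frames:
            a = 1.0 / self.frames_seen       # warm-up: running-average rate
        else:
            a = self.alpha                   # steady state: constant forgetting rate
        new_mu, new_var = [], []
        for x, m, v in zip(patch, self.mu, self.var):
            new_mu.append((1.0 - a) * m + a * x)               # equation (7)
            new_var.append((1.0 - a) * v + a * (x - m) ** 2)   # equation (8), uses mu_t
        self.mu, self.var = new_mu, new_var


# Toy two-pixel example with a short warm-up (made-up values).
model = OnlineAppearanceModel([0.0, 1.0], alpha=0.1, warmup_frames=3)
model.update([2.0, 1.0])
model.update([4.0, 1.0])
```

After these two updates the mean of the first pixel is the plain average of 0, 2 and 4, and the second pixel, which never changed, keeps a zero variance. A pixel with zero variance cannot be used as a divisor, so an implementation would likely need a small lower bound on the variance; the source does not discuss this.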
Here we used a single Gaussian to model the appearance of each pixel in the shape-free patch. However, the appearance can also be modeled with Gaussian mixtures at the expense of some additional computational load (e.g., see (Zhou et al., 2004; Lee, 2005)).
4 TRACKING USING ADAPTIVE APPEARANCE REGISTRATION
We consider the state vector $b = [\theta_x, \theta_y, \theta_z, t_x, t_y, t_z, \tau_a^T]^T$ encapsulating the 3D head pose and the facial actions. In this section, we will show how this state can be recovered for time $t$ from the previous known state $\hat{b}_{t-1}$ and the current input image $y_t$.
The sought geometrical parameters $b_t$ at time $t$ are related to the previous parameters by the following equation ($\hat{b}_{t-1}$ is known):

$$b_t = \hat{b}_{t-1} + \Delta b_t \tag{9}$$
where $\Delta b_t$ is the unknown shift in the geometric parameters. This shift is estimated using a region-based registration technique that does not need any image feature extraction. In other words, $\Delta b_t$ is estimated such that the warped texture will be as close as possible to the facial appearance $A_t$. For this purpose, we minimize the Mahalanobis distance between the warped texture and the current appearance mean,

$$\min_{b_t} e(b_t) = \min_{b_t} D(x(b_t), \mu_t) = \sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 \tag{10}$$
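The link between criterion (10) and the likelihood (5) follows from taking the negative logarithm of (5) with the density (6). The step below is added as a worked derivation; it assumes that $\mu$ and $\sigma$ are held fixed while frame $t$ is being registered.

```latex
-2 \log p(x_t \mid b_t)
  = \sum_{i=1}^{d} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^{2}
  + \sum_{i=1}^{d} \log\!\left( 2\pi \sigma_i^{2} \right)
```

The second sum does not depend on $b_t$, so the $b_t$ that minimizes $e(b_t)$ is also the one that maximizes $p(x_t \mid b_t)$.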
The above criterion can be minimized using an iterative first-order linear approximation, which is equivalent to a Gauss-Newton method. It is worth noting that the minimization is equivalent to maximizing the likelihood measure given by (5). Moreover, the above optimization is carried out using the Huber function (Dornaika and Davoine, 2004). In the above optimization, the gradient matrix $\frac{\partial W(y_t, b_t)}{\partial b_t} = \frac{\partial x_t}{\partial b_t}$ is computed for each frame and is approximated by numerical differences, similarly to the work of Cootes (Cootes et al., 2001).
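The registration loop can be sketched in a few lines of Python. This is a minimal illustration of Gauss-Newton with a forward-difference Jacobian applied to criterion (10), under several assumptions: `toy_warp` is a hypothetical two-parameter stand-in for the real warp $W(y_t, b_t)$, the step size `1e-4` and the fixed iteration count are arbitrary, and the Huber weighting mentioned above is omitted for brevity.

```python
import math


def residuals(warp, b, mu, sigma):
    """Normalized residuals r_i = (x_i(b) - mu_i) / sigma_i, so e(b) = sum r_i^2."""
    return [(xi - mi) / si for xi, mi, si in zip(warp(b), mu, sigma)]


def numerical_jacobian(warp, b, mu, sigma, step=1e-4):
    """Forward differences; cols[j][i] approximates d r_i / d b_j."""
    r0 = residuals(warp, b, mu, sigma)
    cols = []
    for j in range(len(b)):
        shifted = list(b)
        shifted[j] += step
        rj = residuals(warp, shifted, mu, sigma)
        cols.append([(a - c) / step for a, c in zip(rj, r0)])
    return r0, cols


def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for c in range(k, n + 1):
                M[i][c] -= f * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (M[k][n] - s) / M[k][k]
    return z


def gauss_newton(warp, b_prev, mu, sigma, iterations=10):
    """Start from the previous state and repeatedly apply b <- b + delta_b."""
    b = list(b_prev)
    n = len(b)
    for _ in range(iterations):
        r, J = numerical_jacobian(warp, b, mu, sigma)
        JtJ = [[sum(u * v for u, v in zip(J[p], J[q])) for q in range(n)]
               for p in range(n)]
        Jtr = [sum(u * v for u, v in zip(J[p], r)) for p in range(n)]
        delta = solve(JtJ, [-g for g in Jtr])  # normal equations
        b = [bi + di for bi, di in zip(b, delta)]
    return b


# Hypothetical 4-pixel "warp" driven by two parameters (not the paper's W).
U = [0.0, 1.0, 2.0, 3.0]


def toy_warp(b):
    return [b[0] * math.exp(-b[1] * u) for u in U]


mu = toy_warp([1.0, 0.5])   # appearance mean generated by the true parameters
sigma = [0.1] * 4
b_hat = gauss_newton(toy_warp, [0.9, 0.4], mu, sigma)
```

Starting from the nearby state `[0.9, 0.4]`, which plays the role of $\hat{b}_{t-1}$, the loop recovers parameters close to `[1.0, 0.5]`. In the tracker the state has more components (the six pose parameters plus the facial actions), but the structure of the iteration is the same.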
On a 3.2 GHz PC, a non-optimized C code of the approach computes the 3D head pose and the six facial actions in 50 ms. About half that time is required
VISAPP 2006 - MOTION, TRACKING AND STEREO VISION