REAL-TIME LIPTRACKING FOR SYNTHETIC FACE ANIMATION
WITH FEEDBACK LOOP
Franck Luthon
University of Pau and Adour River, Computer Science Lab. LIUPPA
IUT Château Neuf, Place Paul Bert, 64100 Bayonne, France
Brice Beaumesnil
University of Pau and Adour River, Computer Science Lab. LIUPPA
IUT Château Neuf, Place Paul Bert, 64100 Bayonne, France
Keywords:
Segmentation, Closed-Loop, Hue, Motion, Snake, Active Contour, Talking Head, 3D-Model.
Abstract:
This article deals with facial segmentation and liptracking with feedback control for real-time animation of
a synthetic 3D face model. Straightforward approaches consist of two successive steps: video analysis, then synthesis. Our approach departs from the previous ones in that we build a global analysis/synthesis processing loop, where the image analysis needs the 3D synthesis and vice versa. A first facial segmentation is computed, according to which the 3D face model is positioned. Then the feedback loop, implemented from the 3D animated model back to the input pixel segmentation algorithm, helps to correct the few control points that were badly tracked; these are detected by measuring the vertical distance between the lip contour and the corresponding 3D face model. When this distance is too large, we re-enter the image segmentation process and zoom in on a few regions of interest (ROI), where the algorithm is run again with a new set of tuning parameters better suited to the pixel neighborhood context. In this way, the face segmentation is refined in order to extract more precise parameters. This approach is inspired by control theory and closed-loop systems. The contribution of the paper is to use simple image processing techniques and to improve segmentation through the feedback loop. Results show that real-time and robust performance is achievable under real-world conditions, which are two key issues for face and lip tracking applications.
1 INTRODUCTION
We present a complete real-time analysis/synthesis framework that performs lip tracking for the animation of a clone with a single camera in an unconstrained environment (typically a webcam in the office). The approach is based on lip segmentation from a hue component computed within a non-linear color space that is robust to luminosity variations. Inner and outer active contours are extracted, and then interpreted to produce a real-time, realistic animation of the clone's mouth.
For synthetic talking head animation, realistic ren-
dering of lip motion is the key point. The purpose of
this paper is to link in real-time the face image analy-
sis with the synthesis of an animated 3D model of
the head. We focus on the speaker’s lip video seg-
mentation from a mono-camera (webcam or motor-
ized camera) for on-line animation of the mouth of a
clone, without dealing here with the sound informa-
tion (no speech processing). In future developments
of course, speech processing should be coupled with
image processing for optimal audiovisual rendering.
But here, we want to investigate what one can do with
the video only. Our aim is to get a realistic render-
ing of the mouth motion, but not necessarily to get
the most precise lip contour extraction. This means that the image analysis need not be very sophisticated, just good enough for our application (i.e., to get an acceptable rendering for a realistic animation). We propose to compensate for some defects in the analysis by implementing a feedback loop from the 3D synthesis towards the input segmentation process. In
other words, the low-level (early vision) segmentation
process is corrected (on-line at video rate) by some
high-level information coming back from the 3D syn-
thetic face (with all its semantics and animation con-
straints) through a feedback path. Hence, the
key point of our scheme is not to use or develop com-
plex and time consuming algorithms, but to use a rela-
tively simple segmentation algorithm, and to take ad-
vantage of the feedback channel to improve the seg-
mentation quality in areas where it failed at the first
run. Therefore, our efforts are dedicated to the design
of the global real-time processing chain (from video
input towards synthetic clone output).
Most of the approaches proposed in the literature
do not implement this feedback loop. They are of-
ten based on high level methods like AAM (Cootes
et al., 1998), that require large-sized video training
databases for off-line learning. While they are well suited for the whole face (Batur and Hayes, 2005), they are not precise enough for the lips.
Our approach departs from the above mentioned
methods in the sense that we want to use low-level
algorithms that can adapt to any mouth shape without
a priori knowledge about the face. The main advantages of such a strategy are that real-time operation is achievable
and that neither (possibly heavy) learning stages nor
huge databases (to derive a priori knowledge about
faces) are required.
In the literature, some authors have already pro-
posed to use a feedback strategy, either for texture
segmentation (Pichler et al., 1998) (with K-means
classification in the open-loop, smoothing filter in
the closed-loop), or object recognition (Mirmehdi et al., 1999) (with three control strategies at low, intermediate and high level). For video object tracking, (Erdem et al., 2003) propose a scheme with boundary
prediction in the open-loop, boundary correction in
the closed-loop, and performance measures without
ground-truth (Erdem et al., 2004) that yields some
nice properties and results: non-rigid object tracking, robustness to occlusions, no need for training thanks to
automatic weight control, on-line tracking for coarse
estimates. However, pixel-accurate boundary tracking
is only achieved off-line. Moreover, the use of ac-
tive contours requires a proper initialisation (via user
interface as in (Fu et al., 2000)). In (Erdem et al.,
2003), the boundary initialisation on the first frame is
done manually by the user in an interactive (non-real
time) mode. In real-time mode, a change detection al-
gorithm is proposed but not implemented in real-time.
Building on these previous works, we propose a contribution in liptracking that removes the constraint of manual initialisation or training and achieves real-time processing with pixel accuracy.
2 DESCRIPTION OF THE
PROCESSING LOOP
Fig. 1 illustrates our global approach. The forward
stage of our framework is divided into four parts:
1. low-level color segmentation (analysis): it works
in a color space that is not very sensitive to lighting variations and exhibits the skin and lip hue areas very distinctly.
2. active contour positioning (estimation): to delineate both inner and outer lip contours.
3. transmission of geometrical parameters (communication): via the Internet to a distant computer.
4. clone animation (interpretation/synthesis): it uses the lip contours extracted at step 2 and transmitted at step 3.

Figure 1: Feedback scheme for real-time coupling of liptracking with animation.
The backward stage consists of three steps:
5. error measurement at the receiver side;
6. backward transmission of the ROIs that need to be processed again (because of poor segmentation);
7. re-run of the segmentation process on those ROIs (step 1).
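As an illustration, a minimal sketch of how such a closed analysis/synthesis loop could be organised is given below (Python-style sketch; the callables and the error threshold are hypothetical placeholders, not the authors' C implementation):

```python
def run_loop(frames, analyse, synthesize, measure_error, threshold=5.0):
    """Closed-loop driver: forward steps (1)-(4), then backward steps (5)-(7).

    `analyse`, `synthesize` and `measure_error` are callables standing in for the
    processing blocks of Fig. 1; their exact interfaces are an assumption.
    """
    rois = []                                   # ROIs fed back for re-segmentation (step 7)
    for frame in frames:
        contours = analyse(frame, rois)         # steps 1-2: segmentation + snake fitting (re-run on ROIs)
        mesh = synthesize(contours)             # steps 3-4: transmission + clone animation
        errors = measure_error(contours, mesh)  # step 5: vertical distances between snake and model
        rois = [roi for roi, dist in errors if dist > threshold]   # step 6: flag badly tracked regions
    return rois
```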
Our work assumes that we have at our disposal an
automatic tool for face detection and tracking, so that
an optimal framing of the speaker’s face is available
(which is required in the case of videoconferencing
for example). This is part of another project about in-
telligent videoconferencing that is under study at our
laboratory (Laboratoire d'Informatique de l'Université de Pau et Pays de l'Adour, http://liuppa.univ-pau.fr/). If this is not the case, the only constraint
is that the speaker should stay in front of the camera
with little motion (normal behavior in front of a web-
cam). In this paper, we simply ask the speaker to sit in front of the camera so that his face covers the major part of the image. Moreover, we ask him to have the mouth closed (neutral position) on the first frames of the sequence, in order to make a good initialisation of his mouth proportions (width and height). This allows us to take into account the distance between the speaker and the camera, which may vary from one acquisition to another but is important for 3D model calibration and clone animation. Therefore, the whole
image frame is considered as the search area at the
beginning.
Then the speaker should stay at about the same dis-
tance from the camera during the session, so that the
mouth shape proportions remain the same (since the
camera is supposed to be static here).
Lighting conditions are neither calibrated nor constant: they correspond to a realistic office environment (non-uniform lighting, light sources that may be added or removed depending on the daylight).
Figure 2: a) Original image; b) Hue computed from LUX; c) Hue from the HSL color space; d) Hue computed as G/R (often used for face/lip segmentation).

Starting from two previous works, an MRF classification approach based on hue and motion detection (Liévin and Luthon, 2004) and an active contour approach by Delmas et al. (Liévin et al., 1999), we have developed our framework, which connects synthesis with analysis through a feedback loop in order to counteract the defects of the image analysis part and achieve the following goals:
- good performance in an unconstrained environment (ordinary lighting conditions, small head movements allowed);
- real-time running of the analysis/synthesis loop (i.e., at video rate: 25 or 30 fps);
- enough precision on the inner lip contour for realistic animation;
- the ability to cope with hair, beard, race and gender.
2.1 Extraction of Lip Area
For face and lip segmentation, we work in the LUX color space (Liévin and Luthon, 2004). This color space is non-linear with respect to the RGB color components. It helps to reinforce the color contrast while being relatively insensitive to lighting variations. In this color space, most of the color information relative to a human face is coded by the hue component (red chromaticity) in the particular case where R > L (L being the luminance, Eq. 1). Therefore, we derive from the LUX space the simplified hue component U, Eq. (2), which is more discriminating than RGB or HSL for face and lip segmentation (see Fig. 2).
L = (R+1)^0.3 · (G+1)^0.6 · (B+1)^0.1 − 1    (1)

U = 256 · (L+1)/(R+1) if R > L,
U = 255 otherwise or if L < λ.    (2)
The threshold λ is introduced in order to detect very
dark areas like the inner side of the mouth or nostrils.
This will amplify the various gradients computed on
the hue, as explained below.
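For illustration, a minimal NumPy sketch of this hue computation, assuming 8-bit RGB input and the formulation of Eqs. (1)-(2) above (the vectorised form and the value of λ are our assumptions, not the authors' C code):

```python
import numpy as np

def lux_hue(rgb, lam=20):
    """Simplified hue U of the LUX space, Eqs. (1)-(2); lam is the dark-area threshold (illustrative value)."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]   # assumes last axis is (R, G, B)
    lum = (r + 1) ** 0.3 * (g + 1) ** 0.6 * (b + 1) ** 0.1 - 1     # Eq. (1): luminance L
    u = np.full(lum.shape, 255.0)                                  # default: 255 (non-face / dark areas)
    mask = (r > lum) & (lum >= lam)                                # R > L and not too dark
    u[mask] = 256.0 * (lum[mask] + 1) / (r[mask] + 1)              # Eq. (2): red chromaticity
    return u.astype(np.uint8)
```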
Since the hue difference between face and lip is
well contrasted in LUX space, we can easily classify
pixels as lips or face by simple use of the K-means
classification algorithm. The algorithm works with
three classes: lips, face and background; it exploits two types of low-level information: the mean value of the hue in a given neighborhood, and the maximum deviation from this mean value for any pixel in the neighborhood. Compared to MRF-based classification, which induces a high computing time, the choice of the K-means technique is justified by its lower computation cost.
But a good initialisation of the centers of the three
classes is required for proper convergence of the al-
gorithm. Since we want to avoid a heavy learning stage with huge databases, we use a heuristic approach to set ad hoc parameters that only depend on the mean value of the speaker's face hue in the whole image (which is easily estimated through histogram computation).
This allows a good initialisation of the class centers,
and hence a rapid convergence towards the solution
(Fig. 3d).
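A minimal sketch of such a three-class clustering on the hue map is given below, with the class centers seeded heuristically from the estimated face hue (the seeding values are illustrative; the paper actually uses two neighborhood-based features, omitted here for brevity):

```python
import numpy as np

def classify_hue(u, n_iter=10):
    """Toy 3-class K-means on the hue map U (lips / face / background).

    The seeding below is only an illustrative guess: the paper seeds the
    centers from the mean face hue estimated on the U histogram.
    """
    face_mean = np.mean(u[u < 255])                           # rough face-hue estimate
    centers = np.array([0.6 * face_mean, face_mean, 255.0])   # lips / face / background seeds (assumed)
    pixels = u.astype(np.float64).ravel()
    for _ in range(n_iter):
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for k in range(3):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean()       # update each class center
    return labels.reshape(u.shape), centers
```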
In addition, we have implemented a non linear fil-
ter that searches for the lip bounding box (BB) in the
following way: since the mouth is the biggest reddish horizontal form in the face, we look for this kind of
feature inside the face among pixels belonging to the
lip-class. We scan sequentially all pixels belonging to
the lip-class and apply the nonlinear filter mask de-
fined by Eq.(3). If the neighbors belong to the same
class (lip), their filtered value is weighted according
to their position and contributes to the current pixel if
it also belongs to lips. Otherwise, the current pixel is
set to zero.
M(i) = 1 + Σ_{j ∈ ν(i)} a_j · M(j) if U(i) belongs to the lip class,
M(i) = 0 otherwise.    (3)
M is a matrix of size L × C (the image size) initialised to 0, and the a_j (with j ∈ ν(i)) are the coefficients attributed to the different pixels belonging to the causal and connected neighborhood ν(i) of the current pixel i (see Fig. 3a).
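For illustration, a minimal sketch of this one-pass causal filter, with an assumed set of neighborhood coefficients (the actual a_j values of Fig. 3a are not given numerically in the text):

```python
import numpy as np

def lip_accumulation_map(lip_mask):
    """One-pass causal filter of Eq. (3) on a boolean lip-class mask.

    The neighborhood and its weights are illustrative: the left and upper-left
    neighbors are weighted more to favour horizontally elongated lip regions.
    """
    weights = {(0, -1): 1.0, (-1, -1): 0.5, (-1, 0): 0.25}    # (di, dj): a_j (assumed values)
    rows, cols = lip_mask.shape
    m = np.zeros((rows, cols), dtype=np.float64)
    for i in range(rows):                         # raster scan: only causal neighbors are available
        for j in range(cols):
            if not lip_mask[i, j]:
                continue                          # current pixel not in the lip class: M stays 0
            acc = 1.0
            for (di, dj), a in weights.items():
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and lip_mask[ni, nj]:
                    acc += a * m[ni, nj]          # accumulate weighted causal neighbors (Eq. 3)
            m[i, j] = acc
    # the pixel with the largest M is a seed on the biggest reddish horizontal form
    return m, np.unravel_index(np.argmax(m), m.shape)
```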
This nonlinear filter technique yields in a single
pass (rapid processing) the localisation of the lips.
The mask gives more importance to elongated forms
in the horizontal direction, which allows us to distinguish between the ears and the mouth, which are usually the two biggest reddish zones in a face (Fig. 3d). After filtering, we get a map of the face with the highest values located on the largest reddish forms, among which the mouth has the highest value of the map
(usually at the lower left edge of the lips, see white
cross in Fig. 3d).
Figure 3: a) Hue U with the non-linear filter mask coefficients a_j; b) Original color image, with bounding box and outer lip contour points; c) Hue histogram; d) 3-class map given by the K-means algorithm, with the pixel having the highest value of M shown in white.

Figure 4: Liptracking for real-time animation of a 3D face model: a) Video input: color image; b) Initialisation of the outer snake obtained from the bounding box (in blue); c) Inner and outer snakes after convergence; d) Animated clone: synthetic talking head position.

This simple technique gives an approximate BB for mouth positioning in the image plane (Fig. 3b). However, it may fail when the lip corners are not well detected or when the lower and upper lips form two separate areas (this may happen, e.g., in the case of the French phoneme [a]). To cope with those situations,
a lip tracker is added: it is based on the Lucas-Kanade algorithm applied to a few relevant points detected on the outer lip contour. Since the speaker is supposed to have his mouth closed in the first image, we are guaranteed to have a single lip area at the beginning, so that the outer lip contours can be tracked properly afterwards, even if the two lips are no longer connected areas in the video sequence. Using this simple motion estimation technique, instead of motion detection as in (Liévin and Luthon, 2004), yields better robustness to lighting variations.
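For illustration, a minimal sketch of this point tracking using the pyramidal Lucas-Kanade implementation of OpenCV (an assumed implementation choice; the window size and pyramid depth are illustrative):

```python
import cv2
import numpy as np

def track_lip_points(prev_gray, gray, points):
    """Track a few outer-lip points from the previous frame to the current one
    with pyramidal Lucas-Kanade (OpenCV). `points` is an (N, 2) float array."""
    p0 = points.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, p0, None, winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1                      # keep only the points tracked successfully
    return p1.reshape(-1, 2)[ok], ok
```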
Having this complementary information, we are
able to complete the face classification map and get
the whole mouth area properly: we look for one or
two forms close to the tracked points and we link them
together. We then get the optimal mouth BB by a simple scan of the connected lip-hue component that includes the pixel detected as the most likely to lie on the lip border (the point giving the maximum value of M, i.e., located on the biggest horizontal reddish form,
Fig. 3d).
2.2 Snakes and Clone Animation
For estimating the outer lip border, one active con-
tour or snake (Kass et al., 1987) is initialised with the
BB detected as explained above.
This snake is made of a finite number of control
points (typ. 2 × 16 = 32 points) that are forced to
undergo only vertical displacements during iterations.
The points are initialised on cubic curves computed
from the BB and from the lip map (to position the lip
corners), Fig. 4b. The forces used for snake conver-
gence are the following:
- an internal force that controls elasticity and curvature (defined in the classical way);
- an external force that specifies the features that should attract the snake (namely spatial gradients computed both on the hue and on the luminance maps);
- a constraint force that is specific to the problem at hand (the snake is forced to converge towards the center of gravity of the BB).
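For illustration, a minimal sketch of one explicit snake iteration under these three forces, with vertical-only displacements as stated above (the force weights and the edge-map derivative are illustrative placeholders, not the actual tuning of the paper):

```python
import numpy as np

def snake_step(ys, xs, grad_y, bb_center_y, alpha=0.2, beta=0.05, gamma=1.0, kappa=0.05):
    """One explicit update of the snake's vertical coordinates ys (the xs stay fixed).

    grad_y: vertical derivative of an edge map (e.g. gradient magnitude of hue and luminance);
    alpha/beta: elasticity/curvature weights; gamma: external force weight; kappa: constraint
    force pulling towards the BB center. All weights are illustrative. np.roll assumes a
    closed contour.
    """
    rows = np.clip(ys.astype(int), 0, grad_y.shape[0] - 1)
    cols = np.clip(xs.astype(int), 0, grad_y.shape[1] - 1)
    elastic = np.roll(ys, 1) + np.roll(ys, -1) - 2 * ys              # 2nd difference: elasticity
    bending = -(np.roll(ys, 2) - 4 * np.roll(ys, 1) + 6 * ys
                - 4 * np.roll(ys, -1) + np.roll(ys, -2))             # 4th difference: curvature
    external = gamma * grad_y[rows, cols]                            # climb towards strong edges
    constraint = kappa * (bb_center_y - ys)                          # pull towards the BB gravity center
    return ys + alpha * elastic + beta * bending + external + constraint
```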
After convergence of the outer snake, another snake (the inner one) is initialised on the outer one, then shrunk by an anisotropic scaling with respect to the mouth center, taking into account the actual thickness of the lips.
After convergence of the inner snake towards the
inner lip contour, the interpretation step for our application is readily done: we are able to compute various geometrical features from the control points of the two snakes, which are then used as input
parameters for the 3D head model. This computation
of animation parameters is of course dependent on the
3D model used. Our talking head is a 3D clone from ICP (thanks to the Institut de la Communication Parlée, Grenoble, France, http://www.icp.inpg.fr). It is built with 275 mobile points that allow realistic synthesis of mouth motion thanks to six animation parameters (dedicated to the visemes and phonemes of the French language) (Benoît et al., 1992). In our case, a simple system of linear equations transforms the snakes' control point coordinates into animation parameters that are taken as input by the 3D model. Actually, we only use four control points: the left and right lip corners, and the central upper and lower lip positions.
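As an illustration, a minimal sketch of such a geometric mapping from the four retained control points (the six actual ICP animation parameters and their linear relations are not reproduced here; the two quantities below are illustrative):

```python
import numpy as np

def mouth_parameters(left, right, top, bottom, neutral_width):
    """Illustrative geometric parameters from the four lip control points.

    left/right/top/bottom are (x, y) points; neutral_width comes from the
    closed-mouth initialisation frames described in Section 2.
    """
    width = np.linalg.norm(np.subtract(right, left))    # lip-corner to lip-corner distance
    height = abs(bottom[1] - top[1])                    # central vertical mouth opening
    return {
        "stretch": width / neutral_width,               # horizontal lip stretching w.r.t. neutral mouth
        "opening": height / neutral_width,              # vertical opening, normalised by the neutral width
    }
```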
3 EXPERIMENTAL RESULTS
We present in this section two types of results:
- image analysis results that show the performance of lip segmentation, BB detection and lip contour estimation (Fig. 5);
- image synthesis results that show various positions of the animated model computed/interpreted from the points estimated in the first step, and also some cases of failure that are corrected by the feedback loop.
Figure 5: a) Snakes after convergence; b) lip hue classifica-
tion map; c) gradients; d) color image.
3.1 Analysis Result: Segmentation
Fig. 2 shows the benefit of using color components derived from the non-linear LUX color space, especially for skin hue segmentation under adverse lighting conditions: one can see that the lip hue is clearly distinct from the face skin hue (see Fig. 2b), even in the case of non-uniform lighting (here the face is lit from the side).
The algorithm is robust to lighting variations:
abrupt light changes (like switching a lamp on or off in the room) have little influence on the quality of segmentation,
since relevant hue mean values (for each of the three
classes) are re-estimated on-line as each new frame is
acquired, so that the classification algorithm adapts to
light changes over time.
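For illustration, a minimal sketch of such a per-frame re-estimation, assuming a simple exponential blending of the previous class means with the means measured on the new frame (our assumption; the exact update rule is not detailed in the paper):

```python
import numpy as np

def update_class_means(u, labels, prev_means, rate=0.3):
    """Re-estimate the hue mean of each class on the new frame and blend it with
    the previous estimate, so the classifier follows lighting changes over time."""
    means = prev_means.copy()
    for k in range(len(prev_means)):
        if np.any(labels == k):
            frame_mean = u[labels == k].mean()                      # mean hue of class k in this frame
            means[k] = (1 - rate) * prev_means[k] + rate * frame_mean
    return means
```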
Since our hue-based algorithm does not rely on
any a priori knowledge about the face (apart from the
mean value of hue that is estimated online), it learns
special features like a beard or mustache at initialisation (a case of a hairy face is illustrated in Fig. 6). This lip segmentation process yields a precise location of the mouth in the face, with a BB that completely includes the lips, which gives a good initialisation
for the snakes.
3.2 Synthesis Result: Animation
After BB detection and snake convergence on the lip
contours (Fig. 4 and 5), one can interpret the detected
points for lip animation of the 3D model.
The key point is that even if the lip contours are not very well estimated (poor initial segmentation), this is not prohibitive for a realistic animation. Indeed, the comparison between the estimated snake locations and the 3D model positions allows us to make a backward correction (through the feedback loop).
As regards the inner lip contour, the teeth are not a problem: they even amplify the gradients, ensuring a better convergence (see Fig. 6). However, another issue that has to be addressed is the presence of reflected highlights on the lips (Fig. 6). The proposed feedback processing helps to solve this problem (Fig. 7). As a matter of fact, the outer lip contours are very well estimated when the mouth is wide open. Conversely, the inner lip contours are well estimated when the mouth is closed or in the case of protrusion. We may use this fact in conjunction with the 3D result to guide the re-computation where the segmentation initially failed. This is what we are currently investigating.

Figure 6: Typical analysis/synthesis results for different visemes corresponding to French phonemes: top) vowels [o] and [a] with corresponding synthetic clone positions; bottom) vowels [e] and [] plus the neutral position.
4 DISCUSSION AND FUTURE
WORK
We have shown that real-time realistic animation of a synthetic talking head is achievable with:
- a rapid initialisation without a heavy learning stage;
- a few points tracked by a fast segmentation scheme;
- the use of a feedback loop to correct some segmentation defects.
Currently, only vowels have been tested carefully. The clone reproduces the speaker's lip motion with very nice precision. It even moves too fast; we have to slow down the animation process in order to reproduce realistic human lip motion. In fact, this will be very easy to implement because of the feedback control. As in the case of control systems (cf. PID controllers in control engineering), one can give the global chain the right time constant and damping factor in order to guarantee stability and to reach the best compromise between speed and precision.
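For illustration, a minimal sketch of a first-order low-pass filter on an animation parameter, whose time constant would set the dynamics of the chain (a hypothetical add-on, not part of the current implementation):

```python
class ParameterSmoother:
    """First-order low-pass on an animation parameter, giving the loop a time constant."""

    def __init__(self, tau=0.1, dt=1.0 / 30.0):
        self.alpha = dt / (tau + dt)   # smoothing factor from the time constant and frame period
        self.value = None

    def update(self, measurement):
        if self.value is None:
            self.value = measurement   # initialise on the first frame
        else:
            self.value += self.alpha * (measurement - self.value)
        return self.value
```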
This work demonstrates that one can build a complete analysis/synthesis chain that works in real-time for 3D head animation, without using sound information or face databases for learning.
Figure 7: Regularisation through the 3D model: a) Good match between segmentation and model; b) Bad match: a control point of the inner snake is clearly erroneous and is corrected by the corresponding position of the 3D mesh; c) d) After correction through feedback: front view with the 3D lip model reshaped.

The algorithmic part is made of relatively simple (and hence rapid) image processing steps: the LUX color transform for robustness to lighting conditions and lip separation, K-means classification with 3 hue classes, a non-linear filter for the initial BB localisation, motion estimation via the LK algorithm for bounding box tracking, snake initialisation deduced from the BB position, and snake convergence towards the inner and outer lip contours. The (possibly) badly segmented points (detected by a distance measure on the 3D model) are re-processed (thanks to the closed loop).
Currently, we simply replace the bad point by the cor-
responding point of the model. But we will now im-
plement a reprocessing of that pixel (zooming in on its neighborhood and local re-computation in that small area).
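For illustration, a minimal sketch of this per-point check and correction, assuming the projected 3D model lip contour is available in image coordinates (the threshold and data layout are our assumptions):

```python
import numpy as np

def correct_snake_points(snake_pts, model_pts, max_dist=6.0):
    """Compare each snake control point with the corresponding projected 3D model
    point (vertical distance) and flag/replace the ones that deviate too much."""
    snake_pts = np.asarray(snake_pts, dtype=np.float64)   # (N, 2) image coordinates (x, y)
    model_pts = np.asarray(model_pts, dtype=np.float64)
    dist = np.abs(snake_pts[:, 1] - model_pts[:, 1])      # vertical distance, as in the paper
    bad = dist > max_dist                                  # badly segmented control points
    corrected = snake_pts.copy()
    corrected[bad] = model_pts[bad]                        # current strategy: take the model point
    return corrected, bad                                  # `bad` marks the ROIs to re-segment later
```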
Indeed, we still have some computation time left to spend on this reprocessing: the whole analysis algorithm, implemented in non-optimised C code on an i386 processor at 1.4 GHz, works in real-time (i.e., at a processing rate better than 30 Hz). This is 30 times faster than the algorithm presented in (Liévin and Luthon, 2004).
Another direction of our current research is to seg-
ment not only the lips, but also other face features
(namely nostrils, eyes, eyebrows and ears). This will
help to get other relevant points for 3D model scal-
ing (cf. user-dependent facial geometry taken into account at the initialisation step), for regularisation of poor
input data (cf. re-segmentation through the feedback
loop that adds rigidity constraints and other seman-
tics to the ill-posed problem at the pixel side) and for
more realistic animation (cf. facial expressions during speech).
Moreover, having more information on the whole
face may enable a better understanding of some spo-
ken phonemes. For example, to distinguish between the French phonemes [e] and [y], the nose position is very important for animation: indeed, from a visual perception viewpoint, the mouth has almost the same shape in both cases; the only difference is that for the phoneme [y], the mouth is closer to the nose.
Finally, as we want to be able to animate various
clones and propose a generic solution, we are also
working on the use of MPEG-4-compliant 3D models (using FDPs and FAPs, i.e., facial definition and animation parameters).
REFERENCES
Batur, A. U. and Hayes, M. H. (2005). Adaptive active ap-
pearance models. IEEE Trans. on Image Processing,
14(11):1707–1721.
Benoît, C., Lallouache, T., Mohamadi, T., and Abry, C. (1992). A set of French visemes for visual speech synthesis. In Bailly, G., Benoît, C., and Sawallis, T., editors, Talking Machines: Theories, Models and Designs, pages 485-504, Amsterdam, North-Holland. Elsevier Science Publishers B.V.
Cootes, T., Edwards, G., and Taylor, C. (1998). Ac-
tive appearance models. In European Conference on
Computer Vision, ECCV, volume 2, pages 484–498.
Springer-Verlag.
Erdem, C. E., Sankur, B., and Tekalp, A. M. (2004).
Performance measures for video object segmentation
and tracking. IEEE Trans. on Image Processing,
13(7):937–951.
Erdem, C. E., Tekalp, A. M., and Sankur, B. (2003). Video
object tracking with feedback of performance mea-
sures. IEEE Trans. on Circuits and Systems for Video
Technology, 13(4):310–324.
Fu, Y., Erdem, A. T., and Tekalp, A. M. (2000). Track-
ing visible boundary of objects using occlusion adap-
tive motion snake. IEEE Trans. on Image Processing,
9(12):2051–2060.
Kass, M., Witkin, A., and Terzopoulos, D. (1987). Snakes:
Active contour models. International Journal of Com-
puter Vision, pages 321–331.
Liévin, M., Delmas, P., Coulon, P. Y., Luthon, F., and Fristot, V. (1999). Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme. In IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS'99), volume 1, pages 691-696, Firenze, Italy.
Liévin, M. and Luthon, F. (2004). Nonlinear color space and spatiotemporal MRF for hierarchical segmentation of face features in video. IEEE Transactions on Image Processing, 13(1):63-71.
Mirmehdi, M., Palmer, P. L., Kittler, J., and Dabis, H.
(1999). Feedback control strategies for object recog-
nition. IEEE Trans. on Image Processing, 8(8):1084–
1101.
Pichler, O., Teuner, A., and Hosticka, B. J. (1998). An un-
supervised texture segmentation algorithm with fea-
ture space reduction and knowledge feedback. IEEE
Trans. on Image Processing, 7(1):53–61.