(Kalberer et al., 2001; Kalberer et al., 2002a) and by Kshirsagar (2001), but for
fewer points on the face. Moreover, their Viseme Spaces were based on PCA
(Principal Component Analysis), not ICA. A justification for using ICA rather
than PCA follows later.
Straightforward point-to-point navigation as a way of concatenating visemes
would yield jerky motions. Moreover, when generating the temporal samples,
these may not precisely coincide with the pace at which visemes change. Both
problems are solved by fitting splines to the Viseme Space coordinates of the
visemes. This yields smoother changes and allows us to interpolate in order to
get the facial expressions needed at the fixed times of subsequent frames. We
used NURBS curves of order three.
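As a sketch of this resampling step, the snippet below fits an interpolating B-spline of order three (degree two) through hypothetical Viseme Space coordinates attached to viseme times, and then evaluates it at the fixed frame times of the animation. The keyframe times, coordinate values, and the 25 fps frame rate are illustrative assumptions, not values from the text:

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Hypothetical viseme keyframes: times (seconds) from the phoneme
# alignment, values are one Viseme Space (ICA) coordinate per viseme.
key_times = np.array([0.00, 0.12, 0.25, 0.40, 0.55])
key_values = np.array([0.0, 0.8, 0.3, -0.5, 0.1])

# Fit a degree-2 (order-three) B-spline; s=0 makes it pass exactly
# through the keyframes, giving smooth changes between visemes.
tck = splrep(key_times, key_values, k=2, s=0)

# Resample at the fixed frame times of a 25 fps animation, which in
# general do not coincide with the times at which visemes change.
frame_times = np.arange(0.0, 0.55, 1.0 / 25.0)
frame_values = splev(frame_times, tck)
```

The same fit is repeated independently for each Viseme Space coordinate, and the resampled values together give the facial expression at every frame.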
A word on the implementation of co-articulation effects is in order here. A
distinction is made between vowels and labial consonants on the one hand, and
the remainder of the visemes on the other. The former impose their deformations
much more strictly onto the animation than the latter, which can be pronounced
with a lot of visual variation. In terms of the spline fitting, this means that the
animation trajectory will move precisely through the former visemes and will only
be attracted towards the latter. Figure 13 illustrates this for one Viseme Space
coordinate.
Initially a spline is fitted through the values of the corresponding component for
the visemes of the former category. Then, its course is modified by bending it
towards the coordinate values of the visemes in the latter category. This second
category is subdivided into three subcategories: (1) somewhat labial consonants
like those corresponding to the /ch,jh,sh,zh/ viseme pull stronger than (2) the
viseme /f,v/ , which in turn pulls stronger than (3) the remaining visemes of the
second category. In all three cases the same influence is given to the rounded
and widened versions of these visemes. The distance between the current spline
(determined by vowels and labial consonants) and its position if it had to go
through these visemes is reduced to (1) 20%, (2) 40%, and (3) 70%, respectively.
These are also shown in Figure 13. These percentages were set by comparing
animations against 3D ground truth. If an example face is animated with the
same audio track used for training, such a comparison can easily be made, and
deviations can be minimized by optimizing these parameters. So far, only
distances between lip positions were taken into account.
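The attraction step above can be sketched as follows. Under the reduction scheme described in the text, the spline value at a weak viseme is pulled toward that viseme's coordinate so that only a fraction of the original distance remains: 20% for the /ch,jh,sh,zh/ group, 40% for /f,v/, and 70% for the remaining visemes. The function name and calling convention are illustrative, not from the original implementation:

```python
# Fraction of the original distance that remains after attraction,
# per subcategory of the weaker visemes (values from the text).
RETAINED_FRACTION = {
    "ch_jh_sh_zh": 0.20,  # somewhat labial consonants: strongest pull
    "f_v": 0.40,          # intermediate pull
    "other": 0.70,        # remaining weak visemes: weakest pull
}

def attract(spline_value: float, viseme_value: float, group: str) -> float:
    """Bend the spline toward a weak viseme's Viseme Space coordinate.

    The new value lies between the current spline (determined by vowels
    and labial consonants) and the viseme's own coordinate, keeping only
    the retained fraction of the original distance.
    """
    r = RETAINED_FRACTION[group]
    return viseme_value + r * (spline_value - viseme_value)
```

For example, if the fitted spline passes at distance 1.0 from an /f,v/ viseme coordinate, the bent curve passes at distance 0.4 from it; the rounded and widened versions of a viseme receive the same influence.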
Modifications by the Animator
A tool that automatically generates a face animation which the animator then has
to take or leave is a source of frustration rather than a help. The computer cannot
replace the creative component that the human expert brings to the animation.