However, both Williams [Williams, 1990] and Guenter et al. [Guenter et al., 1998]
required that intrusive markers be placed on the face of the subject, which limits the
conditions under which their approaches can be used. Other approaches [Eisert et al.,
2000, Essa and Pentland, 1997, Tao and Huang, 1999, Terzopoulos and Waters,
1990a] used more sophisticated facial motion analysis techniques to avoid
markers. The quality of the animation depends on the estimated facial
motions. The key issue is therefore to achieve robust and accurate face motion
estimation from noisy image motion, and estimating facial motions accurately and
robustly without markers remains a challenging problem.
1.2 Text-driven face animation
Synthesizing facial motions during speech is useful in many applications,
such as e-commerce [Pandzic et al., 1999] and computer-aided education [Cole
et al., 1999]. One type of input for this “visual speech” synthesis is text. First,
the text is converted into a sequence of phonemes by a Text-To-Speech (TTS) sys-
tem. Then, the phoneme sequence is mapped to corresponding facial shapes,
called visemes. Finally, a smooth temporal facial shape trajectory is synthe-
sized, taking the co-articulation effect in speech into account. It is combined with the
synthesized audio-only speech signal from the TTS system to form the final animation. Waters
et al. [Waters and Levergood, 1993] and Hong et al. [Hong et al., 2001a] gener-
ated the temporal trajectory using sinusoidal interpolation functions. Pelachaud
et al. [Pelachaud et al., 1991] used a look-ahead co-articulation model. Cohen
and Massaro [Cohen and Massaro, 1993] adopted the Löfqvist gestural production
model [Lofqvist, ] as the co-articulation model and interactively designed its
explicit form based on observations of real speech.
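As a concrete illustration of this pipeline, the sketch below maps a phoneme sequence to viseme targets and blends consecutive targets with a raised-cosine (sinusoidal) interpolation, in the spirit of the sinusoidal interpolation functions mentioned above. The phoneme set, the single mouth-opening parameter, and the viseme values are illustrative placeholders rather than the parameterization of any cited system; a real system would take phoneme identities and durations from the TTS front end and apply a full co-articulation model such as Cohen and Massaro's dominance functions.

```python
import numpy as np

# Hypothetical viseme table: each phoneme maps to a single mouth-opening
# parameter in [0, 1].  Real systems use full facial-shape parameter vectors
# (e.g. MPEG-4 FAPs) and a much larger phoneme set.
VISEME_TARGETS = {
    "sil": 0.0,   # silence: mouth closed
    "aa":  0.9,   # open vowel
    "iy":  0.4,   # spread vowel
    "m":   0.05,  # bilabial closure
    "f":   0.3,   # labiodental
}

def synthesize_trajectory(phonemes, durations, fps=30):
    """Generate a smooth mouth-parameter trajectory from a phoneme sequence.

    Consecutive viseme targets are blended with a raised-cosine (sinusoidal)
    interpolation, roughly in the spirit of Waters and Levergood [1993] and
    Hong et al. [2001a].  `durations` are per-phoneme durations in seconds,
    as a TTS front end would provide.
    """
    targets = [VISEME_TARGETS[p] for p in phonemes]
    frames = []
    for i in range(len(targets) - 1):
        n = max(1, int(round(durations[i] * fps)))
        # t goes 0 -> 1 across the phoneme; the cosine ramp eases in and out.
        t = np.linspace(0.0, 1.0, n, endpoint=False)
        w = 0.5 - 0.5 * np.cos(np.pi * t)
        frames.append((1.0 - w) * targets[i] + w * targets[i + 1])
    # Sketch simplification: the final viseme is held for a single frame.
    frames.append(np.array([targets[-1]]))
    return np.concatenate(frames)

# Example: the syllable "ma" preceded and followed by silence.
trajectory = synthesize_trajectory(
    ["sil", "m", "aa", "sil"], durations=[0.1, 0.08, 0.2, 0.1])
print(trajectory.shape, trajectory.min(), trajectory.max())
```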
1.3 Speech-driven face animation
Another type of input for “visual speech” synthesis is the speech signal. For
speech-driven face animation, the main research issue is the audio-to-visual
mapping. The audio information is usually represented by acoustic features
such as linear predictive coding (LPC) cepstrum or Mel-frequency cepstral coef-
ficients (MFCC). The visual information is usually represented by the param-
eters of facial motion models, such as the weights of AUs, MPEG-4 FAPs,
or the coordinates of model control points. The mapping is learned from an
audio-visual training data set, which is collected in the following way. The
facial movements of talking subjects are tracked either manually or automati-
cally, and the tracking results and the associated audio tracks are collected as the
audio-visual training data.
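The sketch below illustrates this training setup under stated assumptions: frame-level MFCC features are extracted with librosa (an assumed dependency), and a simple ridge regression from scikit-learn stands in for the learned frame-wise audio-to-visual mapping. The feature choice, the regressor, and the helper names (extract_mfcc, train_audio_to_visual, animate) are illustrative; the systems surveyed here use richer models and temporal context, but the data flow is the same.

```python
import numpy as np
import librosa                      # assumed available for MFCC extraction
from sklearn.linear_model import Ridge

def extract_mfcc(wav_path, n_mfcc=13, hop_length=512):
    """Frame-level MFCC features for one utterance: (n_frames, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T

def train_audio_to_visual(X, Y):
    """Fit a frame-wise audio-to-visual regressor.

    X: (n_frames, n_mfcc) acoustic features; Y: (n_frames, n_params) visual
    parameters (e.g. MPEG-4 FAPs or control-point coordinates) obtained by
    tracking the talking subject, frame-aligned with X as described above.
    A linear ridge regression stands in for the learned mapping.
    """
    model = Ridge(alpha=1.0)
    model.fit(X, Y)
    return model

def animate(model, wav_path):
    """Predict a visual-parameter trajectory for new speech audio."""
    return model.predict(extract_mfcc(wav_path))
```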
Some speech-driven face animation approaches use phonemes or words as
intermediate representations. Lewis [Lewis, 1989] used linear prediction to
recognize phonemes. The recognized phonemes are associated with mouth shapes.
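A minimal sketch of this phoneme-as-intermediate-representation step is given below. The phoneme recognizer itself (for Lewis, based on linear prediction) is treated as a black box, and each recognized phoneme is simply looked up in a hypothetical viseme table like the one in the earlier sketch.

```python
# Hypothetical phoneme-to-mouth-parameter table (see the earlier sketch).
VISEME_TARGETS = {"sil": 0.0, "aa": 0.9, "m": 0.05}

def phoneme_based_animation(recognized_phonemes):
    """Map recognized phonemes to mouth parameters.

    `recognized_phonemes` stands in for the output of a phoneme recognizer
    (e.g. the linear-prediction-based recognition in Lewis [1989]); the
    recognizer itself is not implemented here.
    """
    # Unknown phonemes fall back to a closed mouth.
    return [VISEME_TARGETS.get(p, 0.0) for p in recognized_phonemes]

print(phoneme_based_animation(["sil", "m", "aa", "sil"]))
```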