ever, both Williams [Williams, 1990] and Guenter et al. [Guenter et al., 1998]
required that intrusive markers be placed on the face of the subject. This limits
the conditions under which their approaches can be used. Other approaches [Eisert et al.,
2000, Essa and Pentland, 1997, Tao and Huang, 1999, Terzopoulos and Wa-
ters, 1990a] used more sophisticated facial motion analysis techniques to avoid
using markers. The quality of the animation depends on the estimated facial
motions. Therefore, the key issue is to achieve robust and accurate face motion
estimation from noisy image motion. However, it is still a challenging problem
to estimate facial motions accurately and robustly without using markers.
Text-driven face animation
Synthesizing facial motions during speech is useful in many applications,
such as e-commerce [Pandzic et al., 1999], computer-aided education [Cole
et al., 1999]. One type of input of this “visual speech” synthesis is text. First,
the text is converted into a sequence of phonemes by Text-To-Speech (TTS) sys-
tem. Then, the phoneme sequence is mapped to corresponding facial shapes,
called visemes. Finally, a smooth temporal facial shape trajectory is synthe-
sized considering the co-articulation effect in speech. It is combined with the
synthesized audio-only speech signal from TTS as the final animation. Waters
et al. [Waters and Levergood, 1993] and Hong et al. [Hong et al., 2001a] gener-
ated the temporal trajectory using sinusoidal interpolation functions. Pelachaud
et al. [Pelachaud et al., 1991] used a look-ahead co-articulation model. Cohen
and Massaro [Cohen and Massaro, 1993] adopted the Löfqvist gestural production
model [Lofqvist, ] as the co-articulation model and interactively designed its
explicit form based on observation of real speech.
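The text-to-viseme pipeline above can be sketched in a few lines. The following is a minimal illustration, not any of the cited systems: the phoneme labels, viseme weights, and durations are hypothetical placeholders, and the sinusoidal (raised-cosine) blend stands in for the interpolation functions used by Waters et al. and Hong et al.

```python
import math

# Hypothetical phoneme-to-viseme table (one mouth-opening weight per
# phoneme); real systems map to full facial shapes and use larger inventories.
VISEME_WEIGHTS = {
    "sil": 0.0,   # silence: mouth closed
    "aa":  1.0,   # open vowel: mouth wide open
    "m":   0.05,  # bilabial: lips nearly pressed
}

def synthesize_trajectory(phonemes, durations, fps=25):
    """Blend between consecutive viseme targets with a raised-cosine ease,
    giving a smooth trajectory (a crude stand-in for co-articulation)."""
    frames = []
    pairs = list(zip(phonemes, phonemes[1:]))  # consecutive viseme pairs
    for (p0, p1), dur in zip(pairs, durations):
        n = max(1, int(dur * fps))             # frames in this transition
        w0, w1 = VISEME_WEIGHTS[p0], VISEME_WEIGHTS[p1]
        for i in range(n):
            t = i / n
            alpha = 0.5 - 0.5 * math.cos(math.pi * t)  # 0 -> 1 ease
            frames.append(w0 + alpha * (w1 - w0))
    return frames

# Example: silence -> open vowel -> closed lips -> silence
traj = synthesize_trajectory(["sil", "aa", "m", "sil"], [0.1, 0.2, 0.1])
```

In a full system each frame would drive a facial model (e.g. control-point displacements) and be played back in sync with the TTS audio; look-ahead models such as Pelachaud et al.'s additionally adjust each target based on upcoming phonemes.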
Speech-driven face animation
Another type of input for “visual speech” synthesis is speech signals. For
speech-driven face animation, the main research issue is the audio-to-visual
mapping. The audio information is usually represented by acoustic features
such as linear predictive coding (LPC) cepstrum and Mel-frequency cepstral
coefficients (MFCC). The visual information is usually represented by the
parameters of facial motion models, such as the weights of AUs, MPEG-4
FAPs, the coordinates of model control points, etc. The mappings are learned
from an audio-visual training data set, which is collected in the following way. The
facial movements of talking subjects are tracked either manually or automati-
cally. The tracking results and the associated audio tracks are collected as the
audio-visual training data.
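As a toy illustration of learning such a mapping from audio-visual pairs, the sketch below fits a linear map from acoustic features to facial motion parameters by least squares. The dimensions (13 MFCC coefficients in, 5 motion parameters out) and the synthetic training data are assumptions for the example; published systems use nonlinear models and real tracked data.

```python
import numpy as np

# Synthetic stand-in for an audio-visual training set:
# 200 frames of 13-dim acoustic features paired with 5-dim visual parameters.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 13))                     # acoustic features
W_true = rng.standard_normal((13, 5))                  # hidden "true" mapping
V = A @ W_true + 0.01 * rng.standard_normal((200, 5))  # visual parameters

# Learn the audio-to-visual mapping: solve V ~= A @ W in the least-squares
# sense over the training pairs.
W, *_ = np.linalg.lstsq(A, V, rcond=None)

def audio_to_visual(mfcc_frame, W=W):
    """Predict facial motion parameters from one acoustic feature frame."""
    return mfcc_frame @ W
```

At synthesis time, each incoming audio frame's features are pushed through the learned map to drive the face model frame by frame; the linear map here is the simplest possible choice, chosen only to make the training/mapping split concrete.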
Some speech-driven face animation approaches use phonemes or words as
intermediate representations. Lewis [Lewis, 1989] used linear prediction to
recognize phonemes. The recognized phonemes are associated with mouth