positions to provide key frames for face animation. However, the phoneme
recognition rate of linear prediction is low. Video Rewrite [Bregler et al., 1997]
trains hidden Markov models (HMMs) [Rabiner, 1989] to automatically label
phonemes in both the training audio track and the new audio track. It models short-
term mouth co-articulation using triphones. The mouth images for a new audio
track are generated by reordering the mouth images in the training footage,
which requires a very large database. Video Rewrite is an offline approach
and needs large computational resources. Chen and Rao [Chen and Rao, 1998]
train HMMs to segment the audio feature vectors of isolated words into state
sequences. Given the trained HMMs, the state probability for each time stamp
is evaluated using the Viterbi algorithm. The estimated visual features of all
states can be weighted by the corresponding probabilities to obtain the final
visual features, which are used for lip animation. The advantage of using
intermediate representations is that one can draw on existing knowledge about
speech recognition and about the phoneme-to-visual mapping used in text-driven
animation. The disadvantage is that recognizing phonemes or words requires
sufficiently long context information, so these approaches cannot achieve
real-time speech-driven face animation.
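A minimal sketch of this probability-weighted estimation, assuming the trained HMM is summarized by its log emission scores, log transition matrix and log priors, plus one visual feature vector per state (the function names and the exact weighting are illustrative, not taken from the cited papers):

```python
import numpy as np

def viterbi_state_weights(log_emission, log_trans, log_prior):
    """Per-frame state weights derived from the Viterbi (max-product) trellis.
    log_emission: (T, S) array of log p(audio_t | state s) under the trained HMM."""
    T, S = log_emission.shape
    delta = np.empty((T, S))
    delta[0] = log_prior + log_emission[0]
    for t in range(1, T):
        # best log score of any state sequence ending in state s at time t
        delta[t] = np.max(delta[t - 1][:, None] + log_trans, axis=0) + log_emission[t]
    w = np.exp(delta - delta.max(axis=1, keepdims=True))   # stabilize before exp
    return w / w.sum(axis=1, keepdims=True)                # (T, S), rows sum to 1

def estimate_visual_features(log_emission, log_trans, log_prior, state_visual_feats):
    """Weight each state's visual feature vector (e.g. lip parameters) by its
    per-frame probability and sum, giving one visual feature per audio frame."""
    weights = viterbi_state_weights(log_emission, log_trans, log_prior)
    return weights @ state_visual_feats                     # (T, visual_dim)
```

In practice the per-state visual features would be learned from synchronized audio-visual training data, as in the papers cited above.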
Another HMM-based approach tries to directly map audio patterns to facial
motion trajectories. Voice Puppetry [Brand, 1999] uses an entropy minimiza-
tion algorithm to train HMMs for the audio to visual mapping. The mapping
estimates a probability distribution over the manifold of possible facial mo-
tions from the audio stream. An advantage of this approach is that it does not
require automatically recognizing speech as high-level meaningful symbols (e.g.,
phonemes, words), for which it is very difficult to obtain a high recognition
rate. Nevertheless, this approach is an offline method.
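One way to picture the output of such a mapping is as a per-frame distribution over facial configurations rather than a single point estimate. The sketch below assumes an already trained HMM whose states carry Gaussian visual outputs with diagonal covariance; it shows only the smoothing step, not the entropy-minimization training or the trajectory synthesis, and all names are illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def facial_motion_distribution(log_emission, log_trans, log_prior,
                               visual_mean, visual_var):
    """log_emission: (T, S) log p(audio_t | state s); visual_mean, visual_var:
    (S, D) per-state Gaussian visual outputs.  Returns a per-frame mean and
    variance of the facial configuration given the whole audio track, which is
    why this kind of mapping is inherently offline."""
    T, S = log_emission.shape
    alpha = np.empty((T, S))                       # forward pass
    alpha[0] = log_prior + log_emission[0]
    for t in range(1, T):
        alpha[t] = log_emission[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
    beta = np.zeros((T, S))                        # backward pass over the full utterance
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(log_trans + (log_emission[t + 1] + beta[t + 1])[None, :], axis=1)
    gamma = alpha + beta                           # unnormalized log posteriors
    gamma = np.exp(gamma - logsumexp(gamma, axis=1, keepdims=True))
    mean = gamma @ visual_mean                     # expected facial configuration per frame
    var = gamma @ (visual_var + visual_mean ** 2) - mean ** 2   # total variance per frame
    return mean, var
```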
Other approaches attempt to generate instantaneous lip shapes directly from
each audio frame using vector quantization, Gaussian mixture models, or artificial
neural networks (ANNs). The vector quantization-based approach [Morishima
et al., 1989] classifies the audio features into one of a number of classes. Each
class is then mapped onto a corresponding visual output. Though it is computa-
tionally efficient, the vector quantization approach often leads to discontinuous
mapping results. The Gaussian mixture approach [Rao and Chen, 1996] mod-
els the joint probability distribution of the audio-visual vectors as a Gaussian
mixture. Each Gaussian mixture component generates an optimal estimation
for a visual feature given an audio feature. The estimations are then nonlinearly
weighted to produce the final visual estimation. The Gaussian mixture approach
produces smoother results than the vector quantization approach. However,
neither of these two approaches considers phonetic context information, which
is very important for modeling mouth coarticulation during speech.
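Both per-frame mappings can be sketched as follows; the parameterization (an audio codebook with one stored visual vector per codeword, and a joint audio-visual Gaussian mixture split into audio/visual means and covariance blocks) is an illustrative assumption rather than the exact formulation of the cited papers:

```python
import numpy as np

def vq_map(audio_feat, codebook, visual_per_code):
    """Vector quantization: pick the nearest audio codeword and output the
    visual feature stored for it.  Cheap, but adjacent frames can snap to
    different codewords, which causes the discontinuities noted above."""
    idx = np.argmin(np.linalg.norm(codebook - audio_feat, axis=1))
    return visual_per_code[idx]

def gmm_map(audio_feat, weights, mean_a, mean_v, cov_aa, cov_va):
    """Gaussian-mixture regression on the joint audio-visual density.  Each
    component k yields an optimal (conditional-mean) visual estimate given the
    audio feature; the estimates are blended by the component posteriors, i.e.
    the nonlinear weighting mentioned above."""
    K, D = mean_a.shape
    log_post = np.empty(K)
    est = np.empty((K, mean_v.shape[1]))
    for k in range(K):
        diff = audio_feat - mean_a[k]
        inv = np.linalg.inv(cov_aa[k])
        # log of  w_k * N(audio; mean_a[k], cov_aa[k])
        log_post[k] = (np.log(weights[k]) - 0.5 * (diff @ inv @ diff
                       + np.linalg.slogdet(cov_aa[k])[1] + D * np.log(2 * np.pi)))
        # component-wise optimal visual estimate given the audio feature
        est[k] = mean_v[k] + cov_va[k] @ inv @ diff
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                              # p(component | audio)
    return post @ est                               # posterior-weighted blend
```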
Neural network based approaches try to find nonlinear audio-to-visual mappings.
Morishima and Harashima [Morishima and Harashima, 1991] trained a three