positions to provide key frames for face animation. However, the phoneme
recognition rate of linear prediction is low. Video Rewrite [Bregler et al., 1997]
trains hidden Markov models (HMMs) [Rabiner, 1989] to automatically label
phonemes in both the training audio track and the new audio track. It models short-
term mouth co-articulation using triphones. The mouth images for a new audio
track are generated by reordering the mouth images in the training footage,
which requires a very large database. Video Rewrite is an offline approach
and needs large computational resources. Chen and Rao [Chen and Rao, 1998]
train HMMs to segment the audio feature vectors of isolated words into state
sequences. Given the trained HMMs, the state probability for each time stamp
is evaluated using the Viterbi algorithm. The estimated visual features of all
states can be weighted by the corresponding probabilities to obtain the final
visual features, which are used for lip animation. The advantage of using
intermediate representations is that one can draw on existing knowledge about
speech recognition and about the phoneme-to-visual mapping used in text-driven
animation. The disadvantage is that recognizing phonemes or words requires
sufficiently long context information, so these approaches cannot achieve
real-time speech-driven face animation.
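A minimal sketch of this probability-weighted estimation, assuming the trained HMM is summarized by its log emission scores, log transition matrix and log priors, plus one visual feature vector per state (the function names and the exact weighting are illustrative, not taken from the cited papers):

```python
import numpy as np

def viterbi_state_weights(log_emission, log_trans, log_prior):
    """Per-frame state weights derived from the Viterbi (max-product) trellis.
    log_emission: (T, S) array of log p(audio_t | state s) under the trained HMM."""
    T, S = log_emission.shape
    delta = np.empty((T, S))
    delta[0] = log_prior + log_emission[0]
    for t in range(1, T):
        # best log score of any state sequence ending in state s at time t
        delta[t] = np.max(delta[t - 1][:, None] + log_trans, axis=0) + log_emission[t]
    w = np.exp(delta - delta.max(axis=1, keepdims=True))   # stabilize before exp
    return w / w.sum(axis=1, keepdims=True)                # (T, S), rows sum to 1

def estimate_visual_features(log_emission, log_trans, log_prior, state_visual_feats):
    """Weight each state's visual feature vector (e.g. lip parameters) by its
    per-frame probability and sum, giving one visual feature per audio frame."""
    weights = viterbi_state_weights(log_emission, log_trans, log_prior)
    return weights @ state_visual_feats                     # (T, visual_dim)
```

In practice the per-state visual features would be learned from synchronized audio-visual training data, as in the papers cited above.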
Another HMM-based approach tries to directly map audio patterns to facial
motion trajectories. Voice Puppetry [Brand, 1999] uses an entropy minimiza-
tion algorithm to train HMMs for the audio to visual mapping. The mapping
estimates a probability distribution over the manifold of possible facial mo-
tions from the audio stream. An advantage of this approach is that it does not
require automatically recognizing speech as high-level meaningful symbols (e.g.,
phonemes, words), for which it is very difficult to obtain a high recognition
rate. Nevertheless, this approach is an offline method.
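One way to picture the output of such a mapping is as a per-frame distribution over facial configurations rather than a single point estimate. The sketch below assumes an already trained HMM whose states carry Gaussian visual outputs with diagonal covariance; it shows only the smoothing step, not the entropy-minimization training or the trajectory synthesis, and all names are illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def facial_motion_distribution(log_emission, log_trans, log_prior,
                               visual_mean, visual_var):
    """log_emission: (T, S) log p(audio_t | state s); visual_mean, visual_var:
    (S, D) per-state Gaussian visual outputs.  Returns a per-frame mean and
    variance of the facial configuration given the whole audio track, which is
    why this kind of mapping is inherently offline."""
    T, S = log_emission.shape
    alpha = np.empty((T, S))                       # forward pass
    alpha[0] = log_prior + log_emission[0]
    for t in range(1, T):
        alpha[t] = log_emission[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
    beta = np.zeros((T, S))                        # backward pass over the full utterance
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(log_trans + (log_emission[t + 1] + beta[t + 1])[None, :], axis=1)
    gamma = alpha + beta                           # unnormalized log posteriors
    gamma = np.exp(gamma - logsumexp(gamma, axis=1, keepdims=True))
    mean = gamma @ visual_mean                     # expected facial configuration per frame
    var = gamma @ (visual_var + visual_mean ** 2) - mean ** 2   # total variance per frame
    return mean, var
```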
Other approaches attempt to generate instantaneous lip shapes directly from
each audio frame using vector quantization, Gaussian mixture models, or artificial
neural networks (ANNs). The vector quantization-based approach [Morishima
et al., 1989] classifies the audio features into one of a number of classes. Each
class is then mapped onto a corresponding visual output. Though it is computa-
tionally efficient, the vector quantization approach often leads to discontinuous
mapping results. The Gaussian mixture approach [Rao and Chen, 1996] mod-
els the joint probability distribution of the audio-visual vectors as a Gaussian
mixture. Each Gaussian mixture component generates an optimal estimation
for a visual feature given an audio feature. The estimations are then nonlinearly
weighted to produce the final visual estimation. The Gaussian mixture approach
produces smoother results than the vector quantization approach. However,
neither of these two approaches considers phonetic context information, which
is very important for modeling mouth coarticulation during speech.
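Both per-frame mappings can be sketched as follows; the parameterization (an audio codebook with one stored visual vector per codeword, and a joint audio-visual Gaussian mixture split into audio/visual means and covariance blocks) is an illustrative assumption rather than the exact formulation of the cited papers:

```python
import numpy as np

def vq_map(audio_feat, codebook, visual_per_code):
    """Vector quantization: pick the nearest audio codeword and output the
    visual feature stored for it.  Cheap, but adjacent frames can snap to
    different codewords, which causes the discontinuities noted above."""
    idx = np.argmin(np.linalg.norm(codebook - audio_feat, axis=1))
    return visual_per_code[idx]

def gmm_map(audio_feat, weights, mean_a, mean_v, cov_aa, cov_va):
    """Gaussian-mixture regression on the joint audio-visual density.  Each
    component k yields an optimal (conditional-mean) visual estimate given the
    audio feature; the estimates are blended by the component posteriors, i.e.
    the nonlinear weighting mentioned above."""
    K, D = mean_a.shape
    log_post = np.empty(K)
    est = np.empty((K, mean_v.shape[1]))
    for k in range(K):
        diff = audio_feat - mean_a[k]
        inv = np.linalg.inv(cov_aa[k])
        # log of  w_k * N(audio; mean_a[k], cov_aa[k])
        log_post[k] = (np.log(weights[k]) - 0.5 * (diff @ inv @ diff
                       + np.linalg.slogdet(cov_aa[k])[1] + D * np.log(2 * np.pi)))
        # component-wise optimal visual estimate given the audio feature
        est[k] = mean_v[k] + cov_va[k] @ inv @ diff
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                              # p(component | audio)
    return post @ est                               # posterior-weighted blend
```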
Neural network based approaches try to find nonlinear audio-to-visual mappings.
Morishima and Harashima [Morishima and Harashima, 1991] trained a three