which are modeled using a Gaussian model. Each key shape has a mean shape
and a covariance matrix. The key shape viseme models are then used in
key-frame-based face animation, such as text-driven animation. Four of the key
shapes and the largest components of their variances are shown in Figure 5.3.
They correspond to the phonemes: (a) M; (b) AA; (c) UH; and (d) EN. Because we
use only the relative ratios of the variance values, we normalize them to the
range [0,1].
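
As a minimal sketch of the modeling step described above, the following Python code estimates a mean shape and covariance matrix for one key shape and rescales the per-dimension variances to [0,1]. The function names, the input format (a list of shape vectors), and the choice of dividing by the largest variance are assumptions made for illustration; the text does not state how the normalization is carried out. Dividing by the maximum is used here because it preserves the ratios between variances.

```python
def fit_key_shape(samples):
    """Estimate the mean shape and covariance matrix of one key shape.

    samples: list of equal-length shape vectors (hypothetical input format).
    Returns (mean, cov), where cov is the unbiased sample covariance.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a covariance")
    dims = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dims)]
    centered = [[s[d] - mean[d] for d in range(dims)] for s in samples]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    return mean, cov


def normalize_variances(cov):
    """Scale the diagonal variances so that the largest becomes 1.

    Assumption: division by the maximum, which keeps the relative ratios.
    """
    variances = [cov[d][d] for d in range(len(cov))]
    peak = max(variances)
    if peak == 0:
        return [0.0] * len(variances)
    return [v / peak for v in variances]


# Toy data: three 2-D shape vectors for a single key shape.
mean, cov = fit_key_shape([[0.0, 0.0], [2.0, 1.0], [4.0, 2.0]])
relative = normalize_variances(cov)
```

In this toy example the first dimension varies four times as much as the second, and that ratio is unchanged after normalization.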
4. Offline Speech-driven Face Animation
When human speech is used in one-way communication, e.g. news broadcasting
over networks, using real speech in face animation is better than
using synthetic speech. Because the communication is one-way, the
audio-to-visual mapping can be done offline, i.e. the animation need not be
produced until the end of a batch of speech. The process of offline
speech-driven face animation is illustrated in Figure 5.4. An advantage of the
offline process is that the phoneme transcription and timing information can
be extracted accurately for animation purposes, given the text script of the
speech. Recognizing phoneme
