views even improve performance for certain phonemes. On average, however,
frontal views are still the best. Finally, performance remains fairly
good when the distance from the synthetic talker to the perceiver is within
Other researchers have also reported that speechreading is robust when the
dynamics of visual speech are changed. IJsseldijk [IJsseldijk, 1992] found that
speechreading performance degraded little even when the temporal sequence was
slowed down by a factor of four. Williams et al. [Williams et al., 1997] reported
that the recognition rates of visemes degraded by less than 5%, even when the
sampling rate of visual speech was only 5 to 10 frames per second.
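The two temporal manipulations mentioned above, slowing playback down and lowering the visual sampling rate, can be illustrated with a minimal sketch. It is not the procedure used in the cited studies: the function names `subsample` and `slow_down` are hypothetical, frames are represented by arbitrary Python objects, and slow-down is approximated here by simple frame repetition rather than interpolation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(frames: Sequence[T], src_fps: float, dst_fps: float) -> List[T]:
    """Reduce a frame sequence recorded at src_fps to roughly dst_fps.

    Hypothetical helper: picks the nearest earlier source frame for each
    output time step (no temporal filtering or blending).
    """
    if src_fps <= 0 or dst_fps <= 0 or dst_fps > src_fps:
        raise ValueError("require 0 < dst_fps <= src_fps")
    duration = len(frames) / src_fps           # clip length in seconds
    n_out = int(round(duration * dst_fps))     # number of frames to keep
    last = len(frames) - 1
    return [frames[min(int(i * src_fps / dst_fps), last)] for i in range(n_out)]


def slow_down(frames: Sequence[T], factor: int) -> List[T]:
    """Slow playback by an integer factor by repeating each frame.

    Assumption: the display rate is unchanged, so repeating every frame
    `factor` times makes the sequence last `factor` times longer.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [frame for frame in frames for _ in range(factor)]


if __name__ == "__main__":
    one_second = list(range(30))               # stand-in for 30 video frames
    print(subsample(one_second, 30, 10))       # every third frame
    print(len(slow_down(one_second, 4)))       # four times as many frames
```

Under these assumptions, a one-second clip recorded at 30 frames per second keeps only 5 to 10 distinct frames after subsampling, which gives a sense of how sparse the visual signal was in the low-sampling-rate condition.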
It has also been reported that adding extra visual features can improve
speechreading after a few rounds of training. For example, in the reported
experiments, color bars were used to show whether the current sound is “nasal”,
“voicing” or “friction”. The bars were displayed beside the talking face. The
results showed that speechreading accuracy improved significantly after the
material had been presented to the perceivers five times. This implies that
perceivers can adapt to the extra visual features in a fairly short period of time.
Although “Baldi” has been shown to be a useful animation tool for research
and applications in speechreading, it has the following limitations:
Only macrostructure-level geometric motions are modeled in the current system.
Therefore, it loses important visual cues such as the shading changes
of the lips and of the area within the mouth. These visual cues are important for
perceiving lip gestures such as rounding and retraction, and the relative positions
of articulators such as the teeth and tongue. As a result, subtle phonemes (e.g.
/R/) are more easily confused. One possible way to deal with this problem
is to use significantly more detailed geometry and advanced rendering to
reproduce these subtle appearance changes. However, modeling the detailed
geometric motion of the mouth can be very expensive, because it involves complex
wrinkles, surface discontinuities and non-rigid self-collisions. Furthermore, the
mouth interior is difficult to measure. For real-time rendering, it is also
expensive to model the diverse material properties of the lips, teeth and tongue
and to perform ray tracing. Therefore, modeling these visual cues as texture
variation is more feasible for speech perception applications.
“Baldi” is a complex animation system with a great number of parameters.
For basic tongue movement alone, there are more than 30 parameters. If
a user wants to create customized face shapes for an application, doing so would
be difficult and time-consuming unless the user has in-depth knowledge of
“Baldi”. Therefore, it is desirable to develop systematic approaches
that simplify the use of the animation system. For example, one possible