can be used to extract visual features for audio-visual speech recognition and
emotion recognition (Cohen et al., 2002). In medical applications related to facial
motion disorders, such as facial paralysis, visual cues are important for both
diagnosis and treatment; the facial motion analysis method can therefore be used
as a diagnostic tool, as in Wachtman et al. (2001). Compared to other 3D non-rigid
facial motion tracking approaches that use a single camera, our tracking system
has the following features: (1) the deformation space is learned automatically from
data, which avoids manual adjustment; (2) it runs in real time, so it can be used in
real-time applications; and (3) it can recover from temporary loss of tracking by
incorporating a template-matching-based face detection module.
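As a rough illustration of the recovery mechanism in (3), the following Python sketch uses OpenCV template matching to re-locate the face when tracking is lost. The function name, threshold value, and matching score are illustrative assumptions, not the exact detection module described here.

```python
import cv2

def redetect_face(frame_gray, face_template, threshold=0.7):
    """Re-locate the face by template matching after a temporary loss of tracking.
    Returns the top-left corner of the best match, or None if the match is weak."""
    result = cv2.matchTemplate(frame_gray, face_template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_loc if max_val >= threshold else None
```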
Real-Time Speech-Driven 3D Face Animation
In this section, we present the real-time speech-driven 3D face animation
algorithm in our 3D face analysis and synthesis framework. We use the facial
motion capture database used for learning MUs, along with its audio track, for
learning the audio-to-visual mapping. For each 33 ms window, we calculate the
holistic MUPs as the visual features and 12 Mel-frequency cepstrum coefficients
(MFCCs) (Rabiner & Juang, 1993) as the audio features. To include contextual
information, the audio feature vectors of frames t-3, t-2, t-1, t, t+1, t+2, and t+3
are concatenated as the final audio feature vector of frame t.
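As a rough illustration of this feature extraction step, the Python sketch below computes 12 MFCCs per video-rate frame and stacks a plus/minus three-frame context window. The use of librosa, the 16 kHz sample rate, and the 30 fps frame alignment are assumptions for illustration; the authors' exact signal-processing pipeline may differ.

```python
import numpy as np
import librosa

def audio_features(wav_path, fps=30, n_mfcc=12, context=3):
    """Compute 12 MFCCs per ~33 ms frame and stack a +/-3 frame context window."""
    y, sr = librosa.load(wav_path, sr=16000)          # assumed sample rate
    hop = int(round(sr / fps))                        # one hop per ~33 ms frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop).T     # shape: (frames, 12)
    # Pad the ends so every frame has a full context window.
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    # Concatenate frames t-3 .. t+3 into one 84-dimensional vector per frame t.
    feats = np.hstack([padded[i:i + len(mfcc)] for i in range(2 * context + 1)])
    return feats                                      # shape: (frames, 7 * 12)
```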
The training audio-visual data is divided into 21 groups based on the audio feature
of each data sample. The number 21 is chosen heuristically based on the audio
feature distribution of the training database. One of the groups corresponds to
silence; the other 20 groups are generated automatically using the k-means
algorithm. The audio features of each group are then modeled by a Gaussian
model, and a three-layer perceptron is trained on each audio-visual data group to
map the audio features to the visual features.
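A minimal sketch of this training stage is given below, assuming scikit-learn: KMeans for the 20 non-silence groups, a single Gaussian (mean and covariance) per group, and MLPRegressor as the three-layer perceptron. The silence_mask input (silence frames identified beforehand, e.g., by an energy threshold) and the hyperparameters are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

def train_groups(audio_feats, visual_feats, silence_mask, n_groups=20, hidden=10):
    """Cluster non-silence frames into 20 groups, then fit one Gaussian
    and one three-layer perceptron per group."""
    labels = np.full(len(audio_feats), -1)                 # group -1 = silence
    km = KMeans(n_clusters=n_groups, n_init=10)
    labels[~silence_mask] = km.fit_predict(audio_feats[~silence_mask])

    groups = {}
    for g in np.unique(labels):
        A = audio_feats[labels == g]
        V = visual_feats[labels == g]
        # Single Gaussian over the group's audio features (mean and covariance).
        gauss = (A.mean(axis=0), np.cov(A, rowvar=False))
        # Three-layer perceptron: input layer, one hidden layer, output layer.
        mlp = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000).fit(A, V)
        groups[g] = {'gauss': gauss, 'mlp': mlp}
    return groups
```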
At the estimation phase, we first classify an audio feature vector into the audio
feature group whose Gaussian model gives it the highest score. We then use the
corresponding neural network to map the audio feature vector to MUPs, which can
be used in equation (1) to synthesize the facial shape. A triangular averaging
window is applied to smooth the jerky mapping results. For each group, 80% of the
data is randomly selected for training and 20% for testing. The maximum and
minimum numbers of hidden neurons are 10 and 4, respectively. A typical
estimation result is shown in Figure 10, where the
horizontal axes represent time and the vertical axes represent the MUPs.
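Continuing the sketch above (and reusing its assumed group dictionary), the estimation phase could look like the following: select the group whose Gaussian scores the incoming audio vector highest, run that group's perceptron, and smooth the resulting MUP trajectories with a triangular averaging window. The window radius and score computation are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_mups(audio_feats, groups, smooth_radius=2):
    """Map audio feature vectors to MUPs and smooth them over time."""
    mups = []
    for a in audio_feats:
        # Select the group whose Gaussian gives the highest score.
        best = max(groups, key=lambda g: multivariate_normal.logpdf(
            a, mean=groups[g]['gauss'][0], cov=groups[g]['gauss'][1],
            allow_singular=True))
        mups.append(groups[best]['mlp'].predict(a[None, :])[0])
    mups = np.asarray(mups)

    # Triangular averaging window over time to smooth jerky estimates.
    w = np.concatenate([np.arange(1, smooth_radius + 2),
                        np.arange(smooth_radius, 0, -1)]).astype(float)
    w /= w.sum()
    smoothed = np.vstack([np.convolve(mups[:, d], w, mode='same')
                          for d in range(mups.shape[1])]).T
    return smoothed
```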