synthesis framework. In this section, we propose a local non-linear audio-to-
visual mapping based on ANNs. We first classify the audio features into groups.
Then, for each group, we train an ANN for audio-to-visual mapping. In this
way, the mapping can be more robust and accurate than simple classification-
based methods, such as VQ [Morishima et al., 1989] and GMM [Rao and Chen,
1996]. Our multi-ANN-based mapping is also more efficient to train than
methods using only a single ANN [Morishima and Yotsukura, 1999, Kshirsagar
and Magnenat-Thalmann, 2000, Massaro et al., 1999, Lavagetto, 1995].
5.2.1 Training data and feature extraction
We use the facial motion capture database (described in Section 2 of Chap-
ter 3), along with its audio track, to learn the audio-to-visual mapping. To reduce
the complexity of learning and make it more robust, the visual feature space
should be small. Thus, for this specific application, we use the holistic MUs
(Section 3 of Chapter 3) as the visual representation. For each 33 ms short-
time window, we calculate the MUPs as the visual features and twelve
Mel-frequency cepstrum coefficients (MFCCs) [Rabiner and Juang, 1993] as
the audio features. The audio feature vectors of frames t-3, ..., t+3
are concatenated in temporal order as the final audio feature vector of
frame t. Consequently, the audio feature vector of each audio frame has
eighty-four elements. The frames t-3, ..., t-1 and t+1, ..., t+3 define
the contextual information of the frame t.
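The construction of this contextual audio feature vector can be sketched as follows. The `context_feature` helper is an illustrative name; MFCC extraction itself is assumed to happen elsewhere, and the boundary handling (replicating the nearest valid frame at the start and end of the utterance) is an assumption of this sketch rather than something specified in the text.

```python
import numpy as np

def context_feature(mfccs, t, half_window=3):
    """Concatenate the 12-D MFCC vectors of frames t-3 .. t+3 into one
    84-element audio feature vector for frame t.

    `mfccs` is an (n_frames, 12) array of per-frame MFCCs.  Frame indices
    outside the signal are clipped to the nearest valid frame (a boundary
    convention assumed for this sketch).
    """
    n = len(mfccs)
    idx = [min(max(t + k, 0), n - 1)
           for k in range(-half_window, half_window + 1)]
    # 7 frames x 12 MFCCs = 84 elements
    return np.concatenate([mfccs[i] for i in idx])
```

For example, with a 10-frame utterance, `context_feature(mfccs, 3)` stacks frames 0 through 6 into a single 84-dimensional vector.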
5.2.2 Audio-to-visual mapping
We modify the approaches that train neural networks as the audio-to-visual
mapping [Hong et al., 2001b, Morishima and Yotsukura, 1999, Massaro et al.,
1999]. The training audio-visual data is divided into twenty-one groups
based on the audio feature of each data sample. The number twenty-one is
chosen heuristically based on the audio feature distribution of the training
database. In particular, one of the groups corresponds to silence, because
human beings are very sensitive to mouth movements when no sound is
generated. The other twenty groups are generated automatically using the
k-means algorithm. The audio features of each group are then modelled by a
Gaussian model. After that, for each audio-visual data group, a three-layer
perceptron is trained to map the audio features to the visual features.
In the estimation phase, we first classify an audio feature vector into the
audio feature group whose Gaussian model gives it the highest score. We then
use the corresponding neural network to map the audio feature vector to MUPs,
which can be used in Equation 3.1 to synthesize the facial shape. A triangular
averaging window is used to smooth the jerky mapping results.
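The estimation phase can be sketched as follows. The `GaussianGate` class and `triangular_smooth` helper are illustrative names; the per-group ANNs are represented as plain callables, and the half-width of the triangular window is an assumption of this sketch, since the text does not give it.

```python
import numpy as np

class GaussianGate:
    """Select the per-group ANN whose Gaussian model scores the input highest
    (a sketch of the classify-then-map estimation phase described above)."""

    def __init__(self, means, covs, nets):
        # means[k], covs[k]: Gaussian model of audio group k
        # nets[k]: the trained ANN for group k (any callable here)
        self.means, self.covs, self.nets = means, covs, nets

    def log_score(self, x, k):
        # Log-likelihood of x under group k's Gaussian (up to a constant)
        d = x - self.means[k]
        sign, logdet = np.linalg.slogdet(self.covs[k])
        return -0.5 * (logdet + d @ np.linalg.solve(self.covs[k], d))

    def map(self, x):
        # Classify x into the best-scoring group, then apply that group's ANN
        k = max(range(len(self.nets)), key=lambda k: self.log_score(x, k))
        return self.nets[k](x)

def triangular_smooth(seq, half=2):
    """Smooth a jerky MUP trajectory with a triangular averaging window
    (window half-width `half` is an assumed parameter)."""
    w = np.concatenate([np.arange(1, half + 2),
                        np.arange(half, 0, -1)]).astype(float)
    w /= w.sum()
    seq = np.asarray(seq, dtype=float)
    out = np.empty_like(seq)
    n = len(seq)
    for t in range(n):
        # Clip window indices at the sequence boundaries
        idx = [min(max(t + k, 0), n - 1) for k in range(-half, half + 1)]
        out[t] = w @ seq[idx]
    return out
```

In a full system, `nets[k]` would be the three-layer perceptron trained on group k's audio-visual data, and its output MUPs would be fed into Equation 3.1 after smoothing.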
For each group, eighty percent of the data is randomly selected for training.
The remaining data is used for testing. The maximum number of the hidden