Digital Signal Processing Reference
In-Depth Information
speaker adaptation and normalization methods such as Maximum Likelihood
Linear Regression (MLLR), Vocal Tract Length Normalization (VTLN), and
cepstral mean and variance normalization. In addition, advanced language-
modeling strategies such as concept language models are incorporated into
the toolkit.
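
As an illustration of the simplest of the normalization methods listed above, the following is a minimal sketch of cepstral mean and variance normalization. It assumes the statistics are computed per utterance over a list of feature frames; the function name `cmvn` and the `eps` variance floor are hypothetical and not taken from the toolkit.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Scale every cepstral dimension to zero mean and unit variance.

    frames: list of equal-length lists, one list per analysis frame.
    eps:    assumed floor on the standard deviation (hypothetical value),
            so that a constant dimension does not cause division by zero.
    """
    if not frames:
        return []
    dims = list(zip(*frames))                     # one tuple per coefficient
    means = [fmean(d) for d in dims]
    stds = [max(pstdev(d), eps) for d in dims]    # population std, floored
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Two frames, two coefficients; the second coefficient is constant.
normalized = cmvn([[1.0, 10.0], [3.0, 10.0]])
```

A real front-end may instead estimate the statistics per speaker or over a sliding window; the per-utterance choice here is only an assumption made for the example.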
The training set includes 60 speakers balanced by age and gender, whereas
the test set comprises 50 speakers, again balanced by age and gender.
The word error rates (WER) and relative improvements of PMVDR with
respect to MFCC are summarized in Table 2-2.
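
For readers unfamiliar with the two metrics, the sketch below shows how a word error rate and a relative improvement are conventionally computed. The function names are hypothetical, and the sentences and numbers in the usage lines are made up for illustration; they are not values from Table 2-2.

```python
def word_errors(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    (Levenshtein distance) needed to turn the word list hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(ref, hyp) / len(ref)


def relative_improvement(wer_baseline, wer_new):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


# Illustrative inputs only: one deleted word and one substituted word.
example_wer = wer("turn the radio on".split(), "turn radio off".split())
# Illustrative baseline and new WER values, not results from the text.
example_gain = relative_improvement(10.0, 8.0)
```

Here the relative improvement is measured against the baseline WER, which is the usual reading of "relative improvement with respect to MFCC".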
The optimal settings for this task were found to be M = 24 and a
perceptual warping parameter close to the Bark scale. The 36.1% reduction
in error rate using PMVDR features is a strong indicator of the robustness
of these features in realistic noisy environments. We also tested these
features on a number of other tasks, including clean, telephone, and
stressed speech, and consistently obtained better results than with MFCCs.
Therefore, we conclude that PMVDR is a better acoustic front-end than MFCC
for ASR in car environments.
4.3.5 Integration of Vocal Tract Length Normalization (VTLN)
VTLN is a well-known method of speaker normalization in which a
customized linear warping function, one that scales the frequency axis by
a constant normalization factor, is applied in the frequency domain for
each speaker [43]. The normalization factor is a number that is generally
less than 1.0 for female speakers and more than 1.0 for male speakers, to
account for different average vocal tract lengths. It is determined by
exhaustive search as the value that maximizes the total likelihood of a
speaker's data, using specially trained models that contain only one
Gaussian per phoneme cluster in a decision-tree state-clustered HMM
setting. The VTLN integrated with PMVDR
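
The exhaustive search for the normalization factor described above can be sketched as follows. This is a toy illustration under loud assumptions, not the toolkit's implementation: each observation is a single frequency value already aligned to a phoneme cluster, each cluster is modeled by one one-dimensional Gaussian, the search grid of 0.80 to 1.20 in steps of 0.02 is hypothetical, and the Jacobian of the warping is ignored. All names are invented for the example.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def total_loglik(freqs, labels, models, factor):
    """Total log-likelihood of a speaker's data after scaling every
    frequency by factor. models maps a cluster label to (mean, var)."""
    return sum(gaussian_loglik(factor * f, *models[lab])
               for f, lab in zip(freqs, labels))


def search_warp_factor(freqs, labels, models, lo=0.80, hi=1.20, step=0.02):
    """Try every factor on the grid (assumed range and step) and return
    the one that maximizes the total likelihood."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda a: total_loglik(freqs, labels, models, a))


# Toy single-Gaussian models (mean in Hz, variance), purely illustrative.
models = {"a": (700.0, 100.0 ** 2), "i": (300.0, 80.0 ** 2)}
# Toy speaker whose frequencies are the model means stretched by 1/0.9.
freqs = [700.0 / 0.9, 300.0 / 0.9, 700.0 / 0.9]
labels = ["a", "i", "a"]
best = search_warp_factor(freqs, labels, models)
```

For this toy speaker the search selects the grid point at 0.9, the factor that maps the stretched frequencies back onto the model means. In a real system the likelihood would be computed on warped feature vectors against HMM state alignments rather than on raw frequencies.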