Digital Signal Processing Reference
In-Depth Information
requires two consecutive warpings: one for VTLN and one for the incorporation
of perceptual considerations.
In the PMVDR formulation, we used a first-order system to perform
perceptual warping. This warping function can also be used for speaker
normalization, in which the system parameter is adjusted for each speaker [44].
Rather than performing two consecutive warpings, we could simply change
the degree of warping (i.e., the system parameter) specifically for every
speaker. This enables us to perform both VTLN and perceptual warping using a
single warp.
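As a rough illustration of a single-parameter warp, the sketch below uses the standard frequency mapping of a first-order all-pass system. It assumes the first-order system mentioned above is of this all-pass form. The names allpass_warp and warped_grid, the symbol alpha, and the two per-speaker values are hypothetical and chosen only for illustration; they are not taken from the text.

```python
import math


def allpass_warp(omega, alpha):
    """Map a normalized frequency omega (radians, 0..pi) to its warped value.

    Uses the phase response of a first-order all-pass system with
    parameter alpha (|alpha| < 1). alpha = 0 leaves the axis unchanged.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def warped_grid(num_points, alpha):
    """Warp num_points equally spaced frequencies covering 0..pi."""
    step = math.pi / (num_points - 1)
    return [allpass_warp(k * step, alpha) for k in range(num_points)]


# Hypothetical per-speaker parameters (illustrative values only).
speaker_alpha = {"speaker_a": 0.50, "speaker_b": 0.60}

# One warp per speaker covers both the perceptual and the speaker-specific part.
grids = {name: warped_grid(5, a) for name, a in speaker_alpha.items()}
```

Because the mapping fixes the endpoints 0 and pi and is monotonic for |alpha| < 1, changing alpha per speaker only redistributes resolution along the frequency axis, which is why one warp can stand in for two consecutive ones.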
The VTLN-normalizing warp parameter can be estimated in the same way as in
conventional VTLN.
Such an integration of VTLN into the PMVDR framework yields an acoustic
front-end with built-in speaker normalization (BISN). Table 2-3 summarizes
our results with the conventional VTLN and BISN in the PMVDR
framework.
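In the VTLN literature, the per-speaker warp factor is commonly chosen by a grid search that maximizes the likelihood of the speaker's utterances under the acoustic model. Assuming a similar search is used for the BISN warp parameter, a minimal sketch could look as follows. The function estimate_warp, the scoring callback log_likelihood, the candidate grid, and the toy scorer are all hypothetical stand-ins; a real system would score PMVDR features extracted with each candidate against its own acoustic model.

```python
def estimate_warp(utterances, log_likelihood, candidates):
    """Return the candidate warp parameter with the highest total score.

    log_likelihood(utterance, alpha) is a caller-supplied function
    (hypothetical here) that scores one utterance for one warp value.
    """
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        score = sum(log_likelihood(u, alpha) for u in utterances)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha


def toy_score(utterance, alpha):
    """Toy stand-in scorer that peaks at alpha = 0.55 (illustration only)."""
    return -(alpha - 0.55) ** 2


# Candidate grid 0.50, 0.51, ..., 0.60 (illustrative range).
candidates = [0.50 + 0.01 * k for k in range(11)]
best = estimate_warp(["utt1", "utt2"], toy_score, candidates)
```

The search cost grows linearly with the number of candidates, since each candidate requires re-extracting features and re-scoring the speaker's data.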
BISN yields results comparable to VTLN with a less complex front-end
structure, and is therefore a practical speaker normalization method for ASR.
The total WER reduction compared to the MFCC baseline is around 50%
using PMVDR with BISN. The average warping factor for females was
and for males . Females require less warping than males due
to their shorter vocal tract length, which is consistent with the VTLN literature.
Finally, the experiments here were conducted on raw speech obtained from
one microphone in our array. Using the array processing techniques discussed in
Sec. 4.1 and integrating the noise information obtained using the techniques
discussed in Sec. 4.2 will boost performance considerably when used in
cascade with the robust acoustic front-end (PMVDR) and built-in speaker
normalization (BISN). It is also feasible to apply noise
adaptation techniques such as Jacobian adaptation and speaker adaptation
techniques such as MLLR to further improve performance [28]. Front-end
speech enhancement schemes applied before acoustic feature extraction were also
found to be useful in improving performance [28].