signal x(n) contained only music, which was randomly chosen from six files of varying musical styles. In the other case, x(n) consisted of speech files, randomly chosen from 96 speech files taken from the English subset of the NTT-AT Multilingual database; the files were spoken by four female and four male speakers. In
addition, a beep signal at 2.1-2.4 kHz was added to all loudspeaker source signals
0.25 s after the virtual PTS event. In the baseline reference case, however, no beep was
added because we assumed strict muting of the loudspeakers. To obtain the simulated
echo signals d(n), the loudspeaker source signals were convolved with a time-invariant LEM system impulse response measured in a Volkswagen Passat.
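Under the time-invariance assumption, generating the echo component reduces to a linear convolution. A minimal numpy sketch (the short impulse response used in the test is purely illustrative, standing in for the measured Passat LEM response):

```python
import numpy as np

def simulate_echo(x, h):
    """Simulate the echo component d(n) by convolving the loudspeaker
    source signal x(n) with an LEM impulse response h(n).

    The result is truncated to the input length, as for a causal
    time-invariant filter applied to the whole signal."""
    return np.convolve(x, h)[:len(x)]
```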
For simulating the background noise component n(n), one of four vehicle noise files, recorded in two different cars at two different velocities, was selected at random.
Noise and echo components were added to the test speech signals at different
signal-to-noise ratios (SNRs) and signal-to-echo ratios (SERs), respectively. By
this means, we were able to investigate the system behavior under varying disturbance conditions. As in [3], we performed the SNR and SER adjustment based on the active speech level (ASL) according to ITU-T Recommendation P.56 [8].
However, all signals were subject to a 50-7,000 Hz band-pass filter prior to the
P.56 level measurement to eliminate speech-irrelevant frequency components.
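The SNR adjustment can be sketched as follows. Note that the active-level estimate below is only a crude frame-energy proxy for the ITU-T P.56 active speech level (P.56 specifies an envelope-based procedure), and the 50-7,000 Hz pre-filter is omitted:

```python
import numpy as np

def scale_to_snr(speech, noise, target_snr_db, thresh_db=-40.0):
    """Scale `noise` so that the speech-to-noise ratio equals
    `target_snr_db`, measuring the speech level only over active
    frames -- a crude stand-in for the ITU-T P.56 active speech level.

    Assumes 16 kHz sampling, i.e. 160-sample (10 ms) frames."""
    frame = 160
    n_frames = len(speech) // frame
    frames = speech[:n_frames * frame].reshape(n_frames, frame)
    powers = np.mean(frames ** 2, axis=1)
    # Frames within `thresh_db` of the loudest frame count as active.
    active = powers > powers.max() * 10.0 ** (thresh_db / 10.0)
    speech_pow = powers[active].mean()
    noise_pow = np.mean(noise ** 2)
    # Gain that makes speech_pow / (gain^2 * noise_pow) hit the target SNR.
    gain = np.sqrt(speech_pow / (noise_pow * 10.0 ** (target_snr_db / 10.0)))
    return noise * gain
```

The same scaling applies to the echo component for SER adjustment, with the echo signal in place of the noise.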
7.5.3 Automatic Speech Recognition Setup
The ASR experiments were conducted using a feature extraction frontend for mel-frequency cepstral coefficients (MFCCs) and a set of hidden Markov models
(HMMs) trained on American English connected-digit strings.
The frontend settings were as follows: a pre-emphasis coefficient of 0.9, a frame shift of 10 ms, a frame length of 25.6 ms, a Hamming window, and a 512-point FFT. No noise
reduction was applied in the frontend, but the HMMs were trained on recordings
containing slight vehicle noise. For each frame, twelve MFCCs (without the zeroth
coefficient) were computed using 26 uniform, triangular filterbank channels on the
mel scale and ignoring frequencies below 50 Hz and above 7 kHz. A log energy
coefficient as well as first and second order time derivatives were appended. Cepstral
mean normalization was performed separately for each utterance.
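A self-contained numpy sketch of such a frontend, assuming a 16 kHz sampling rate (not stated above) and omitting the delta and delta-delta coefficients:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, fs, f_lo, f_hi):
    # Triangular filters, uniformly spaced on the mel scale.
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_features(x, fs=16000):
    # Pre-emphasis with coefficient 0.9.
    x = np.append(x[0], x[1:] - 0.9 * x[:-1])
    # 10 ms frame shift, 25.6 ms frame length, 512-point FFT.
    shift, length, n_fft = int(0.010 * fs), int(0.0256 * fs), 512
    n_frames = 1 + (len(x) - length) // shift
    idx = np.arange(length)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(length)
    # Log frame energy, appended as the 13th coefficient below.
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 26 triangular channels between 50 Hz and 7 kHz.
    fb = mel_filterbank(26, n_fft, fs, 50.0, 7000.0)
    log_mel = np.log(power @ fb.T + 1e-12)
    # DCT-II, keeping c1..c12 (the zeroth coefficient is dropped).
    k = np.arange(1, 13)[:, None]
    n = np.arange(26)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / 52.0)
    feats = np.hstack([log_mel @ dct.T, log_e[:, None]])
    # Per-utterance cepstral mean normalization.
    return feats - feats.mean(axis=0)
```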
For acoustic modeling, we employed 42 tied-state HMMs representing acoustic-phonetic units, further differentiated by their immediate left and right context via within-word triphone modeling. Each HMM consisted of one to three emitting
states, each of which was assigned a continuous output probability density function
modeled by a Gaussian mixture model with 32 components. Diagonal covariance matrices were assumed. The training material consisted of 3,325 utterances
spoken by 245 speakers and was taken from the connected-digit corpus of the
US-English SpeechDat-Car database [ 7 ]; to ensure speaker independence, two
disjoint sets of speakers were used for training and testing.
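The per-state output density described above (a diagonal-covariance Gaussian mixture) can be evaluated as sketched below; the parameter arrays in the test are hypothetical placeholders, not the trained models:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector `x` (shape (d,)) under a
    diagonal-covariance Gaussian mixture, as used for each tied
    HMM state's output density; `weights` has shape (K,), `means`
    and `variances` shape (K, d)."""
    d = x.shape[0]
    # Per-component log density, exploiting diagonal covariances.
    log_det = np.sum(np.log(variances), axis=1)
    maha = np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) - 0.5 * (d * np.log(2 * np.pi) + log_det + maha)
    # Log-sum-exp over mixture components for numerical stability.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))
```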