Digital Signal Processing Reference
this study was collected under carefully controlled driving conditions, i.e.,
combinations of three car speeds (idle, driving in a city area and driving
on an expressway) and five car conditions (fan on (hi/lo), CD player
on, open window, and normal driving condition). For this part of the
corpus, 50 isolated word utterances of 20 speakers were recorded under
all combinations of driving speeds and vehicular conditions.
5. EXPERIMENTAL EVALUATIONS
5.1 Experimental Setup
Speech signals used in the experiments were digitized at 16 bits with a
sampling frequency of 16 kHz. For the spectral analysis, 24-channel
mel-filterbank (MFB) analysis is performed by applying triangular windows
to the FFT spectrum of 25-ms-long windowed speech. This basic
analysis is realized through the standard HTK MFB analysis [8]. The
regression analysis is performed on the logarithm of the MFB output. Since the
power of the in-car noise signal is concentrated in the lower frequency
region, the regression analysis is performed only for the range of 250 Hz
to 8 kHz, i.e., on the spectral channels of the MFB within that range. A DCT
is then applied to convert the log-MFB feature vector into the MFCC vector
for the speech recognition experiments.
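The last two steps of this front end (restricting the analysis to a frequency band and converting log-MFB values to MFCCs) can be illustrated with a minimal sketch. It assumes filters spaced uniformly on the HTK mel scale between 0 Hz and 8 kHz and the DCT form given in the HTK Book; the function names, the filterbank limits, and the choice of 12 cepstral coefficients are illustrative assumptions, not values taken from the experiments described here.

```python
import math


def hz_to_mel(f_hz):
    """HTK mel scale: Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_channels=24, f_low=0.0, f_high=8000.0):
    """Center frequencies (Hz) of triangular filters equally spaced in mel.

    f_low and f_high are assumed filterbank limits (0 Hz to Nyquist).
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (num_channels + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(num_channels)]


def channels_in_band(centers, f_min=250.0, f_max=8000.0):
    """1-based indices of channels whose center lies in [f_min, f_max].

    Selecting by center frequency is an assumption; the channel indices
    actually used in the experiments are not reproduced here.
    """
    return [k + 1 for k, fc in enumerate(centers) if f_min <= fc <= f_max]


def log_mfb_to_mfcc(log_mfb, num_ceps=12):
    """DCT of a log-MFB vector, in the form used by the HTK Book:

    c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi * i / N * (j - 0.5)),
    for i = 1..num_ceps (num_ceps=12 is an assumed value).
    """
    n = len(log_mfb)
    scale = math.sqrt(2.0 / n)
    return [
        scale * sum(m * math.cos(math.pi * i * (j + 0.5) / n)
                    for j, m in enumerate(log_mfb))
        for i in range(1, num_ceps + 1)
    ]


if __name__ == "__main__":
    centers = mel_center_frequencies()
    print("channels kept for regression:", channels_in_band(centers))
    # A synthetic log-MFB vector, used only to exercise the DCT.
    example = [math.log(1.0 + k) for k in range(24)]
    print("MFCC:", [round(c, 3) for c in log_mfb_to_mfcc(example)])
```

Because the DCT is linear, any regression applied to the selected log-MFB channels carries over to the cepstral domain as a linear operation on the same channels.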
Three different HMMs are trained:
a close-talking HMM, trained using the close-talking microphone speech;
a distant-microphone HMM, trained using the speech at the nearest
distant microphone; and
an MRLS HMM, trained using the MRLS results.
The regression weights optimized for each training sentence are used for
generating the training data of MRLS HMM.
The structure of the three HMMs is fixed, i.e., three-state triphones
based on 43 phonemes that share 1000 states; each state has a 16-component
Gaussian mixture distribution; and the feature vector is a 25-dimensional
vector. The total number of training sentences is about 8,000, of which
2,000 were uttered while driving and 6,000 in an idling car.
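To give a sense of the size of each acoustic model, the number of output-distribution parameters implied by this structure can be estimated with a short calculation. The sketch assumes diagonal covariance matrices, which the text does not state, and ignores transition probabilities; the function name is illustrative.

```python
def gmm_hmm_parameter_count(num_states=1000, num_mixtures=16, dim=25):
    """Count output-distribution parameters of a tied-state GMM-HMM.

    Assumes diagonal covariances (not stated in the text): each Gaussian
    has `dim` means, `dim` variances, and one mixture weight.
    """
    per_gaussian = dim + dim + 1
    return num_states * num_mixtures * per_gaussian


print(gmm_hmm_parameter_count())  # 816000
```

Under these assumptions each model has roughly 0.8 million parameters to be estimated from about 8,000 training sentences.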