speaker. The text-dependent system, implemented in this project, is commonly
found in speaker verification systems in which a person's password is critical for
verifying his/her identity.
In the training phase, the feature vectors are used to create a model for each
speaker. During the testing phase, each test feature vector is scored against every
speaker model, and the resulting number indicates the degree of match with that
speaker's model. Doing this for the full set of test feature vectors yields a likelihood
score for each speaker's model. For the speaker ID problem, the feature vectors of
the test utterance are passed through all the speakers' models and the scores are
calculated; the model with the best score gives the speaker's identity (the decision
component).
This project uses MFCC for feature extraction, VQ for classification/training,
and the Euclidean distance between the test MFCC vectors and the trained code
vectors (from VQ) for speaker ID. Much of this project was implemented with
MATLAB [47].
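As an illustration of this decision step, the following MATLAB sketch scores a test
utterance against a set of previously trained VQ codebooks and selects the closest one.
It assumes the codebooks have already been built from the training vectors (for
example, with the LBG algorithm); the function and variable names (identifySpeaker,
testMFCC, codebooks) are illustrative and not the project's actual code.

function speakerID = identifySpeaker(testMFCC, codebooks)
% testMFCC  : L-by-D matrix, one MFCC vector per frame of the test utterance
% codebooks : cell array with one K-by-D VQ codebook per enrolled speaker
    nSpeakers = numel(codebooks);
    scores = zeros(1, nSpeakers);
    for s = 1:nSpeakers
        cb = codebooks{s};
        dmin = zeros(size(testMFCC, 1), 1);
        for i = 1:size(testMFCC, 1)
            % Euclidean distance from this test vector to every code vector
            d = sqrt(sum((cb - testMFCC(i, :)).^2, 2));
            dmin(i) = min(d);          % distance to the nearest code vector
        end
        scores(s) = mean(dmin);        % average distortion over the utterance
    end
    [~, speakerID] = min(scores);      % smallest distortion identifies the speaker
end

Because each score is an average distortion, a smaller value indicates a better match,
so the minimum rather than the maximum score is taken.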
Mel-Frequency Cepstrum Coefficients
MFCCs are based on the known variation of the human ear's critical bandwidths.
A Mel-frequency scale is used with a linear frequency spacing below 1000 Hz and
a logarithmic spacing above that level. The steps used to obtain the MFCCs follow.
1. Level detection. The start of the input speech signal is detected by comparing
the signal level against a pre-stored threshold value; once speech begins, the
signal is captured and passed on to the framing stage.
2. Frame blocking. The continuous speech signal is blocked into frames of N
samples, with adjacent frames separated by M samples (M < N). The first frame
consists of the first N samples; the second frame begins M samples after the
first frame and overlaps it by N - M samples. In this project each frame consists
of 256 samples of speech, and each subsequent frame starts 100 samples after the
previous one, so every frame overlaps the two frames that follow it. This
technique is called framing. The speech signal within a single frame is
considered to be stationary. (A MATLAB sketch of steps 2 through 5 appears
after this list.)
3. Windowing. After framing, windowing is applied to prevent spectral leakage.
A Hamming window with 256 coefficients is used.
4. Fast Fourier transform. The FFT converts each time-domain frame into the
frequency domain, yielding a complex-valued spectrum. Speech is a real signal,
but its FFT has both real and imaginary components.
5. Power spectrum calculation. The power spectrum is computed by summing the
squares of the real and imaginary parts of each FFT bin, which yields a
real-valued signal. The second half of the samples in each frame is discarded,
since it mirrors the first half (the speech signal being real).
6. Mel-frequency wrapping. Triangular filters are designed on the Mel-frequency
scale, forming a bank of filters that approximates the response of the human ear.
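The following MATLAB sketch illustrates steps 2 through 5 with the parameters
described above (256-sample frames, a 100-sample frame shift, and a 256-coefficient
Hamming window), together with the mel-scale mapping commonly used when designing
the filter bank in step 6. The function and variable names are illustrative, the
number of filters is assumed, and the mel formula 2595*log10(1 + f/700) is the
conventional one rather than a value quoted from the project.

function [P, hzPts] = melPowerSpectra(x, fs)
% x  : speech samples (vector), captured after level detection (step 1)
% fs : sampling rate in Hz
% P  : one power spectrum (first half only) per row, one row per frame
    N = 256;                        % frame length in samples (step 2)
    M = 100;                        % frame shift; adjacent frames overlap by N - M samples
    w = 0.54 - 0.46 * cos(2*pi*(0:N-1)'/(N-1));   % 256-coefficient Hamming window (step 3)
    nFrames = floor((length(x) - N) / M) + 1;
    P = zeros(nFrames, N/2 + 1);
    for k = 1:nFrames
        frame = x((k-1)*M + (1:N)); % frame blocking (step 2)
        X = fft(frame(:) .* w);     % FFT of the windowed frame (step 4)
        ps = real(X).^2 + imag(X).^2;   % power spectrum (step 5)
        P(k, :) = ps(1:N/2 + 1).';  % second half discarded: symmetric for a real signal
    end
    % Step 6 (sketch): filter edge frequencies for a triangular mel filter bank,
    % equally spaced on the mel scale between 0 Hz and fs/2.
    nFilters = 20;                  % assumed number of filters
    melMax   = 2595 * log10(1 + (fs/2)/700);
    melPts   = linspace(0, melMax, nFilters + 2);
    hzPts    = 700 * (10.^(melPts/2595) - 1);   % edge frequencies in Hz
    % Building the triangular filters and applying them to P is omitted here.
end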