speaker. The text-dependent system, implemented in this project, is commonly
found in speaker verification systems in which a person's password is critical for
verifying his/her identity.
In the training phase, the feature vectors are used to create a model for each
speaker. During the testing phase, each test feature vector is scored against every
speaker model, and the resulting number indicates the degree of match with that
speaker's model. Doing this for the full set of test feature vectors yields a likelihood
score for each speaker's model. For the speaker ID problem, the feature vectors of
the test utterance are passed through all the speakers' models and the scores are
calculated; the model with the best score gives the speaker's identity (the decision
component).
This project uses MFCC for feature extraction, VQ for classification/training,
and the Euclidean distance between the test MFCC vectors and the trained code
vectors (from VQ) for speaker ID. Much of this project was implemented with
MATLAB [47].
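As an illustration of this decision step, the following MATLAB sketch scores a test
utterance against a set of previously trained VQ codebooks and selects the closest one.
It assumes the codebooks have already been built from the training vectors (for
example, with the LBG algorithm); the function and variable names (identifySpeaker,
testMFCC, codebooks) are illustrative and not the project's actual code.

function speakerID = identifySpeaker(testMFCC, codebooks)
% testMFCC  : L-by-D matrix, one MFCC vector per frame of the test utterance
% codebooks : cell array with one K-by-D VQ codebook per enrolled speaker
    nSpeakers = numel(codebooks);
    scores = zeros(1, nSpeakers);
    for s = 1:nSpeakers
        cb = codebooks{s};
        dmin = zeros(size(testMFCC, 1), 1);
        for i = 1:size(testMFCC, 1)
            % Euclidean distance from this test vector to every code vector
            d = sqrt(sum((cb - testMFCC(i, :)).^2, 2));
            dmin(i) = min(d);          % distance to the nearest code vector
        end
        scores(s) = mean(dmin);        % average distortion over the utterance
    end
    [~, speakerID] = min(scores);      % smallest distortion identifies the speaker
end

Because each score is an average distortion, a smaller value indicates a better match,
so the minimum rather than the maximum score is taken.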
Mel-Frequency Cepstrum Coefficients
MFCCs are based on the known variation of the human ear's critical bandwidths.
A Mel-frequency scale is used with a linear frequency spacing below 1000 Hz and
a logarithmic spacing above that level. The steps used to obtain the MFCCs follow.
1. Level detection. The start of the input speech signal is detected by comparing
the signal level against a pre-stored threshold value; once speech begins, the
signal is captured and passed on to the framing stage.
2. Frame blocking. The continuous speech signal is blocked into frames of N
samples, with adjacent frames separated by M samples (M < N). The first frame
consists of the first N samples; the second frame begins M samples after the
first frame and overlaps it by N - M samples. In this project each frame consists
of 256 samples of speech, and each subsequent frame starts 100 samples after the
previous one, so every frame overlaps the two frames that follow it. This
technique is called framing. The speech signal within a single frame is
considered to be stationary. (A MATLAB sketch of steps 2 through 5 appears
after this list.)
3. Windowing. After framing, windowing is applied to prevent spectral leakage.
A Hamming window with 256 coefficients is used.
4. Fast Fourier transform. The FFT converts each time-domain frame into the
frequency domain, yielding a complex-valued spectrum. Speech is a real signal,
but its FFT has both real and imaginary components.
5. Power spectrum calculation. The power spectrum is computed by summing the
squares of the real and imaginary parts of each FFT bin, which yields a
real-valued signal. The second half of the samples in each frame is discarded,
since it mirrors the first half (the speech signal being real).
6. Mel-frequency wrapping. Triangular filters are designed on the Mel-frequency
scale, forming a bank of filters that approximates the response of the human ear.
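The following MATLAB sketch illustrates steps 2 through 5 with the parameters
described above (256-sample frames, a 100-sample frame shift, and a 256-coefficient
Hamming window), together with the mel-scale mapping commonly used when designing
the filter bank in step 6. The function and variable names are illustrative, the
number of filters is assumed, and the mel formula 2595*log10(1 + f/700) is the
conventional one rather than a value quoted from the project.

function [P, hzPts] = melPowerSpectra(x, fs)
% x  : speech samples (vector), captured after level detection (step 1)
% fs : sampling rate in Hz
% P  : one power spectrum (first half only) per row, one row per frame
    N = 256;                        % frame length in samples (step 2)
    M = 100;                        % frame shift; adjacent frames overlap by N - M samples
    w = 0.54 - 0.46 * cos(2*pi*(0:N-1)'/(N-1));   % 256-coefficient Hamming window (step 3)
    nFrames = floor((length(x) - N) / M) + 1;
    P = zeros(nFrames, N/2 + 1);
    for k = 1:nFrames
        frame = x((k-1)*M + (1:N)); % frame blocking (step 2)
        X = fft(frame(:) .* w);     % FFT of the windowed frame (step 4)
        ps = real(X).^2 + imag(X).^2;   % power spectrum (step 5)
        P(k, :) = ps(1:N/2 + 1).';  % second half discarded: symmetric for a real signal
    end
    % Step 6 (sketch): filter edge frequencies for a triangular mel filter bank,
    % equally spaced on the mel scale between 0 Hz and fs/2.
    nFilters = 20;                  % assumed number of filters
    melMax   = 2595 * log10(1 + (fs/2)/700);
    melPts   = linspace(0, melMax, nFilters + 2);
    hzPts    = 700 * (10.^(melPts/2595) - 1);   % edge frequencies in Hz
    % Building the triangular filters and applying them to P is omitted here.
end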