Digital Signal Processing Reference
[Block diagram: input speech (analog) → sampling (digital) → framing/blocking → windowing → FFT (conversion to frequency domain) → computing mel-frequency coefficients → computing code vector using VQ → code word]
FIGURE 10.53. Steps for speaker recognition implementation.
(c) Use FFT to convert each frame from time to frequency domain.
(d) Convert the resulting spectrum into a Mel-frequency scale.
(e) Convert the Mel spectrum back to the time domain.
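Steps (c) through (e) can be sketched for a single windowed frame as follows. This is a minimal illustration, not the book's implementation: the mel mapping 2595·log10(1 + f/700) is one common convention, and the use of a logarithm followed by a DCT for step (e) is an assumption, since the text says only that the mel spectrum is converted back to the time domain. The function names, the filter count, and the number of coefficients are hypothetical.

```python
import cmath
import math

def hz_to_mel(f):
    # Common mel-scale convention (an assumption; the text gives no formula).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step (c): naive DFT of one frame; returns power for bins 0..N/2."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # peak and falling edge
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def mfcc_from_frame(frame, sample_rate, num_filters=20, num_coeffs=12):
    spec = power_spectrum(frame)                               # step (c)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Step (d): mel-band energies, floored to avoid log(0).
    log_mel = [math.log(max(sum(w * p for w, p in zip(filt, spec)), 1e-12))
               for filt in bank]
    # Step (e), assumed here to be a DCT-II of the log energies (c0 skipped).
    return [sum(e * math.cos(math.pi * n * (j + 0.5) / num_filters)
                for j, e in enumerate(log_mel))
            for n in range(1, num_coeffs + 1)]

# Usage: a Hamming-windowed 440 Hz tone sampled at 8 kHz (illustrative values).
sample_rate = 8000
frame_len = 256
frame = [(0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1)))
         * math.sin(2 * math.pi * 440 * t / sample_rate)
         for t in range(frame_len)]
coeffs = mfcc_from_frame(frame, sample_rate)
print(len(coeffs))  # one coefficient per requested index: 12
```

The naive DFT is used only to keep the sketch free of dependencies; a real implementation would call an FFT routine, as step (c) states.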
2. Classification consists of a model for each speaker and the decision logic
necessary to render a decision. This module classifies the extracted features
according to the individual speakers whose voices have been stored. The
recorded voice patterns of the speakers are used to derive a classification
algorithm; here, vector quantization (VQ) is used. VQ is the process of mapping
vectors from a large vector space to a finite number of regions in that space.
Each region is called a cluster and can be represented by its center, called a
codeword. The collection of all codewords is a codebook. In the training phase,
a speaker-specific VQ codebook is generated for each known speaker by
clustering his/her training acoustic vectors. The distance from a vector to the
closest codeword of a codebook is called a VQ distortion. In the recognition
phase, an input utterance of an unknown voice is vector-quantized using each
trained codebook, and the total VQ distortion is computed. The speaker
corresponding to the VQ codebook with the smallest total distortion is
identified as the speaker of the utterance.
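The training and recognition phases described in item 2 can be sketched as below. The text does not name the clustering algorithm, so plain k-means with a deterministic initialization is assumed here for illustration; the function names, the speaker labels, and the toy two-dimensional vectors are all hypothetical stand-ins for real acoustic vectors.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))

def train_codebook(vectors, size, iterations=20):
    """Training phase: cluster one speaker's vectors into `size` codewords.

    Plain k-means is an assumption; the source says only "clustering".
    """
    step = max(1, len(vectors) // size)
    codebook = [list(v) for v in vectors[::step][:size]]  # evenly spaced seeds
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[i] = [sum(c) / len(members) for c in zip(*members)]
    return codebook

def total_distortion(vectors, codebook):
    """Sum over the utterance of each vector's distance to its closest codeword."""
    return sum(min(dist(v, c) for c in codebook) for v in vectors)

def identify(vectors, codebooks):
    """Recognition phase: the speaker whose codebook gives the smallest total distortion."""
    return min(codebooks, key=lambda name: total_distortion(vectors, codebooks[name]))

# Usage with toy data (hypothetical speakers and vectors).
alice_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]]
bob_train = [[5.0, 5.1], [5.2, 4.9], [6.0, 6.2], [6.1, 5.9]]
codebooks = {
    "alice": train_codebook(alice_train, size=2),
    "bob": train_codebook(bob_train, size=2),
}
unknown = [[0.1, 0.0], [1.1, 1.0]]
print(identify(unknown, codebooks))  # prints: alice
```

Because the decision is an argmin over all trained codebooks, this sketch always returns one of the known speakers.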
Speaker recognition can be divided into identification and verification.
Speaker identification is the process of determining which registered speaker
provides a given utterance. Speaker verification is the process of accepting or
rejecting the identity claim of a speaker. This project implements only the
speaker identification (ID) process. The speaker ID process can be further
subdivided into closed set and open set. The closed set speaker ID problem
refers to the case where the speaker is known a priori to belong to a set of M
speakers. In the open set case, the speaker may be outside the set and, hence,
a "none of the above" category is necessary. In this project, only the simpler
closed set speaker ID is used.
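The difference between the two cases shows up only in the final decision step. The sketch below assumes that a total distortion has already been computed for each of the M registered speakers. The rejection threshold used for the open set case is an assumption made for illustration; the text states only that a "none of the above" category is needed, not how it is decided. All names and numbers are hypothetical.

```python
def closed_set_decision(distortions):
    """Closed set: always return the registered speaker with the smallest distortion."""
    return min(distortions, key=distortions.get)

def open_set_decision(distortions, threshold):
    """Open set: return None ("none of the above") when even the best match is poor.

    The fixed threshold is a hypothetical rejection rule, not taken from the source.
    """
    best = min(distortions, key=distortions.get)
    return best if distortions[best] <= threshold else None

# Usage with made-up distortion values for M = 3 speakers.
distortions = {"speaker_1": 4.2, "speaker_2": 1.3, "speaker_3": 7.9}
print(closed_set_decision(distortions))        # prints: speaker_2
print(open_set_decision(distortions, 2.0))     # prints: speaker_2
print(open_set_decision(distortions, 1.0))     # prints: None
```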
Speaker ID systems can be either text-independent or text-dependent. In the
text-independent case, there is no restriction on the sentence or phrase to be
spoken, whereas in the text-dependent case, the input sentence or phrase is
indexed for each