The next section describes in more detail the most common feature extraction techniques. Additionally, it presents several feature space transformations, which use a large temporal context to improve performance. Moreover, a nonlinear transformation technique based on artificial neural networks (ANN) is explained in detail. Section 2.3 deals with acoustic modeling based on HMMs. It is shown how the outputs of the neural networks are used as emission probabilities, enhancing the HMM framework. This system is known as the hybrid HMM/ANN [Bourlard 94]. Furthermore, two well-known systems which use Gaussian Mixture Models (GMM) and discrete distributions for estimating emission probabilities are also presented, known as HMM/GMM and discrete HMM, respectively.
2.1 Feature Extraction
The first step in a speech processing system is usually to convert the signal from analog to digital form. In a typical speech decoder, a small window of the signal spanning approximately 30 ms is extracted and further analyzed. For the ISDN standard, which corresponds to a sampling rate of 8 kHz at 8 bit, 30 ms of speech is equivalent to 240 samples. In order to classify this small fraction of speech directly, a look-up table containing $(2^8)^{240} \approx 10^{578}$ different vectors and their probabilities of occurrence would be required. For a 16 kHz, 16 bit system, a look-up table with $(2^{16})^{480} \approx 10^{2312}$ entries would be necessary. This makes such a system intractable. For this reason, it is necessary to reduce redundancy and extract the relevant information from the speech signal. This task is performed by the feature extraction module.
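As a quick sanity check of these magnitudes, the following minimal Python sketch reproduces the table-size arithmetic above; the 30 ms frame length and the two sampling configurations are taken from the text, while the function name is an illustrative choice:

    from math import log10

    def lookup_table_size(sample_rate_hz, bits_per_sample, window_s=0.030):
        """Number of distinct quantized vectors for one analysis window,
        returned as (samples per window, log10 of the table size)."""
        samples = int(sample_rate_hz * window_s)   # samples per window
        levels = 2 ** bits_per_sample              # quantization levels
        return samples, samples * log10(levels)    # log10(levels ** samples)

    print(lookup_table_size(8000, 8))     # (240, ~578)  -> ISDN case
    print(lookup_table_size(16000, 16))   # (480, ~2312) -> 16 kHz / 16 bit case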
Common feature extraction techniques are mainly based on models of speech production. In such a model, an excitation signal $e_0(t)$, representing the air flow at the vocal cords, is fed into a linear time-varying filter $h(t)$, which represents the resonances of the vocal tract [Huang 01]. The speech signal $s(t)$ is obtained by convolution:

$$s(t) = e_0(t) * h(t) \qquad (2.1)$$
The excitation signal contains information such as voicing, pitch period
and amplitude. On the other hand, information representing the sound being
pronounced (phone) is contained on the characteristics of the filter. Therefore,
typical feature extraction methods deliver mainly the parameters of the filter,
ignoring the excitation signal.
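To make the source-filter picture concrete, the sketch below synthesizes a crude voiced sound by convolving a periodic impulse train (the excitation, carrying the pitch period) with a short decaying resonance standing in for the vocal tract filter, as in Eq. (2.1). The concrete values (100 Hz pitch, a single 500 Hz formant, the decay constant) are illustrative assumptions, not taken from the text:

    import numpy as np

    fs = 8000                              # sampling rate (Hz)
    t = np.arange(0, 0.2, 1 / fs)          # 200 ms of signal

    # Excitation e0(t): impulse train with a 100 Hz pitch (10 ms period)
    e0 = np.zeros_like(t)
    e0[:: fs // 100] = 1.0

    # Vocal tract filter h(t): one decaying resonance around 500 Hz
    # (a single formant; real vocal tracts exhibit several)
    n = np.arange(int(0.02 * fs))          # 20 ms impulse response
    h = np.exp(-n / (0.004 * fs)) * np.sin(2 * np.pi * 500 * n / fs)

    # Speech-like signal by Eq. (2.1): s(t) = e0(t) * h(t)
    s = np.convolve(e0, h)[: len(t)]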
In the following, three common feature extraction techniques are presented. They are based on speech production models, such as Linear Predictive Coding (LPC) [Atal 68, Makhoul 73], and on speech perception models, such as Mel-Frequency Cepstral Coefficients (MFCC) [Davis 89] and Perceptual Linear Prediction (PLP) [Hermansky 90]. Additionally, common techniques for exploiting larger temporal context are explained.
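As a preview of the LPC idea of recovering the filter parameters while discarding the excitation, the following hedged sketch estimates all-pole filter coefficients for one 30 ms frame from its autocorrelation, using the classical normal-equations formulation. The function name, model order, and synthetic test frame are assumptions for illustration only:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_coefficients(frame, order=10):
        """Estimate LPC coefficients a_1..a_p of an all-pole filter
        by solving the autocorrelation normal equations R a = r."""
        # Autocorrelation lags r[0..order]
        r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        # Symmetric Toeplitz system built from r[0..order-1]
        return solve_toeplitz(r[:order], r[1 : order + 1])

    # Example on a synthetic 30 ms frame at 8 kHz (240 samples)
    fs = 8000
    n = np.arange(int(0.030 * fs))
    frame = np.sin(2 * np.pi * 500 * n / fs) + 0.01 * np.random.randn(len(n))
    print(lpc_coefficients(frame, order=10))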