The next section describes in more detail the most common feature extraction techniques. Additionally, it presents several feature space transformations, which use a large temporal context to improve performance. Moreover, a nonlinear transformation technique based on artificial neural networks (ANN) is explained in detail. Section 2.3 deals with acoustic modeling based on HMMs. It is shown how the outputs of the neural networks are used as emission probabilities, enhancing the HMM framework. This system is known as the hybrid HMM/ANN [Bourlard 94]. Furthermore, two well-known systems which use Gaussian Mixture Models (GMM) and discrete distributions for estimating emission probabilities are also presented, known as HMM/GMM and discrete HMM, respectively.
2.1 Feature Extraction
The first step in a speech processing system is usually to convert the signal from analog to digital form. In a typical speech decoder, a small window of the signal spanning approximately 30 ms is extracted and further analyzed. For the ISDN standard, which corresponds to a sampling rate of 8 kHz at 8 bit, 30 ms of speech is equivalent to 240 samples. In order to classify this small fraction of speech directly, a look-up table containing $(2^8)^{240} \approx 10^{578}$ different vectors and their probabilities of occurrence would be required. For a 16 kHz, 16 bit system, a look-up table with $(2^{16})^{480} \approx 10^{2312}$ entries would be necessary. This makes such a system intractable. For this reason, it is necessary to reduce redundancy and extract the relevant information from the speech signal. This task is performed by the feature extraction module.
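As a quick sanity check of these magnitudes, the following minimal Python sketch reproduces the table-size arithmetic above; the 30 ms frame length and the two sampling configurations are taken from the text, while the function name is an illustrative choice:

    from math import log10

    def lookup_table_size(sample_rate_hz, bits_per_sample, window_s=0.030):
        """Number of distinct quantized vectors for one analysis window,
        returned as (samples per window, log10 of the table size)."""
        samples = int(sample_rate_hz * window_s)   # samples per window
        levels = 2 ** bits_per_sample              # quantization levels
        return samples, samples * log10(levels)    # log10(levels ** samples)

    print(lookup_table_size(8000, 8))     # (240, ~578)  -> ISDN case
    print(lookup_table_size(16000, 16))   # (480, ~2312) -> 16 kHz / 16 bit case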
Common feature extraction techniques are mainly based on models of speech production. In such a model, an excitation signal $e_0(t)$, representing the air flow at the vocal cords, is fed into a linear time-varying filter $h(t)$, which represents the resonances of the vocal tract [Huang 01]. The speech signal $s(t)$ is obtained by convolution:

$$s(t) = e_0(t) * h(t) \qquad (2.1)$$
The excitation signal contains information such as voicing, pitch period
and amplitude. On the other hand, information representing the sound being
pronounced (phone) is contained on the characteristics of the filter. Therefore,
typical feature extraction methods deliver mainly the parameters of the filter,
ignoring the excitation signal.
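To make the source-filter picture concrete, the sketch below synthesizes a crude voiced sound by convolving a periodic impulse train (the excitation, carrying the pitch period) with a short decaying resonance standing in for the vocal tract filter, as in Eq. (2.1). The concrete values (100 Hz pitch, a single 500 Hz formant, the decay constant) are illustrative assumptions, not taken from the text:

    import numpy as np

    fs = 8000                              # sampling rate (Hz)
    t = np.arange(0, 0.2, 1 / fs)          # 200 ms of signal

    # Excitation e0(t): impulse train with a 100 Hz pitch (10 ms period)
    e0 = np.zeros_like(t)
    e0[:: fs // 100] = 1.0

    # Vocal tract filter h(t): one decaying resonance around 500 Hz
    # (a single formant; real vocal tracts exhibit several)
    n = np.arange(int(0.02 * fs))          # 20 ms impulse response
    h = np.exp(-n / (0.004 * fs)) * np.sin(2 * np.pi * 500 * n / fs)

    # Speech-like signal by Eq. (2.1): s(t) = e0(t) * h(t)
    s = np.convolve(e0, h)[: len(t)]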
In the following, three common feature extraction techniques are presented. They are based on speech production models, such as Linear Predictive Coding (LPC) [Atal 68, Makhoul 73], and on speech perception models, such as Mel-Frequency Cepstral Coefficients (MFCC) [Davis 89] and Perceptual Linear Prediction (PLP) [Hermansky 90]. Additionally, common techniques for exploiting larger temporal context are explained.
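As a preview of the LPC idea of recovering the filter parameters while discarding the excitation, the following hedged sketch estimates all-pole filter coefficients for one 30 ms frame from its autocorrelation, using the classical normal-equations formulation. The function name, model order, and synthetic test frame are assumptions for illustration only:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_coefficients(frame, order=10):
        """Estimate LPC coefficients a_1..a_p of an all-pole filter
        by solving the autocorrelation normal equations R a = r."""
        # Autocorrelation lags r[0..order]
        r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        # Symmetric Toeplitz system built from r[0..order-1]
        return solve_toeplitz(r[:order], r[1 : order + 1])

    # Example on a synthetic 30 ms frame at 8 kHz (240 samples)
    fs = 8000
    n = np.arange(int(0.030 * fs))
    frame = np.sin(2 * np.pi * 500 * n / fs) + 0.01 * np.random.randn(len(n))
    print(lpc_coefficients(frame, order=10))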