Digital Signal Processing Reference
In-Depth Information
\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\,\frac{\partial a_k}{\partial a_j}    (2.37)
where the subscripts indicate the branches of the network connecting unit j
with unit k.
For the output unit, the error is given by:
\delta_k = y_k - t_k    (2.38)
and from (2.32) and (2.37), the error at any unit of the network can be
expressed as:
\delta_j = h'(a_j) \sum_k w_{jk}\,\delta_k    (2.39)
Notice that the previous analysis is independent of the number of layers
and of the type of nonlinear activation function. However, as mentioned,
the activation function is required to be differentiable. Once the
derivatives are calculated based on (2.36), the weights can be updated
by (2.31).
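The recursion in (2.37)–(2.39) followed by the gradient-descent update can be sketched as below. The two-layer network shape, the tanh hidden activation, the linear output unit, and the learning rate are illustrative assumptions, not taken from the text.

```python
import numpy as np

def tanh_prime(a):               # h'(a) as it appears in (2.39)
    return 1.0 - np.tanh(a) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # one input feature vector
t = np.array([0.5])              # target value
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
W2 = rng.normal(size=(1, 4))     # hidden -> output weights

# Forward pass: activations a, unit outputs z = h(a)
a1 = W1 @ x
z1 = np.tanh(a1)
y = W2 @ z1                      # linear output unit

# (2.38): error at the output unit
delta2 = y - t

# (2.39): error back-propagated to the hidden units
delta1 = tanh_prime(a1) * (W2.T @ delta2)

# Gradient-descent weight update as in (2.31)
eta = 0.01
W2 -= eta * np.outer(delta2, z1)
W1 -= eta * np.outer(delta1, x)
```

Because each delta depends only on the deltas of the units it feeds, the same recursion applies unchanged to networks with more layers.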
2.3 Acoustic Models
In the previous section we have observed the most common feature extraction
techniques which correspond to the front-end of a speech decoder, shown in
Fig. 2.1. This section presents the fundamentals of the statistical decoder
which is commonly based on Hidden Markov Models (HMMs) [Baum 67]. The
HMM is probably the most powerful statistical method for modeling speech
signals: it can characterize observed data samples, such as sequences of
feature vectors of variable time length, for pattern classification. This task
is performed efficiently by introducing the dynamic programming principle.
The HMM assumes that the observed samples are generated by a parametric
random process, and it provides a well-defined framework for estimating the
parameters of that stochastic process [Huang 01].
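As a concrete illustration of the dynamic-programming evaluation alluded to above, the forward algorithm computes the likelihood of an observation sequence under an HMM in O(T·N²) rather than summing over all Nᵀ state paths. The two-state discrete model and its parameter values below are invented for illustration; they are not from the text.

```python
import numpy as np

pi = np.array([0.6, 0.4])            # initial state probabilities
A  = np.array([[0.7, 0.3],           # state transition matrix A[i, j]
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],           # emission probabilities B[state, symbol]
               [0.2, 0.8]])

def forward_likelihood(obs):
    """P(obs | model), summed over all state paths by dynamic programming."""
    alpha = pi * B[:, obs[0]]        # initialization at t = 0
    for o in obs[1:]:                # recursion over time
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()               # termination
```

Note that the recursion works for any sequence length, which is exactly why HMMs can compare observation sequences of variable duration.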
The general assumption of the speech decoder is that the message carried
in the speech signal is encoded as a sequence of symbols. In the previous sec-
tion, we have observed that the front-end of the recognizer extracts relevant
information from the speech signal and embeds it in feature vectors. Then,
the task of the statistical decoder is to map the sequence of feature vectors
to the sequence of symbols.
The statistical decoder has to deal in particular with two problems. First,
as has been mentioned, feature vectors are extracted at a fixed rate,
typically every 10 ms. Therefore, the length of the feature-vector sequence
depends on the duration of the speech signal. However, several speech
signals of different lengths may carry the same message, depending on
factors such as speaking rate, speaker mood, etc. As a consequence, there