The Tandem approach consists of a hierarchical structure in which the output of a first-level classifier based on neural networks is postprocessed and then sent to a second-level classifier based on HMM/GMM. The classifier at the first level consists of an MLP that takes as input a window of consecutive cepstral features and estimates phoneme posterior probabilities. The MLP outputs are then postprocessed by a logarithm function, or simply by removing the output softmax nonlinearity. Given the skewed distribution at the output of this postprocessing, the features are then "gaussianized" by a Karhunen-Loève (KL) linear transformation, or by more complex transformations such as HLDA [Zhu 04]. This transformation, together with a possible dimensionality reduction, makes the features more suitable for modeling by Gaussian mixtures. Finally, the outputs of the linear transformation are used as observation features in an HMM/GMM classifier.
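To make the pipeline above concrete, the following is a minimal sketch in Python (NumPy only); the function name tandem_features and all variable names are illustrative, and the KL transform is realized here as plain PCA on the log-posteriors, HLDA being a more elaborate alternative:

```python
# Minimal sketch of the Tandem feature pipeline described above.
# "mlp_posteriors" stands for the per-frame phoneme posterior
# estimates produced by the first-level MLP (names are illustrative).
import numpy as np

def tandem_features(mlp_posteriors, n_components=None, eps=1e-10):
    """Turn MLP phoneme posteriors into Gaussian-friendly features.

    mlp_posteriors: (T, K) array of per-frame posterior probabilities.
    n_components:   optional reduced dimensionality after the KL transform.
    """
    # 1. Postprocessing: a logarithm spreads out the skewed,
    #    probability-valued outputs of the softmax layer.
    x = np.log(mlp_posteriors + eps)

    # 2. "Gaussianization" by a Karhunen-Loeve (PCA) linear transform:
    #    decorrelate the log-posteriors using the eigenvectors of
    #    their covariance matrix.
    x = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # strongest components first
    basis = eigvecs[:, order]
    if n_components is not None:           # optional dimensionality reduction
        basis = basis[:, :n_components]

    # 3. The transformed features serve as HMM/GMM observations.
    return x @ basis

# Toy usage: 100 frames of posteriors over 40 phoneme classes.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=100)
obs = tandem_features(post, n_components=26)
print(obs.shape)   # (100, 26)
```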
This method has shown promising results in word recognition [Hermansky 00] and phoneme recognition [Fosler-Lussier 08] tasks in comparison to standard HMM/GMM systems based on cepstral features. Additionally, evaluations of the Tandem technique have been carried out in [Sivadas 02], where the MLP is replaced by a hierarchical MLP structure. However, a drawback of the Tandem technique is pointed out in [Aradilla 08]: part of the discriminative information at the output of the MLP can be lost during the postprocessing step.
Other attempts to combine discriminative and generative models are explored in [Pinto 08a]. In contrast to the Tandem technique, likelihoods estimated by an HMM/GMM based on cepstral features are used as input to an MLP that estimates phoneme posterior probabilities. This system is evaluated on a phoneme recognition task based on the hybrid HMM/ANN approach. Additionally, in the same work, more complex hierarchical structures are evaluated, aiming to combine likelihoods estimated by GMMs with posterior probabilities estimated by MLPs.
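As a rough illustration of this reversed combination, the sketch below (again NumPy only, with hypothetical names, and simplifying each phoneme's GMM to a single diagonal-covariance Gaussian) computes the per-frame log-likelihood vector that would replace cepstral features as the input of the second-level MLP:

```python
# Illustrative sketch only: one diagonal-covariance Gaussian per
# phoneme stands in for a full GMM, and all names are hypothetical.
import numpy as np

def loglik_features(frames, means, variances):
    """frames: (T, D) cepstral features; means/variances: (K, D) per phoneme.

    Returns a (T, K) matrix of Gaussian log-likelihoods, one column
    per phoneme class, to be fed to the second-level MLP.
    """
    diff = frames[:, None, :] - means[None, :, :]    # (T, K, D)
    ll = -0.5 * (np.log(2 * np.pi * variances)[None]
                 + diff ** 2 / variances[None])      # per-dimension terms
    return ll.sum(axis=-1)                           # (T, K)
```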
3.5.2 Conditional Random Fields
Recently, Conditional Random Fields (CRFs) have gained interest in the field of ASR, given their successful adaptation as statistical models for the speech recognition process. In fact, it is shown in [Gunawardana 05] that CRFs can be seen as a generalization of HMMs in which the transition probabilities are estimated by feature functions that depend on the entire observation sequence and on the time instance of the transition. Further theoretical advantages of CRFs over HMMs include a discriminative training criterion, the absence of independence assumptions about consecutive observed features, and the capacity to estimate negative evidence [Gunawardana 05, Lafferty 01, Fosler-Lussier 08].
A Conditional Random Field estimates the posterior probability of a label sequence $q_{1:T}$ given the entire observation sequence $o_{1:T}$, i.e., $p(q_{1:T} \mid o_{1:T})$.
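For concreteness, the standard linear-chain form of this posterior from [Lafferty 01] can be written as follows (the feature functions $f_k$, their weights $\lambda_k$, and the normalizer $Z$ follow the conventional notation rather than anything defined in this section):

\[
p(q_{1:T} \mid o_{1:T}) = \frac{1}{Z(o_{1:T})} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(q_{t-1}, q_t, o_{1:T}, t) \right),
\]

where the partition function $Z(o_{1:T})$ sums the same exponential over all possible label sequences. Note that each feature function receives the entire observation sequence $o_{1:T}$ and the time instance $t$, which is precisely the dependence structure contrasted with HMM transition probabilities above.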