Hierarchical Approach and Downsampling Schemes - Hierarchical Neural Network Structures for Phoneme Recognition

4 Hierarchical Approach and Downsampling Schemes

A simple and successful phoneme recognizer in a hierarchical ANN framework is proposed in [Pinto 08b]. In Section 3.4 we observed that this method compares favorably with earlier approaches. In this scheme, phoneme posteriors are estimated by a two-level hierarchical structure. In the first level, an MLP estimates intermediate phoneme posteriors based on a temporal window of cepstral features. In the second level, another MLP estimates final phoneme posteriors based on a temporal window of intermediate posterior features. The final phoneme posteriors are then input to a Viterbi decoder.
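The data flow of the two-level structure can be sketched as follows. This is a minimal illustration, not the implementation of [Pinto 08b]: the class `TinyMLP`, the helper `stack_context`, the edge padding, the untrained random weights, and all sizes (13 cepstral coefficients, 5 phoneme classes, window half-widths of 4 and 10 frames) are assumptions chosen only to make the example runnable.

```python
import math
import random


def stack_context(frames, half_width):
    """Concatenate each frame with its neighbours in [t - half_width, t + half_width].

    Edges are padded by repeating the first/last frame (an assumption).
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half_width, half_width + 1):
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMLP:
    """One hidden layer with random, untrained weights; illustration only."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]

    def posteriors(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return softmax([sum(w * h for w, h in zip(row, hidden)) for row in self.w2])


def hierarchical_posteriors(cepstra, mlp1, mlp2, half_width1, half_width2):
    # Level 1: window of cepstral features -> intermediate posteriors.
    intermediate = [mlp1.posteriors(x) for x in stack_context(cepstra, half_width1)]
    # Level 2: window of intermediate posteriors -> final posteriors.
    final = [mlp2.posteriors(x) for x in stack_context(intermediate, half_width2)]
    return intermediate, final


rng = random.Random(0)
n_frames, n_cep, n_phones = 20, 13, 5   # toy sizes, not taken from the source
c1, c2 = 4, 10                          # window half-widths, hypothetical
cepstra = [[rng.gauss(0.0, 1.0) for _ in range(n_cep)] for _ in range(n_frames)]
mlp1 = TinyMLP(n_cep * (2 * c1 + 1), 8, n_phones, rng)
mlp2 = TinyMLP(n_phones * (2 * c2 + 1), 8, n_phones, rng)
intermediate, final = hierarchical_posteriors(cepstra, mlp1, mlp2, c1, c2)
```

In a real system the rows of `final` would be passed on to the Viterbi decoder; here they are only valid probability vectors produced by untrained networks.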

In [Pinto 08b, Pinto 09] it is shown that the hierarchical scheme considerably outperforms the non-hierarchical approach in phoneme and word recognition tasks. In addition, in comparison with the first MLP, the second MLP is able to process a larger temporal context, which improves performance. However, the use of a second MLP in tandem greatly increases computational time and memory requirements. Therefore, the main goal of this chapter is to optimize this scheme with respect to these costs while maintaining system accuracy. The next chapter presents an extension of the hierarchical scheme whose goal is to improve recognition accuracy.

In order to reduce the computational time and/or the number of parameters of the hierarchical scheme, several downsampling schemes are investigated. These schemes remove redundant information contained in the intermediate posteriors, so that system performance is not affected. In addition, the large reduction in computational time and number of parameters makes the system portable to real-time applications or embedded systems [Vasquez 09d].
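To make the idea of removing redundant intermediate posteriors concrete, the sketch below shows one possible form of downsampling. This excerpt does not specify the schemes, so everything here is an assumption: the function name `downsampled_window`, the uniform sampling period, and the edge padding are hypothetical. The function builds the input to the second MLP from every `period`-th posterior vector inside the temporal window, which shrinks the input layer and therefore the number of parameters.

```python
def downsampled_window(posteriors, t, half_width, period):
    """Build a second-level input vector around frame t.

    Only frames at offsets that are multiples of `period` within
    [-half_width, half_width] are kept (uniform downsampling, an assumption).
    Indices outside the utterance are clamped to the first/last frame.
    """
    last = len(posteriors) - 1
    start = -(half_width // period) * period  # keeps offsets symmetric around 0
    window = []
    for offset in range(start, half_width + 1, period):
        idx = min(max(t + offset, 0), last)
        window.extend(posteriors[idx])
    return window


# Toy intermediate posteriors: 30 frames, 3 classes, each frame filled with
# its own index so the selected frames are easy to recognize.
toy_posteriors = [[float(t)] * 3 for t in range(30)]

full_input = downsampled_window(toy_posteriors, t=15, half_width=10, period=1)
sparse_input = downsampled_window(toy_posteriors, t=15, half_width=10, period=5)
```

With a half-width of 10 frames, a period of 1 keeps all 21 frames of the window, whereas a period of 5 keeps only the frames at offsets -10, -5, 0, 5, and 10, so the second MLP would see 5 posterior vectors instead of 21 while still covering the same temporal span.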

4.1 Hierarchical Approach

This work is based on the hierarchical structure described in [Pinto 08b]. As mentioned above, it consists of two levels that estimate posteriors, as shown in Fig. 4.1. In the first level (feature level), an MLP (MLP1) estimates intermediate posterior probabilities $x_{k,t}$ of each of the $n$ phonetic classes
