LSTM networks can be trained by BPTT. They have shown remarkable perfor-
mance in a variety of pattern recognition tasks, including phoneme classification
[ 27 ], handwriting recognition [ 28 ], keyword spotting [ 29 ], affective computing [ 30 ],
and driver distraction detection [ 31 ]. Combining bidirectional networks with LSTM
leads to bidirectional LSTM (BLSTM). Further details on the LSTM technique can
be found in [ 28 ].
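To make the BLSTM idea more concrete, the following is a minimal sketch of a frame-wise sequence classifier built around a bidirectional LSTM layer. It uses PyTorch purely as an illustrative framework; the feature dimension (39, e.g., MFCC-style features), hidden size, and number of target classes are arbitrary assumptions, not values from this chapter.

```python
# A minimal sketch (not the book's code) of a bidirectional LSTM (BLSTM)
# sequence classifier; layer sizes and the frame-wise classification task
# are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, num_features=39, hidden_size=128, num_classes=40):
        super().__init__()
        # bidirectional=True processes the sequence forwards and backwards
        # and concatenates both hidden states at every time step
        self.blstm = nn.LSTM(num_features, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):          # x: (batch, T, num_features)
        h, _ = self.blstm(x)       # h: (batch, T, 2 * hidden_size)
        return self.out(h)         # frame-wise class scores

# Example: a batch of 4 sequences with T = 100 frames of 39-dim features
scores = BLSTMTagger()(torch.randn(4, 100, 39))
print(scores.shape)  # torch.Size([4, 100, 40])
```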
7.3 Dynamic Learning Algorithms
Audio is sequential, and an endpointed audio stream X = {x_1, x_2, ..., x_T} accordingly yields a series of T feature vectors. So far, however, we have mostly dealt with the classification of single feature vectors without use of temporal context. One exception are the different types of RNN discussed above, which model such context. But even these are not able to 'warp' in time, i.e., to handle different tempo deviations between, e.g., two musical pieces, or the stretching or shortening, e.g., of vowels while speaking. The most frequently encountered algorithm for audio sequence classification is the HMM [ 32 ], a simple form of DBN. This popularity is owed to the ability of HMMs to model dynamics across different hierarchy levels within a well-formulated stochastic framework. In ASR, for example, the extracted feature stream is first modelled on the phoneme level. On a higher level, these phonemes are then used to form words. Each class i is modelled by an HMM that represents the probability P(X | i), where X is called the 'observation', which is generated by the HMM.
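To illustrate how P(X | i) can be used for classification, the sketch below evaluates an observation sequence under one HMM per class with the forward algorithm and picks the class with the highest likelihood. It assumes discrete emission symbols and made-up toy parameters; it is not an implementation from this chapter.

```python
# A minimal sketch of classifying an observation sequence X by evaluating
# P(X | i) under one HMM per class; emissions and parameters are toy values.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(X | HMM) for a discrete-emission HMM.
    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities a_{j,k}
    B:   (N, M) emission probabilities b_s(x)
    obs: sequence of observation symbol indices x_1 ... x_T
    """
    alpha = pi * B[:, obs[0]]                # alpha_1(s) = pi_s * b_s(x_1)
    log_p = 0.0
    for x_t in obs[1:]:
        alpha = (alpha @ A) * B[:, x_t]      # forward recursion over time
        scale = alpha.sum()                  # rescale to avoid underflow
        log_p += np.log(scale)
        alpha /= scale
    return log_p + np.log(alpha.sum())

# Toy example: two classes, each modelled by its own 2-state left-right HMM
models = {
    "class_0": (np.array([1.0, 0.0]),
                np.array([[0.7, 0.3], [0.0, 1.0]]),
                np.array([[0.9, 0.1], [0.2, 0.8]])),
    "class_1": (np.array([1.0, 0.0]),
                np.array([[0.4, 0.6], [0.0, 1.0]]),
                np.array([[0.1, 0.9], [0.8, 0.2]])),
}
X = [0, 0, 1, 1, 1]
best = max(models, key=lambda i: forward_log_likelihood(*models[i], X))
print(best)  # class with the highest P(X | i)
```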
A Markov model can be seen as a finite state automaton that may change its state at any step in time. In an HMM, at each step in time t a feature vector x_t is generated depending on the current state s and the emission probability b_s(x_t). The probability of a transition from state j to state k is expressed by the state transition probability a_{j,k} [ 33 ]. The probabilities a_{0,j} are needed to enter the model in a state j with a certain probability. In order to simplify calculation, a non-emitting initial state s_0 and a non-emitting final state s_F can be defined [ 1 ]. In Fig. 7.12 the structure of such a model is depicted. The example shows the type of HMM most frequently used for audio processing, the so-called left-right model. In this model type,
the state number cannot decrease over time. In the 'linear' model, no state can be
skipped. Other topologies allow for a state skip, such as the Bakis model in which
one state may be skipped. If any state can be reached from any other state with a
probability above zero, the topology is referred to as 'ergodic'.
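As an illustration of these topologies, the transition matrices below sketch a linear left-right model, a Bakis model with a single state skip, and an ergodic model for three emitting states; the probability values are invented for the example.

```python
# Illustrative transition matrices a_{j,k} (row: from state j, column: to
# state k) for the topologies named above; the values are made up.
import numpy as np

# 'Linear' left-right model: stay in a state or move to the next one only
A_linear = np.array([[0.6, 0.4, 0.0],
                     [0.0, 0.7, 0.3],
                     [0.0, 0.0, 1.0]])

# Bakis model: additionally allows skipping exactly one state
A_bakis = np.array([[0.5, 0.3, 0.2],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])

# Ergodic model: every state reachable from every other with p > 0
A_ergodic = np.array([[0.6, 0.2, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.1, 0.4, 0.5]])

# In all cases the rows are stochastic, i.e., they sum to one
for A in (A_linear, A_bakis, A_ergodic):
    assert np.allclose(A.sum(axis=1), 1.0)
```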
One speaks of a 'hidden' Markov model, as the sequence of states remains unknown; only the observation sequence is known [ 32 ]. Note the 'Markov property': the conditional probability distribution of the hidden variable s(t) at time step t, given the values of the hidden variable s at all previous time steps, depends only on the hidden variable s(t - 1), i.e., values at earlier steps in time have no influence [ 34 ].
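The following small sketch makes the 'hidden' aspect and the Markov property tangible by sampling from a toy HMM: the next state is drawn using only the current state, and an outside observer would see only the emitted symbols, never the state sequence. The parameters are again invented for illustration.

```python
# Sampling from a toy HMM: s(t+1) depends only on s(t) (Markov property),
# and only the observations would be visible; parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([1.0, 0.0])                  # always start in state 0
A = np.array([[0.8, 0.2], [0.0, 1.0]])     # transition probabilities a_{j,k}
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities b_s(x)

T = 10
s = rng.choice(2, p=pi)
states, observations = [], []
for _ in range(T):
    observations.append(rng.choice(2, p=B[s]))  # emit x_t given state s(t)
    states.append(s)
    s = rng.choice(2, p=A[s])                   # next state from s(t) only

print("hidden states :", states)
print("observations  :", observations)
```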