LSTM networks can be trained by BPTT. They have shown remarkable perfor-
mance in a variety of pattern recognition tasks, including phoneme classification
[ 27 ], handwriting recognition [ 28 ], keyword spotting [ 29 ], affective computing [ 30 ],
and driver distraction detection [ 31 ]. Combining bidirectional networks with LSTM
leads to bidirectional LSTM (BLSTM). Further details on the LSTM technique can
be found in [ 28 ].
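To make the BLSTM idea more concrete, the following is a minimal sketch of a frame-wise sequence classifier built around a bidirectional LSTM layer. It uses PyTorch purely as an illustrative framework; the feature dimension (39, e.g., MFCC-style features), hidden size, and number of target classes are arbitrary assumptions, not values from this chapter.

```python
# A minimal sketch (not the book's code) of a bidirectional LSTM (BLSTM)
# sequence classifier; layer sizes and the frame-wise classification task
# are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, num_features=39, hidden_size=128, num_classes=40):
        super().__init__()
        # bidirectional=True processes the sequence forwards and backwards
        # and concatenates both hidden states at every time step
        self.blstm = nn.LSTM(num_features, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):          # x: (batch, T, num_features)
        h, _ = self.blstm(x)       # h: (batch, T, 2 * hidden_size)
        return self.out(h)         # frame-wise class scores

# Example: a batch of 4 sequences with T = 100 frames of 39-dim features
scores = BLSTMTagger()(torch.randn(4, 100, 39))
print(scores.shape)  # torch.Size([4, 100, 40])
```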
7.3 Dynamic Learning Algorithms
Audio is sequential, and an endpointed audio stream X = {x_1, x_2, ..., x_T} accordingly yields a series of T feature vectors. So far, however, we have mostly dealt with the classification of single feature vectors without use of temporal context. One exception are the different types of RNN discussed above, which model such context. But even these are not able to 'warp' in time, i.e., to handle different tempo deviations between, e.g., two musical pieces, or the stretching or shortening, e.g., of vowels while speaking. The most frequently encountered algorithm for audio sequence classification is the HMM [ 32 ], a simple form of DBN. This popularity is owed to the ability of HMMs to model dynamics across different hierarchy levels within a well-formulated stochastic framework. In ASR, for example, the extracted feature stream is first modelled on the phoneme level. On a higher level, these phonemes are then used to form words. Each class i is modelled by an HMM that represents the probability P(X | i), where X is called the 'observation', which is generated by the HMM.
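To illustrate how P(X | i) can be used for classification, the sketch below evaluates an observation sequence under one HMM per class with the forward algorithm and picks the class with the highest likelihood. It assumes discrete emission symbols and made-up toy parameters; it is not an implementation from this chapter.

```python
# A minimal sketch of classifying an observation sequence X by evaluating
# P(X | i) under one HMM per class; emissions and parameters are toy values.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(X | HMM) for a discrete-emission HMM.
    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities a_{j,k}
    B:   (N, M) emission probabilities b_s(x)
    obs: sequence of observation symbol indices x_1 ... x_T
    """
    alpha = pi * B[:, obs[0]]                # alpha_1(s) = pi_s * b_s(x_1)
    log_p = 0.0
    for x_t in obs[1:]:
        alpha = (alpha @ A) * B[:, x_t]      # forward recursion over time
        scale = alpha.sum()                  # rescale to avoid underflow
        log_p += np.log(scale)
        alpha /= scale
    return log_p + np.log(alpha.sum())

# Toy example: two classes, each modelled by its own 2-state left-right HMM
models = {
    "class_0": (np.array([1.0, 0.0]),
                np.array([[0.7, 0.3], [0.0, 1.0]]),
                np.array([[0.9, 0.1], [0.2, 0.8]])),
    "class_1": (np.array([1.0, 0.0]),
                np.array([[0.4, 0.6], [0.0, 1.0]]),
                np.array([[0.1, 0.9], [0.8, 0.2]])),
}
X = [0, 0, 1, 1, 1]
best = max(models, key=lambda i: forward_log_likelihood(*models[i], X))
print(best)  # class with the highest P(X | i)
```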
A Markov model can be seen as a finite state automaton that may change its state at any step in time. In an HMM, at each step in time t a feature vector x_t is generated depending on the current state s and the emission probability b_s(x_t). The probability of a transition from state j to state k is expressed by the state transition probability a_{j,k} [ 33 ]. The probabilities a_{0,j} are needed to enter the model in a state j with a certain probability. In order to simplify calculation, a non-emitting initial state s_0 and a non-emitting final state s_F can be defined [ 1 ]. In Fig. 7.12 the structure of such a model is depicted. The example shows the type of HMM most frequently used for audio processing, the so-called left-right model. In this model type,
the state number cannot decrease over time. In the 'linear' model, no state can be
skipped. Other topologies allow for a state skip, such as the Bakis model in which
one state may be skipped. If any state can be reached from any other state with a
probability above zero, the topology is referred to as 'ergodic'.
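As an illustration of these topologies, the transition matrices below sketch a linear left-right model, a Bakis model with a single state skip, and an ergodic model for three emitting states; the probability values are invented for the example.

```python
# Illustrative transition matrices a_{j,k} (row: from state j, column: to
# state k) for the topologies named above; the values are made up.
import numpy as np

# 'Linear' left-right model: stay in a state or move to the next one only
A_linear = np.array([[0.6, 0.4, 0.0],
                     [0.0, 0.7, 0.3],
                     [0.0, 0.0, 1.0]])

# Bakis model: additionally allows skipping exactly one state
A_bakis = np.array([[0.5, 0.3, 0.2],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])

# Ergodic model: every state reachable from every other with p > 0
A_ergodic = np.array([[0.6, 0.2, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.1, 0.4, 0.5]])

# In all cases the rows are stochastic, i.e., they sum to one
for A in (A_linear, A_bakis, A_ergodic):
    assert np.allclose(A.sum(axis=1), 1.0)
```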
One speaks of a 'hidden' Markov model, as the sequence of states remains unknown; only the observation sequence is known [ 32 ]. Note the 'Markov property': the conditional probability distribution of the hidden variable s(t) at time step t, given the values of the hidden variable s at all previous time steps, depends only on the hidden variable s(t - 1), i.e., values at earlier steps in time have no influence [ 34 ].
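The following small sketch makes the 'hidden' aspect and the Markov property tangible by sampling from a toy HMM: the next state is drawn using only the current state, and an outside observer would see only the emitted symbols, never the state sequence. The parameters are again invented for illustration.

```python
# Sampling from a toy HMM: s(t+1) depends only on s(t) (Markov property),
# and only the observations would be visible; parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([1.0, 0.0])                  # always start in state 0
A = np.array([[0.8, 0.2], [0.0, 1.0]])     # transition probabilities a_{j,k}
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities b_s(x)

T = 10
s = rng.choice(2, p=pi)
states, observations = [], []
for _ in range(T):
    observations.append(rng.choice(2, p=B[s]))  # emit x_t given state s(t)
    states.append(s)
    s = rng.choice(2, p=A[s])                   # next state from s(t) only

print("hidden states :", states)
print("observations  :", observations)
```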