Digital Signal Processing Reference
In-Depth Information
If we further assume that the acoustic realisations of the audio events are inde-
pendent of each other, the audio events can be modelled individually:
p
(
X
| S j ) =
p
(
x 1 ,...,
x i )
p
(
x i + 1 ,...,
x j )...
p
(
x k + 1 ,...,
x A )
(7.72)
It is assumed that these audio events occur without pauses and pauses are treated
as audio events. Note that the audio event boundaries i
k and audio event
number A are unknown and need to be determined by the classifier.
In the same way each audio event can be constructed by a sequence of audio
sub-events (ASE) one level lower in hierarchy again assuming independence. In the
case of speech these could be phonemes, triphones, or syllables, etc. In the case of
chord arpeggios, these could be note events. If the ASE are modelled by HMM, the
Viterbi algorithm can be applied on all three layers [ 36 ]: for the search of the state
sequence within the HMMs, for the sequence of the individual ASE HMMs in each
audio event, and for the sequence of the audio events, i.e.,
,
j
,...,
S
.
At the audio event transitions the LM can be applied to model higher level infor-
mation by transition probabilities [ 38 ]. These can for example be N-grams that model
the conditional probability of a sequence of consecutive audio events, e.g., two or
three. The Viterbi path determines the optimal path through all layers and by that
the optimal sequence recognition with the optimal sequence of audio events—for an
illustrative example see 7.13 where an according 'Trellis' is shown [ 32 ].
If the number of audio events—the 'vocabulary' size—is very large, the Viterbi
search can become very computationally demanding and thus slow. Though at time
step t only a single column needs to be analysed in the Trellis diagram (cf. Fig. 7.13 ),
all emission probabilities in all states for all ASE in all audio events need to be com-
puted. In the case of large vocabulary continuous speech recognition (LVCSR) this
may easily require computation of more than 100 000 normal distributions in 10 ms
[ 36 ]. One can thus make use of the fact that usually many paths in the Trellis are not
promising in the sense that they lead to the overall best path, which is searched for.
The 'beam search' thus prunes these candidates accepting a sub-optimal solution
(usually less then one percentage point increase in error probability) at consider-
able speed-up and reduced memory consumption. This is reached by a smart list
management in five consecutive steps [ 39 ]:
First, at time step t a list of all active states is set-up. This contains all the points
in the Trellis diagram whose validation exceeds a given threshold. Each element in
the list is stored by the audio event number, the ASE number, the state number, and
its validation.
Then, from this list all possible subsequent states are computed that can be reached
by the Viterbi path-diagram. To this end, the path diagram is applied in forward
direction by overwriting the place-holders in the transition from
each
according to higher validation. The algorithm works in a recursive manner as usual
and the effect is the same as when applying the path diagrams as in the usual case in
backward direction.
Next, the list of subsequent states is reduced by deleting those states below the
threshold—this is the actual pruning. This threshold is best constantly adapted to the
(
t
1
)
to
(
t
)
 
 
Search WWH ::




Custom Search