Audio Recognition - Intelligent Audio Analysis

Digital Signal Processing Reference

In-Depth Information

If we further assume that the acoustic realisations of the audio events are inde-

pendent of each other, the audio events can be modelled individually:

(

| S j ) =

(

x 1 ,...,

x i )

(

x i + 1 ,...,

x j )...

(

x k + 1 ,...,

x A )

(7.72)

It is assumed that these audio events occur without pauses and pauses are treated

as audio events. Note that the audio event boundaries i

k and audio event

number A are unknown and need to be determined by the classifier.

In the same way each audio event can be constructed by a sequence of audio

sub-events (ASE) one level lower in hierarchy again assuming independence. In the

case of speech these could be phonemes, triphones, or syllables, etc. In the case of

chord arpeggios, these could be note events. If the ASE are modelled by HMM, the

Viterbi algorithm can be applied on all three layers [ 36 ]: for the search of the state

sequence within the HMMs, for the sequence of the individual ASE HMMs in each

audio event, and for the sequence of the audio events, i.e.,

,...,

At the audio event transitions the LM can be applied to model higher level infor-

mation by transition probabilities [ 38 ]. These can for example be N-grams that model

the conditional probability of a sequence of consecutive audio events, e.g., two or

three. The Viterbi path determines the optimal path through all layers and by that

the optimal sequence recognition with the optimal sequence of audio events—for an

illustrative example see 7.13 where an according 'Trellis' is shown [ 32 ].

If the number of audio events—the 'vocabulary' size—is very large, the Viterbi

search can become very computationally demanding and thus slow. Though at time

step t only a single column needs to be analysed in the Trellis diagram (cf. Fig. 7.13 ),

all emission probabilities in all states for all ASE in all audio events need to be com-

puted. In the case of large vocabulary continuous speech recognition (LVCSR) this

may easily require computation of more than 100 000 normal distributions in 10 ms

[ 36 ]. One can thus make use of the fact that usually many paths in the Trellis are not

promising in the sense that they lead to the overall best path, which is searched for.

The 'beam search' thus prunes these candidates accepting a sub-optimal solution

(usually less then one percentage point increase in error probability) at consider-

able speed-up and reduced memory consumption. This is reached by a smart list

management in five consecutive steps [ 39 ]:

First, at time step t a list of all active states is set-up. This contains all the points

in the Trellis diagram whose validation exceeds a given threshold. Each element in

the list is stored by the audio event number, the ASE number, the state number, and

its validation.

Then, from this list all possible subsequent states are computed that can be reached

by the Viterbi path-diagram. To this end, the path diagram is applied in forward

direction by overwriting the place-holders in the transition from

each

according to higher validation. The algorithm works in a recursive manner as usual

and the effect is the same as when applying the path diagrams as in the usual case in

backward direction.

Next, the list of subsequent states is reduced by deleting those states below the

threshold—this is the actual pruning. This threshold is best constantly adapted to the

(

−

)

(

)

Intelligent Audio Analysis

Search WWH ::

Custom Search

Home