$$b_t = \operatorname*{arg\,max}_{j} \, (o_{t,1}, \ldots, o_{t,j}, \ldots, o_{t,P}) \qquad (7.83)$$
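As a minimal illustration, Eq. (7.83) amounts to taking the arg max over the P output activations of the BLSTM at time t. The activation values in the following Python sketch are hypothetical:

```python
import numpy as np

# Hypothetical BLSTM output activations o_{t,j} for P = 3 classes at one time step.
o_t = np.array([0.1, 0.7, 0.2])

# Eq. (7.83): the discrete observation b_t is the index of the maximally
# activated output unit.
b_t = int(np.argmax(o_t))
print(b_t)  # → 1
```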
In every time step the BLSTM generates a class prediction according to Eq. (7.83), and the HMM models $x_{1:T}$ and $b_{1:T}$ as two independent data streams. With $y_t = [x_t; b_t]$ being the joint feature vector consisting of continuous audio features and discrete BLSTM observations, and the variable $a$ denoting the stream weight of the first stream (i.e., the audio feature stream), the multi-stream HMM emission probability while being in a certain state $s_t$ can be written as
$$p(y_t \mid s_t) = \left[ \sum_{m=1}^{M} c_{s_t m} \, \mathcal{N}(x_t; \mu_{s_t m}, \Sigma_{s_t m}) \right]^{a} \times p(b_t \mid s_t)^{2-a}. \qquad (7.84)$$
Thus, the continuous audio feature observations are modelled via a mixture of $M$ Gaussians per state, while the BLSTM prediction is modelled using a discrete probability distribution $p(b_t \mid s_t)$. The index $m$ denotes the mixture component, $c_{s_t m}$ is the weight of the $m$'th Gaussian associated with state $s_t$, and $\mathcal{N}(\cdot; \mu, \Sigma)$ represents a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. The distribution $p(b_t \mid s_t)$ is trained to model typical class confusions that occur in the BLSTM network.
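The emission probability of Eq. (7.84) can be sketched in Python. All parameter values below (GMM weights, means, covariances, the discrete distribution, and the stream weight) are illustrative placeholders, not values from the chapter:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov) with full covariance."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def emission_prob(x_t, b_t, weights, means, covs, p_discrete, a):
    """Eq. (7.84): multi-stream emission probability p(y_t | s_t).

    weights, means, covs: GMM parameters of the current state (M components).
    p_discrete: discrete distribution p(b | s_t) over BLSTM predictions.
    a: stream weight of the audio stream; the BLSTM stream gets 2 - a.
    """
    gmm = sum(c * gaussian_pdf(x_t, mu, cov)
              for c, mu, cov in zip(weights, means, covs))
    return gmm ** a * p_discrete[b_t] ** (2.0 - a)

# Illustrative usage with hypothetical parameters:
x_t = np.array([0.2, -0.1])                  # continuous audio features
b_t = 1                                      # discrete BLSTM prediction
weights = [0.6, 0.4]                         # M = 2 mixture weights
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
p_disc = np.array([0.1, 0.8, 0.1])           # p(b | s_t)
print(emission_prob(x_t, b_t, weights, means, covs, p_disc, a=1.2))
```

Note that with $a = 1$ both streams contribute equally, while $a = 2$ discards the BLSTM stream entirely.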
7.5 Evaluation
7.5.1 Partitioning and Balancing
We now deal with typical ways of evaluating the performance of audio recognition systems. We focus on measurements that judge the reliability of the recognition result, as these are of major interest in the extensive body of literature on intelligent speech, music, and sound analysis. However, as shown in the requirements section, a number of further aspects could be considered, such as real-time capability.
Evaluation should ideally be based on test partition(s) of suitable audio databases that have not been 'seen' during system optimisation. Such optimisation includes data-based tuning of any step in the chain of audio analysis, including enhancement, feature extraction and normalisation, feature selection, parameter selection for the learning algorithm, etc. Thus, besides a training partition, a 'development' partition is needed for the above-named optimisation steps. During the final system training, however, training and development partitions may be united in order to provide more learning material to the system. In general, one wishes all partitions to be reasonably large. For the test partition, this is needed in order to obtain statistically significant results. Popular 'percentage splits' are thus 40 %:30 %:30 % for training, development, and test. In the case of very large databases, as are often available in ASR, the test partition is often chosen smaller, e.g., around 10 %.
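A minimal sketch of such a percentage split; the instance IDs, the 40 %:30 %:30 % fractions, and the random seed are illustrative assumptions:

```python
import random

def split_partitions(items, fractions=(0.4, 0.3, 0.3), seed=0):
    """Randomly split a list of instance IDs into train/dev/test partitions.

    fractions: the illustrative 40 %:30 %:30 % split mentioned in the text.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_dev = int(fractions[1] * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to test
    return train, dev, test

train, dev, test = split_partitions(list(range(100)))
print(len(train), len(dev), len(test))  # → 40 30 30
```

In practice, splits are usually made speaker- or recording-session-disjoint rather than purely random over instances, so that the test partition is truly unseen.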