$$b_t = \operatorname*{arg\,max}_{j} \, (o_{t,1}, \ldots, o_{t,j}, \ldots, o_{t,P}) \qquad (7.83)$$
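As a minimal illustration, Eq. (7.83) amounts to taking the arg max over the P output activations of the BLSTM at time t. The activation values in the following Python sketch are hypothetical:

```python
import numpy as np

# Hypothetical BLSTM output activations o_{t,j} for P = 3 classes at one time step.
o_t = np.array([0.1, 0.7, 0.2])

# Eq. (7.83): the discrete observation b_t is the index of the maximally
# activated output unit.
b_t = int(np.argmax(o_t))
print(b_t)  # → 1
```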
In every time step the BLSTM generates a class prediction according to Eq. (7.83), and the HMM models $x_{1:T}$ and $b_{1:T}$ as two independent data streams. With $y_t = [x_t; b_t]$ being the joint feature vector consisting of continuous audio features and discrete BLSTM observations, and the variable $a$ denoting the stream weight of the first stream (i.e., the audio feature stream), the multi-stream HMM emission probability while being in a certain state $s_t$ can be written as
$$p(y_t \mid s_t) = \left[ \sum_{m=1}^{M} c_{s_t m} \, \mathcal{N}(x_t; \mu_{s_t m}, \Sigma_{s_t m}) \right]^{a} \times p(b_t \mid s_t)^{2-a}. \qquad (7.84)$$
Thus, the continuous audio feature observations are modelled via a mixture of $M$ Gaussians per state, while the BLSTM prediction is modelled using a discrete probability distribution $p(b_t \mid s_t)$. The index $m$ denotes the mixture component, $c_{s_t m}$ is the weight of the $m$'th Gaussian associated with state $s_t$, and $\mathcal{N}(\cdot; \mu, \Sigma)$ represents a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. The distribution $p(b_t \mid s_t)$ is trained to model typical class confusions that occur in the BLSTM network.
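The emission probability of Eq. (7.84) can be sketched in Python. All parameter values below (GMM weights, means, covariances, the discrete distribution, and the stream weight) are illustrative placeholders, not values from the chapter:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov) with full covariance."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def emission_prob(x_t, b_t, weights, means, covs, p_discrete, a):
    """Eq. (7.84): multi-stream emission probability p(y_t | s_t).

    weights, means, covs: GMM parameters of the current state (M components).
    p_discrete: discrete distribution p(b | s_t) over BLSTM predictions.
    a: stream weight of the audio stream; the BLSTM stream gets 2 - a.
    """
    gmm = sum(c * gaussian_pdf(x_t, mu, cov)
              for c, mu, cov in zip(weights, means, covs))
    return gmm ** a * p_discrete[b_t] ** (2.0 - a)

# Illustrative usage with hypothetical parameters:
x_t = np.array([0.2, -0.1])                  # continuous audio features
b_t = 1                                      # discrete BLSTM prediction
weights = [0.6, 0.4]                         # M = 2 mixture weights
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
p_disc = np.array([0.1, 0.8, 0.1])           # p(b | s_t)
print(emission_prob(x_t, b_t, weights, means, covs, p_disc, a=1.2))
```

Note that with $a = 1$ both streams contribute equally, while $a = 2$ discards the BLSTM stream entirely.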
7.5 Evaluation
7.5.1 Partitioning and Balancing
We now deal with typical ways of evaluating the performance of audio recognition systems. We focus on measurements that judge the reliability of the recognition result, as these are of major interest in the extensive body of literature on intelligent speech, music, and sound analysis. However, as shown in the requirements section, a number of further aspects could be considered, such as real-time capability.
Evaluation should ideally be based on test partition(s) of suitable audio databases that have not been 'seen' during system optimisation. Such optimisation includes data-based tuning of any step in the chain of audio analysis, including enhancement, feature extraction and normalisation, feature selection, parameter selection for the learning algorithm, etc. Thus, besides a training partition, a 'development' partition is needed for the above-named optimisation steps. During the final system training, however, training and development partitions may be united in order to provide more learning material to the system. In general, one wishes all partitions to be reasonably large. For the test partition, this is needed in order to obtain statistically significant results. Popular 'percentage splits' are thus 40 %:30 %:30 % for training, development, and test. In the case of very large databases, as are often available in ASR, the test partition is often chosen smaller, e.g., around 10 %.
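A minimal sketch of such a percentage split; the instance IDs, the 40 %:30 %:30 % fractions, and the random seed are illustrative assumptions:

```python
import random

def split_partitions(items, fractions=(0.4, 0.3, 0.3), seed=0):
    """Randomly split a list of instance IDs into train/dev/test partitions.

    fractions: the illustrative 40 %:30 %:30 % split mentioned in the text.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_dev = int(fractions[1] * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to test
    return train, dev, test

train, dev, test = split_partitions(list(range(100)))
print(len(train), len(dev), len(test))  # → 40 30 30
```

In practice, splits are usually made speaker- or recording-session-disjoint rather than purely random over instances, so that the test partition is truly unseen.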