6.1.3 Phoneme Confusion
The phoneme confusion module (CONF) is based on the confusion matrix ($C$) at the output of the first hierarchical level (MLP 1). This square matrix of order $n$, where $n$ is the total number of phonemes, is estimated at the frame level. $C$ gives the probability that, at any time instant $t$, an emitted phoneme $ph_i$ is confused with a phoneme $ph_j$, i.e., $p(ph_j \mid ph_i)$, where $i$ and $j$ index the row and column of the matrix, respectively. For example, $c_{db} = 0.2$ in Fig. 6.2.
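The text states only that $C$ is estimated at the frame level. The sketch below shows one plausible way to do that by counting frame pairs and normalizing each row. It rests on several assumptions that the text does not make: a reference phoneme label is available for every frame (e.g. from a forced alignment), the MLP 1 decision at each frame is taken as the arg-max of its posterior vector, and the helper name `estimate_confusion` is hypothetical. The toy labels are invented so that the result happens to reproduce the $C$ of Fig. 6.2.

```python
import numpy as np

def estimate_confusion(ref_frames, hyp_frames, phonemes):
    """Estimate C at the frame level, with C[i, j] ~ p(ph_j | ph_i).

    ref_frames: reference (emitted) phoneme of each frame; assumed to come
        from an alignment, which the text does not describe.
    hyp_frames: phoneme selected from the MLP 1 output at the same frames
        (assumed here to be the arg-max of the posterior vector).
    """
    index = {ph: k for k, ph in enumerate(phonemes)}
    n = len(phonemes)
    counts = np.zeros((n, n))
    for ref, hyp in zip(ref_frames, hyp_frames):
        counts[index[ref], index[hyp]] += 1  # row: emitted ph_i, column: output ph_j
    totals = counts.sum(axis=1, keepdims=True)
    # Normalize each row to sum to 1; rows of phonemes never seen stay all-zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy frame labels (made up for illustration) that reproduce C in Fig. 6.2.
ref = ["b"] * 10 + ["d"] * 10 + ["p"] * 5
hyp = (["b"] * 9 + ["d"]
       + ["b"] * 2 + ["d"] * 7 + ["p"]
       + ["b"] * 3 + ["p"] * 2)
C = estimate_confusion(ref, hyp, ["b", "d", "p"])
print(C)
```

Row-normalizing the counts is what makes each row a conditional distribution over output phonemes. The CONF module relies on this property when it accumulates probabilities along a row.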
This module works as follows. First, each row of $C$ is sorted in descending order, which yields the matrix $\bar{C}$ shown in Fig. 6.2. Consequently, whereas $C$ usually has its highest values on the main diagonal, $\bar{C}$ has its highest values in the first column. In addition, unlike in $C$, the phonemes associated with the columns of $\bar{C}$ differ from row to row.
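$$
C = \begin{array}{c|ccc}
 & \mathrm{b} & \mathrm{d} & \mathrm{p} \\ \hline
\mathrm{b} & 0.90 & 0.10 & 0.00 \\
\mathrm{d} & 0.20 & 0.70 & 0.10 \\
\mathrm{p} & 0.60 & 0.00 & 0.40
\end{array}
\qquad
\bar{C} = \begin{array}{c|ccc}
\mathrm{b} & 0.90\,\mathrm{b} & 0.10\,\mathrm{d} & 0.00\,\mathrm{p} \\
\mathrm{d} & 0.70\,\mathrm{d} & 0.20\,\mathrm{b} & 0.10\,\mathrm{p} \\
\mathrm{p} & 0.60\,\mathrm{b} & 0.40\,\mathrm{p} & 0.00\,\mathrm{d}
\end{array}
\qquad
\tilde{C} = \begin{array}{c|ccc}
\mathrm{b} & 0.90\,\mathrm{b} & 0.07\,\mathrm{d} & 0.03\,\mathrm{p} \\
\mathrm{d} & 0.70\,\mathrm{d} & 0.21\,\mathrm{b} & 0.09\,\mathrm{p} \\
\mathrm{p} & 0.40\,\mathrm{p} & 0.42\,\mathrm{b} & 0.18\,\mathrm{d}
\end{array}
$$
Fig. 6.2. Example of the different confusion matrix representations. For $\tilde{C}$, $\beta = 1$ and $\alpha = 0.7$.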
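The row-sorting step that turns $C$ into $\bar{C}$ can be sketched as follows with NumPy. It keeps, for every entry, the label of the phoneme it came from, since the column phonemes of $\bar{C}$ differ from row to row. The function name `sort_rows` is hypothetical. The stable tie-breaking, which keeps the original column order when two probabilities are equal, is an assumption, because the text does not say how ties are ordered.

```python
import numpy as np

PHONEMES = ["b", "d", "p"]
# Confusion matrix C from Fig. 6.2 (rows: emitted ph_i, columns: ph_j).
C = np.array([[0.90, 0.10, 0.00],
              [0.20, 0.70, 0.10],
              [0.60, 0.00, 0.40]])

def sort_rows(C, phonemes):
    """Return (values, labels) for C-bar: each row of C sorted in descending
    order, where labels[i][k] names the phoneme of column k in row i."""
    # Stable argsort on the negated values gives descending order and keeps
    # the original column order among equal probabilities (assumed tie rule).
    order = np.argsort(-C, axis=1, kind="stable")
    values = np.take_along_axis(C, order, axis=1)
    labels = [[phonemes[j] for j in row] for row in order]
    return values, labels

C_bar, C_bar_labels = sort_rows(C, PHONEMES)
print(C_bar)
print(C_bar_labels)
```

The values and labels are returned separately so that later steps can read a probability and its phoneme from the same position in a row.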
Next, a random number $rnd_t$ in the interval $[0, 1]$ is generated once every confusion period $T_{conf}$ and retained for the entire period. How $T_{conf}$ is set is explained below. For now, assume that at time instant $t$ the current phoneme and random number are $q_t = ph_i$ and $rnd_t$, respectively. The row of $\bar{C}$ indicated by $ph_i$ is then scanned from the first to the last column, looking for the interval that contains $rnd_t$. The upper boundary of the interval at each column is the sum of the probabilities from the first column up to the current one. When the interval is found, the phoneme corresponding to that column is emitted. This procedure is performed every frame. However, setting $T_{conf} > 1$ frame ensures that the emitted phonemes last several frames.
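The per-frame procedure just described can be sketched in Python with NumPy. It uses the $\bar{C}$ values and labels of Fig. 6.2. The sketch makes several assumptions. $T_{conf}$ is a fixed integer number of frames, passed as the hypothetical parameter `t_conf`. Periods start at frame 0. $rnd_t$ is drawn with NumPy's `Generator.random`, which samples $[0, 1)$ rather than the closed interval in the text. Each column's interval is taken as (previous cumulative sum, current cumulative sum]. The names `select_phoneme` and `confuse` are hypothetical.

```python
import numpy as np

# Row-sorted matrix C-bar from Fig. 6.2: values and the phoneme label of each entry.
PHONEMES = ["b", "d", "p"]
C_BAR = np.array([[0.90, 0.10, 0.00],
                  [0.70, 0.20, 0.10],
                  [0.60, 0.40, 0.00]])
C_BAR_LABELS = [["b", "d", "p"],
                ["d", "b", "p"],
                ["b", "p", "d"]]

def select_phoneme(values, labels, rnd):
    """Scan one row of C-bar and return the label of the first column whose
    interval (previous cumulative sum, current cumulative sum] contains rnd."""
    upper = np.cumsum(values)  # upper boundary of each column's interval
    k = int(np.searchsorted(upper, rnd, side="left"))  # first boundary >= rnd
    return labels[min(k, len(labels) - 1)]  # guard against rounding when rnd is ~1

def confuse(q, t_conf, rng):
    """Map the per-frame phonemes q[t] to output phonemes.

    A new rnd_t is drawn at the start of each period of t_conf frames
    (hypothetical fixed integer) and held for the whole period.
    """
    row_of = {ph: i for i, ph in enumerate(PHONEMES)}
    out = []
    rnd = 0.0
    for t, ph in enumerate(q):
        if t % t_conf == 0:
            rnd = rng.random()  # uniform in [0, 1)
        i = row_of[ph]
        out.append(select_phoneme(C_BAR[i], C_BAR_LABELS[i], rnd))
    return out

# Usage: ten frames of /d/ with a confusion period of five frames.
print(confuse(["d"] * 10, t_conf=5, rng=np.random.default_rng(0)))
```

Because both $q_t$ and $rnd_t$ stay constant within a period in this usage, the output phoneme stays constant for at least `t_conf` frames. That is the effect the text attributes to $T_{conf} > 1$.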
As an example, assume that $ph_i = /d/$ and $rnd_t = 0.95$. Based on $\bar{C}$ in Fig. 6.2, the output of this module would be $ph_j = /p/$ or, equivalently, the posterior vector of $A_x$ corresponding to $ph_j = /p/$.
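The interval search in this example can be checked by hand on row /d/ of $\bar{C}$ in Fig. 6.2, accumulating the probabilities column by column:

```latex
\begin{aligned}
\bar{C}_{d} &= (0.70\ /d/,\ 0.20\ /b/,\ 0.10\ /p/) \\
\text{upper boundaries} &:\ 0.70,\quad 0.70 + 0.20 = 0.90,\quad 0.90 + 0.10 = 1.00 \\
0.90 &< rnd_t = 0.95 \le 1.00 \;\Rightarrow\; \text{third column} \;\Rightarrow\; ph_j = /p/
\end{aligned}
```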
In this work, a variation of $\bar{C}$, denoted $\tilde{C}$, is also investigated, as shown in Fig. 6.2. As in $\bar{C}$, the rows are organized in descending order. However, in $\tilde{C}$ the first column is forced to keep the values of the main diagonal of $C$. In addition, the values of the remaining columns depend on the values of the first column as follows:
 