Theoretical Framework for Phoneme Recognition Analysis - Hierarchical Neural Network Structures for Phoneme Recognition


$$c_{i1} = \beta\, p(ph_i \mid ph_i) \tag{6.3}$$

$$c_{i2} = \alpha\,(1 - c_{i1}) \tag{6.4}$$

$$c_{i3} = (1 - \alpha)(1 - c_{i1}) \tag{6.5}$$

$$c_{ij} = 0 \quad \text{for } j \geq 4 \tag{6.6}$$

By using $C$, each phoneme can be confused only with the two most likely phonemes. In addition, the factor $\beta$ makes it possible to adjust the level of confusion among phonemes. If $\beta\, p(ph_i \mid ph_i) \geq 1$, then $c_{i1} = 1$.
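The construction of one row of $C$ can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the entries are taken to be ordered by rank $j$ (correct phoneme first, then the two most likely competitors), and the third entry is assumed to receive the remaining mass $(1 - \alpha)(1 - c_{i1})$ so that the row sums to one.

```python
def confusion_row(p_correct, beta, alpha, n_phonemes):
    """Build one row [c_i1, c_i2, c_i3, 0, ...] of the confusion matrix.

    p_correct  : p(ph_i | ph_i), probability of recognizing ph_i correctly
    beta       : scaling factor for the confusion level
    alpha      : share of the remaining mass given to the most likely competitor
    n_phonemes : row length (hypothetical parameter, must be at least 3)
    """
    if n_phonemes < 3:
        raise ValueError("need at least three phonemes")
    # Clip at 1, matching the case beta * p(ph_i | ph_i) >= 1.
    c1 = min(1.0, beta * p_correct)
    c2 = alpha * (1.0 - c1)
    # Assumption: the second competitor takes whatever mass is left.
    c3 = (1.0 - alpha) * (1.0 - c1)
    return [c1, c2, c3] + [0.0] * (n_phonemes - 3)


if __name__ == "__main__":
    print(confusion_row(0.8, 1.0, 0.6, 5))
    print(confusion_row(0.8, 2.0, 0.6, 5))
```

With $\beta = 1$ the row keeps the original probability of correct recognition; a larger $\beta$ pushes $c_{i1}$ toward 1 and shrinks both confusion entries proportionally.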

Two ways of setting $T_{conf}$ are investigated. In the first, $T_{conf}$ is set to a constant value. In the second, $T_{conf}$ (denoted $T_{conf}^{i}$) varies depending on the phoneme emitted. If $q_t = ph_i$, then:

$$T_{conf}^{i} = \frac{l_i}{\lambda} \tag{6.7}$$

where $l_i$ is the average length of $ph_i$ in frames and $\lambda$ is a constant. Making $T_{conf}$ depend on $l_i$ prevents long periods from affecting short phonemes, which would otherwise result in a high number of phoneme deletions.

6.1.4 Windowing

A windowing module is introduced to smooth the posterior probabilities during phoneme transitions. In this work, the impulse response of this module is:

$$w(t) = \sum_{k=-1}^{1} w(k)\, \delta(t - k) \tag{6.8}$$

where $w(-1) = w(1) = 0.15$ and $w(0) = 0.7$. The values of $w(t)$ were determined heuristically.

The smoothed output $\tilde{x}_i(t)$ is obtained by convolution:

$$\tilde{x}_i(t) = w(t) * x_i(t) \tag{6.9}$$
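To make the smoothing concrete, the following sketch applies the three-tap window (0.15, 0.7, 0.15) to a short sequence of posterior values for one phoneme. The function name is hypothetical, and treating frames outside the sequence as zero is an assumption; the text does not say how the boundaries are handled.

```python
def smooth_posteriors(x, w=(0.15, 0.7, 0.15)):
    """Convolve the sequence x with a window centered at k = 0.

    w holds (w(-1), w(0), w(1)). Frames outside the sequence are
    treated as zero (an assumption, not taken from the text).
    """
    n = len(x)
    out = []
    for t in range(n):
        acc = 0.0
        for k in (-1, 0, 1):
            if 0 <= t - k < n:
                acc += w[k + 1] * x[t - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    # A hard 0 -> 1 transition becomes a gradual ramp after smoothing.
    print(smooth_posteriors([0.0, 0.0, 1.0, 1.0, 1.0]))
```

Because the window sums to one, constant stretches in the interior are left unchanged and only the frames around a transition are altered.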

6.1.5 MIMO Channel

A Multiple Input Multiple Output (MIMO) channel [Linder 05] is used, which is a linear time-invariant system representing a simplified model of the interaction among phonemes. The impulse response of the channel is given by:

$$H(t) = \begin{bmatrix} h_{11}(t) & h_{12}(t) & \cdots & h_{1n}(t) \\ h_{21}(t) & h_{22}(t) & \cdots & h_{2n}(t) \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1}(t) & h_{n2}(t) & \cdots & h_{nn}(t) \end{bmatrix} \tag{6.10}$$

and the received signal is obtained by convolution:
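As a rough illustration of what convolving with a matrix of impulse responses means, the sketch below uses the standard form for a linear time-invariant MIMO system, $y_i(t) = \sum_j \sum_k h_{ij}(k)\, x_j(t - k)$. Several things here are assumptions rather than details from the text: the function and variable names, the use of short causal FIR responses stored as lists, the index convention (row $i$ for the output, column $j$ for the input), and treating samples before $t = 0$ as zero.

```python
def mimo_convolve(H, x):
    """Pass the input sequences x through a matrix of FIR responses H.

    H[i][j] : list of taps h_ij(0), h_ij(1), ... (assumed causal and finite)
    x[j]    : input sequence of channel j, all of the same length
    Returns y, where y[i][t] = sum_j sum_k H[i][j][k] * x[j][t - k].
    """
    n_out = len(H)
    n_in = len(x)
    length = len(x[0])
    y = [[0.0] * length for _ in range(n_out)]
    for i in range(n_out):
        for j in range(n_in):
            for t in range(length):
                for k, h in enumerate(H[i][j]):
                    if t - k >= 0:  # samples before t = 0 are taken as zero
                        y[i][t] += h * x[j][t - k]
    return y


if __name__ == "__main__":
    # Two channels: channel 0 leaks half of channel 1's input,
    # channel 1 smears its own input over two frames.
    H = [[[1.0], [0.5]],
         [[0.0], [1.0, 0.5]]]
    x = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
    print(mimo_convolve(H, x))
```

The off-diagonal responses $h_{ij}$ with $i \neq j$ are what let activity in one phoneme channel influence another, which is how such a channel can model interaction among phonemes.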
