Hierarchical Approach and Downsampling Schemes - Hierarchical Neural Network Structures for Phoneme Recognition

duration. In addition, no phoneme insertion penalty is used in the intermediate Viterbi decoder. Moreover, as in the uniform downsampling scheme, the training set of intermediate posteriors and the true labels are also downsampled based on the non-uniform sampling points. For training MLP 2, a window of $2d_2 + 1$ consecutive posterior vectors is used.

4.3 Window Downsampling

In the previous section we observed how irrelevant information can be removed at the input of MLP 2. As mentioned before, the input of MLP 2 is a window of $2d_2 + 1$ consecutive posterior vectors. Following the temporal-downsampling scheme, it may also be worth removing irrelevant information contained in the window of intermediate posteriors [Vasquez 09a]. To support this idea, the autocorrelation of the intermediate posteriors is estimated as a measure of the redundant information contained in the window:

$$R_{k,t_i} = \frac{1}{(T - t_i)\,\sigma_k} \sum_{t=1}^{T - t_i} \left[x_{k,t} - \mu_k\right]\left[x_{k,t+t_i} - \mu_k\right] \tag{4.3}$$

where $k$ indicates the $k$th dimension of the posterior vector $x_t$. $R_{k,t_i}$, $\mu_k$ and $\sigma_k$ are the autocorrelation, mean and variance of the $k$th dimension, respectively. Time is measured in frames, where $T$ is the total number of frames and $t_i$ is the frame shift.
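The estimate in Eq. (4.3) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `posterior_autocorrelation`, the `(T, K)` array layout and the use of the population variance (NumPy's default `ddof=0`) for $\sigma_k$ are assumptions made here.

```python
import numpy as np

def posterior_autocorrelation(posteriors, shift):
    """Normalized autocorrelation of Eq. (4.3), one value per posterior dimension.

    posteriors: array of shape (T, K), one posterior vector per frame (assumed layout).
    shift: frame shift t_i, with 0 <= shift < T.
    Returns an array of shape (K,).
    """
    x = np.asarray(posteriors, dtype=float)
    T = x.shape[0]
    if not 0 <= shift < T:
        raise ValueError("shift must satisfy 0 <= shift < T")
    mu = x.mean(axis=0)    # mu_k
    var = x.var(axis=0)    # sigma_k, taken as the population variance (assumption)
    centered = x - mu
    # Pair frame t with frame t + shift for t = 1 .. T - shift and sum the products.
    products = centered[: T - shift] * centered[shift:]
    return products.sum(axis=0) / ((T - shift) * var)
```

With this normalization a shift of zero gives 1 in every dimension whose variance is nonzero. Averaging the returned vector over the non-silence dimensions for a range of shifts would give a curve of the kind the text describes; which index corresponds to silence depends on the phoneme set and is not fixed here.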

Fig. 4.5 shows the average of the autocorrelation over all dimensions (excluding silence). As expected, there is a high correlation among neighboring frames since a phoneme stretches over a large temporal context. We can remove all this repeated information by performing a window downsampling at the input of MLP 2.

The window consists of $2d_2 + 1$ consecutive frames. Out of these, $M$ uniformly spaced frames can be selected as a new window at the input of MLP 2. The relation between $M$ and $d_2$ is given by:

$$M = \frac{2 d_2}{T_w} + 1 \tag{4.4}$$

where $T_w$ is the window sampling period. As an example, most of the experiments use $d_2 = 10$, covering the levels of correlation in the interval $t_i \in [-10, 10]$ given in Fig. 4.5. The levels of correlation for a window with $T_w = 5$ are also shown (indicated by bars). It can be observed that highly correlated information is skipped as $T_w$ increases. In addition, the number of posterior vectors at the input of MLP 2 is greatly reduced, which decreases the number of parameters. Moreover, Section 4.4.3 shows that this approach does not affect system accuracy.
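As a rough illustration of the window selection, the sketch below picks the $M$ frames of Eq. (4.4) around a center frame and concatenates their posterior vectors. The function names, the `(T, K)` array layout, the requirement that $T_w$ divides $d_2$ (so that the center frame is kept) and the clamping at utterance boundaries are all assumptions of this sketch; the text does not specify them.

```python
import numpy as np

def window_offsets(d2, t_w):
    """Frame offsets, relative to the center frame, kept after window downsampling."""
    if t_w <= 0 or d2 % t_w != 0:
        # Assumption of this sketch: t_w divides d2, so the offsets are
        # symmetric around 0 and their count matches M = 2*d2/t_w + 1.
        raise ValueError("t_w must be positive and divide d2")
    return np.arange(-d2, d2 + 1, t_w)

def downsampled_window(posteriors, center, d2, t_w):
    """Concatenate the M selected posterior vectors around frame `center`.

    posteriors: array of shape (T, K), one posterior vector per frame (assumed layout).
    Returns a vector of length M * K.
    """
    x = np.asarray(posteriors, dtype=float)
    frames = center + window_offsets(d2, t_w)
    # Boundary handling is not described in the text; clamping is an assumption.
    frames = np.clip(frames, 0, x.shape[0] - 1)
    return x[frames].reshape(-1)
```

For $d_2 = 10$ and $T_w = 5$ the selected offsets are -10, -5, 0, 5 and 10, so $M = 5$ posterior vectors are used instead of the 21 in the full window.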

The current and previous sections described two different schemes: temporal downsampling and window downsampling. These schemes make it possible to speed up the phoneme
