Digital Signal Processing Reference
duration. In addition, no phoneme insertion penalty is used in the intermediate Viterbi decoder. Moreover, as in the uniform downsampling scheme, the training set of intermediate posteriors and the true labels are also downsampled based on the non-uniform sampling points. For training MLP 2, a window of $2d_2 + 1$ consecutive posterior vectors is used.
4.3 Window Downsampling
In the previous section we observed how irrelevant information can be removed at the input of MLP 2. As mentioned before, the input of MLP 2 is a window of $2d_2 + 1$ consecutive posterior vectors. Following the temporal downsampling scheme, it may also be worth removing irrelevant information contained in the window of intermediate posteriors [Vasquez 09a]. To support this idea, the autocorrelation of the intermediate posteriors is estimated as a measure of the redundant information in the window:
$$R_{k,t_i} = \frac{1}{(T - t_i)\,\sigma_k} \sum_{t=1}^{T - t_i} [x_{k,t} - \mu_k][x_{k,t+t_i} - \mu_k] \tag{4.3}$$
where $k$ indicates the $k$th dimension of the posterior vector $x_t$. $R_{k,t_i}$, $\mu_k$ and $\sigma_k$ are the autocorrelation, mean and variance of the $k$th dimension, respectively. Time is measured in frames, where $T$ is the total number of frames and $t_i$ is the frame shift.
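As a rough illustration of Eq. (4.3), the normalized autocorrelation of one posterior dimension can be sketched in plain Python. The function name `autocorrelation` and the list-of-lists layout of the posteriors are assumptions made for this example only; the sketch handles non-negative frame shifts, and negative shifts follow by symmetry.

```python
def autocorrelation(posteriors, k, lag):
    """Normalized autocorrelation of dimension k at frame shift `lag` (Eq. 4.3).

    posteriors: list of T posterior vectors (each a list of floats).
    Assumes 0 <= lag < T and a non-zero variance in dimension k.
    """
    x = [frame[k] for frame in posteriors]  # trajectory of dimension k
    T = len(x)
    mean = sum(x) / T                           # mu_k
    var = sum((v - mean) ** 2 for v in x) / T   # sigma_k (variance)
    # Sum over the T - lag frame pairs that are `lag` frames apart.
    acc = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(T - lag))
    return acc / ((T - lag) * var)
```

At `lag = 0` the function returns 1 by construction, so values at larger shifts can be read as the fraction of correlation that remains between frames `lag` apart.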
Fig. 4.5 shows the average of the autocorrelation over all dimensions (excluding silence). As expected, there is a high correlation among neighboring frames, since a phoneme stretches over a large temporal context. We can remove this repeated information by performing window downsampling at the input of MLP 2.
The window consists of $2d_2 + 1$ consecutive frames. Out of these, $M$ uniformly spaced frames can be selected as a new window at the input of MLP 2. The relation between $M$ and $d_2$ is given by:
$$M = \frac{2 d_2}{T_w} + 1 \tag{4.4}$$
where $T_w$ is the window sampling period. As an example, in most of the experiments $d_2 = 10$, covering the levels of correlation in the interval $t_i = [-10, 10]$ given in Fig. 4.5. The levels of correlation for a window with $T_w = 5$ are also shown (indicated by bars); in that case Eq. (4.4) gives $M = 5$ frames instead of the original 21. It can be observed that highly correlated information can be ignored as $T_w$ increases. In addition, the number of posterior vectors at the input of MLP 2 is greatly reduced, which decreases the number of parameters. Moreover, Section 4.4.3 shows that this approach does not affect system accuracy.
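The frame selection behind Eq. (4.4) can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names `window_size` and `downsample_window` are hypothetical, $T_w$ is assumed to divide $2d_2$ exactly, and frames near the utterance boundaries are not handled.

```python
def window_size(d2, Tw):
    """Number of frames M kept in the window (Eq. 4.4).

    Assumes Tw divides 2*d2 so the kept frames are uniformly spaced.
    """
    if (2 * d2) % Tw != 0:
        raise ValueError("Tw must divide 2*d2 for uniformly spaced frames")
    return 2 * d2 // Tw + 1


def downsample_window(posteriors, center, d2, Tw):
    """Pick M uniformly spaced posterior vectors around frame `center`.

    Assumes center - d2 >= 0 and center + d2 < len(posteriors);
    edge handling (for example padding) is left out of this sketch.
    """
    offsets = range(-d2, d2 + 1, Tw)  # -d2, -d2 + Tw, ..., d2
    return [posteriors[center + o] for o in offsets]
```

With `d2 = 10` and `Tw = 5`, the selected offsets are -10, -5, 0, 5 and 10, so the input of MLP 2 shrinks from 21 posterior vectors to 5.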
The current and previous sections described two different schemes: temporal and window downsampling. These schemes make it possible to speed up the phoneme
 