Table 10.6 Frame-wise phoneme accuracies for BLSTM, LSTM, BRNN, and RNN predictors and for triphone HMMs, and word accuracies obtained for a baseline single-stream HMM, a Tandem system [35], and the proposed multi-stream recogniser (a = 1.1) using different network architectures

                 Accuracy [%]
                 Phoneme (framewise)   Word (Tandem)   Word (Multi-stream)
triphone HMMs    56.91                 43.36           -
RNN              48.91                 43.79           46.25
BRNN             50.51                 42.59           46.27
LSTM             58.91                 44.46           46.45
BLSTM            66.41                 45.04           46.50
validation set could be observed for at least 50 epochs, training was stopped. Then,
the network was chosen that achieved the best frame-wise phoneme error rate on the
validation set.
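The stopping and model-selection scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train_epoch` and `framewise_error` are hypothetical placeholders for one epoch of network training and for evaluating the frame-wise phoneme error rate on the validation set.

```python
def train_with_early_stopping(train_epoch, framewise_error, patience=50):
    """Stop when no validation improvement is seen for `patience` epochs,
    then return the network with the best validation error.

    `train_epoch(epoch)` returns the network weights after that epoch;
    `framewise_error(weights)` returns the validation frame-wise phoneme
    error rate. Both are hypothetical stand-ins for real routines.
    """
    best_error = float("inf")
    best_epoch = 0
    history = []  # (epoch, validation error, weights) triples
    epoch = 0
    while epoch - best_epoch < patience:  # no gain for `patience` epochs -> stop
        weights = train_epoch(epoch)
        err = framewise_error(weights)
        history.append((epoch, err, weights))
        if err < best_error:  # new best network on the validation set
            best_error, best_epoch = err, epoch
        epoch += 1
    # pick the network that achieved the best frame-wise error rate
    return min(history, key=lambda t: t[1])[2]
```

Note that the selected network is the historical best, not the final one, since training continues for up to 50 epochs past the best-performing epoch.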
The second column of Table 10.6 shows the frame-level phoneme accuracies for COSINE's test set obtained in this way. Generally, bidirectional context modelling prevails over unidirectional context modelling, and LSTM context modelling outperforms conventional RNNs. The best rate, 66.41 %, is achieved with a BLSTM network. The use of bidirectional context in low-latency, responsive, on-line applications is, however, limited, since the backward pass requires access to future input frames. For off-line transcription tasks, and for on-line tasks that tolerate a higher latency, BLSTM networks are perfectly applicable.
When a triphone HMM system as described below is used for frame-wise phoneme transcription, the rate is significantly lower, at 56.91 %; this is in line with [51]. However, triphone HMMs were able to outperform conventional RNN phoneme predictors (50.51 % and 48.91 % for bi- and unidirectional RNNs, respectively).
As explained, BLSTM phoneme estimates are now incorporated as an additional
feature stream into a multi-stream HMM framework for the recognition of con-
tinuous speech. To this end, each phoneme of the underlying left-to-right HMM
system is represented by three emitting states. The initial monophone models con-
sist of one Gaussian mixture for probability density function modelling per state.
They were trained using four iterations of embedded Baum-Welch re-estimation.
Then, the monophones were mapped to tied-state cross-word triphone models with
shared state transition probabilities. This sharing helps to reduce the number of
parameters that need to be estimated. Given COSINE's comparably limited size,
this is a reasonable standard measure. Two Baum-Welch iterations were executed
for re-estimation of the triphone models. Finally, the number of Gaussian mixture components of the triphone models was increased to 16 in four successive rounds of mixture doubling and re-estimation, with four re-estimation iterations in every round.
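The mixture-incrementing schedule just described (1 Gaussian per state, doubled in four rounds to 16, with four re-estimation passes per round) can be sketched as follows; the schedule bookkeeping is illustrative, and the actual Baum-Welch re-estimation is not implemented here.

```python
def mixture_up_schedule(start=1, rounds=4, iters_per_round=4):
    """Return the final mixture count and the ordered list of
    (mixture count, iteration index) re-estimation steps, mirroring
    the doubling-then-re-estimating recipe described in the text."""
    schedule = []
    mixtures = start
    for _ in range(rounds):
        mixtures *= 2  # mixture doubling: split each Gaussian component
        for it in range(iters_per_round):  # re-estimate after splitting
            schedule.append((mixtures, it))
    return mixtures, schedule

final, steps = mixture_up_schedule()  # final == 16, 16 re-estimation passes
```

Doubling in small steps with re-estimation in between, rather than jumping from 1 to 16 mixtures at once, is the standard way to keep the newly split Gaussians well trained.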
Acoustic models (AMs) and a back-off bi-gram language model were trained on COSINE's training set. The conditional probability table for the second feature stream was restricted to the 15 most likely phoneme confusions per state. Further, a floor value of 0.01 was used for the remaining confusion likelihoods. As shown in Table 10.6, the word accuracy of the single-stream HMM is 43.36 %.
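The restriction of the confusion table described above can be sketched as follows. The dict-of-dicts layout `{state: {phoneme: probability}}` is a hypothetical representation chosen for illustration; only the top-15-plus-floor rule itself is taken from the text.

```python
def prune_confusions(table, top_n=15, floor=0.01):
    """Per state, keep the estimated probabilities of the `top_n` most
    likely phoneme confusions and replace all remaining entries with
    the floor value, as described for the second feature stream."""
    pruned = {}
    for state, row in table.items():
        # identify the top_n most likely confusions for this state
        top = dict(sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
        # everything outside the top_n is floored at `floor`
        pruned[state] = {ph: (p if ph in top else floor) for ph, p in row.items()}
    return pruned
```

Flooring rather than zeroing the remaining confusions keeps the stream from assigning zero likelihood to rare but possible phoneme confusions during decoding.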