Table 10.6 Frame-wise phoneme accuracies for BLSTM, LSTM, BRNN, and RNN predictors and for triphone HMMs, and word accuracies obtained for a baseline single-stream HMM, a Tandem system [35], and the proposed multi-stream recogniser (a = 1.1) using different network architectures

                 Accuracy [%]
                 Phoneme (framewise)   Word (Tandem)   Word (Multi-stream)
triphone HMMs    56.91                 43.36           -
RNN              48.91                 43.79           46.25
BRNN             50.51                 42.59           46.27
LSTM             58.91                 44.46           46.45
BLSTM            66.41                 45.04           46.50
validation set could be observed for at least 50 epochs, training was stopped. Then,
the network was chosen that achieved the best frame-wise phoneme error rate on the
validation set.
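The stopping and model-selection scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train_epoch` and `framewise_error` are hypothetical placeholders for one epoch of network training and for evaluating the frame-wise phoneme error rate on the validation set.

```python
def train_with_early_stopping(train_epoch, framewise_error, patience=50):
    """Stop when no validation improvement is seen for `patience` epochs,
    then return the network with the best validation error.

    `train_epoch(epoch)` returns the network weights after that epoch;
    `framewise_error(weights)` returns the validation frame-wise phoneme
    error rate. Both are hypothetical stand-ins for real routines.
    """
    best_error = float("inf")
    best_epoch = 0
    history = []  # (epoch, validation error, weights) triples
    epoch = 0
    while epoch - best_epoch < patience:  # no gain for `patience` epochs -> stop
        weights = train_epoch(epoch)
        err = framewise_error(weights)
        history.append((epoch, err, weights))
        if err < best_error:  # new best network on the validation set
            best_error, best_epoch = err, epoch
        epoch += 1
    # pick the network that achieved the best frame-wise error rate
    return min(history, key=lambda t: t[1])[2]
```

Note that the selected network is the historical best, not the final one, since training continues for up to 50 epochs past the best-performing epoch.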
The second column of Table 10.6 shows the frame-level phoneme accuracies for COSINE's test set obtained in this way. Generally, bidirectional context modelling prevails over unidirectional context modelling, and LSTM context modelling outperforms conventional RNNs. The best rate, 66.41 %, is achieved with a BLSTM network. The use of bidirectional context in low-latency, responsive, on-line applications is, however, limited, since the backward pass requires access to future input frames. For off-line transcription tasks, and for on-line tasks that tolerate a higher latency, BLSTM networks are perfectly applicable.
When a triphone HMM system as described below is used for frame-wise phoneme transcription, the rate is significantly lower, at 56.91 %; this is in line with [51]. However, triphone HMMs were able to outperform conventional RNN phoneme predictors (50.51 % and 48.91 % for bi- and unidirectional RNNs, respectively).
As explained, BLSTM phoneme estimates are now incorporated as an additional
feature stream into a multi-stream HMM framework for the recognition of con-
tinuous speech. To this end, each phoneme of the underlying left-to-right HMM
system is represented by three emitting states. The initial monophone models con-
sist of one Gaussian mixture for probability density function modelling per state.
They were trained using four iterations of embedded Baum-Welch re-estimation.
Then, the monophones were mapped to tied-state cross-word triphone models with
shared state transition probabilities. This sharing helps to reduce the number of
parameters that need to be estimated. Given COSINE's comparably limited size,
this is a reasonable standard measure. Two Baum-Welch iterations were executed
for re-estimation of the triphone models. Finally, the number of Gaussian mixture components of the triphone models was increased to 16 in four successive rounds of mixture doubling and re-estimation, with four re-estimation iterations in every round.
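The mixture-incrementing schedule just described (1 Gaussian per state, doubled in four rounds to 16, with four re-estimation passes per round) can be sketched as follows; the schedule bookkeeping is illustrative, and the actual Baum-Welch re-estimation is not implemented here.

```python
def mixture_up_schedule(start=1, rounds=4, iters_per_round=4):
    """Return the final mixture count and the ordered list of
    (mixture count, iteration index) re-estimation steps, mirroring
    the doubling-then-re-estimating recipe described in the text."""
    schedule = []
    mixtures = start
    for _ in range(rounds):
        mixtures *= 2  # mixture doubling: split each Gaussian component
        for it in range(iters_per_round):  # re-estimate after splitting
            schedule.append((mixtures, it))
    return mixtures, schedule

final, steps = mixture_up_schedule()  # final == 16, 16 re-estimation passes
```

Doubling in small steps with re-estimation in between, rather than jumping from 1 to 16 mixtures at once, is the standard way to keep the newly split Gaussians well trained.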
Acoustic models (AMs) and a back-off bi-gram language model were trained on COSINE's training set. The conditional probability table for the second feature stream was restricted to the 15 most likely phoneme confusions per state. Further, a floor value of 0.01 was used for the remaining confusion likelihoods. As shown in Table 10.6, the word accuracy of the single-stream HMM is 43.36 %.
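The restriction of the confusion table described above can be sketched as follows. The dict-of-dicts layout `{state: {phoneme: probability}}` is a hypothetical representation chosen for illustration; only the top-15-plus-floor rule itself is taken from the text.

```python
def prune_confusions(table, top_n=15, floor=0.01):
    """Per state, keep the estimated probabilities of the `top_n` most
    likely phoneme confusions and replace all remaining entries with
    the floor value, as described for the second feature stream."""
    pruned = {}
    for state, row in table.items():
        # identify the top_n most likely confusions for this state
        top = dict(sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
        # everything outside the top_n is floored at `floor`
        pruned[state] = {ph: (p if ph in top else floor) for ph, p in row.items()}
    return pruned
```

Flooring rather than zeroing the remaining confusions keeps the stream from assigning zero likelihood to rare but possible phoneme confusions during decoding.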