10.2 Linguistics: Spontaneous Speech
Let us now consider spontaneous speech recognition in a real-life setting, as in [28], drawing on the related studies published in [11, 28-42]. In doing so, we move from isolated-word to speaker-independent continuous speech recognition. At the same time, we consider a setting with real-life noise conditions rather than additive noise. This shift is motivated, among other reasons, by the fact that ASR is increasingly applied in highly naturalistic human-machine interaction, such as with conversational agents [36, 43] or robots. This requires robust recognition of spontaneous, conversational, and thus often disfluent speech. Several strategies to cope with these challenges have been proposed [10, 19]. Most of these concentrate on improving the signal-processing front-end or the computational-intelligence back-end of HMM-based ASR systems. There are, however, also strategies that combine the HMM principle with MLPs or RNNs [33, 44, 45]. Roughly, these can be categorised into hybrid approaches, which apply neural networks to generate state posteriors for HMMs, and Tandem approaches, which use a neural network's output as features that are input to the HMM.
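To make the distinction concrete, the following sketch contrasts the two schemes in the log domain. It is a minimal illustration rather than an implementation from the cited works: the array names, shapes, and the scaled-likelihood conversion via division by state priors are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical inputs (names and shapes are illustrative, not from the text):
# acoustic_feats: (T, D) frame-wise acoustic features, e.g. MFCCs
# posteriors:     (T, Q) frame-wise NN posteriors p(q | o_t) over states/phonemes
# priors:         (Q,)   priors p(q), e.g. estimated from training alignments

def hybrid_scores(posteriors, priors, eps=1e-10):
    """Hybrid approach: convert NN state posteriors into scaled likelihoods
    p(o_t | q) ~ p(q | o_t) / p(q), used directly as HMM emission scores."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def tandem_features(acoustic_feats, posteriors, eps=1e-10):
    """Tandem approach: treat the (log-)posteriors as additional features and
    append them to the acoustic feature vector fed to a conventional HMM."""
    return np.hstack([acoustic_feats, np.log(posteriors + eps)])
```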
Given co-articulation effects in human speech, modelling of temporal context is essential. For this reason, the LSTM networks introduced earlier appear to be a promising alternative to standard feed-forward networks or RNNs. While temporal context is usually modelled at a higher level by context-dependent acoustic models, such as triphone models, and by language models, only a very limited and inflexible amount of context is modelled at the feature level: e.g., first- and second-order regression coefficients of LLDs are added to the feature vector, or a fixed number of successive feature frames are 'stacked', as sketched below. Only a few exceptions have aimed at modelling more context, e.g., [46].
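As a concrete illustration of this limited feature-level context, the sketch below computes first-order regression coefficients over a symmetric window and stacks successive frames. It is a minimal sketch assuming NumPy; the function names, the window width, and the edge padding are illustrative choices, not prescribed by the text.

```python
import numpy as np

def delta(feats, width=2):
    """First-order regression coefficients (deltas) over +/- `width` frames,
    following the standard regression formula; edges are padded by repetition."""
    T, D = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(theta ** 2 for theta in range(1, width + 1))
    d = np.zeros_like(feats)
    for theta in range(1, width + 1):
        d += theta * (padded[width + theta:width + theta + T]
                      - padded[width - theta:width - theta + T])
    return d / denom

def stack_frames(feats, context=4):
    """Stack +/- `context` successive frames into one long feature vector."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Typical augmentation: LLDs plus first- and second-order regression coefficients
# feats_aug = np.hstack([feats, delta(feats), delta(delta(feats))])
```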
Recently, BLSTM networks were shown to be superior to the triphone principle [47], and the application of BLSTM phoneme prediction has led to significant performance gains in phoneme classification and keyword spotting [30, 36, 48]. Building on the Tandem technique proposed in [35], which uses BLSTM phoneme predictions as additional feature vector components, this section introduces a multi-stream BLSTM-HMM architecture. This architecture models the BLSTM phoneme estimate as a second, independent stream of HMM observations, allowing for more accurate modelling of the observed phoneme predictions. Experiments demonstrating this effect are based on the COSINE corpus [49], which contains noisy conversational speech. Using the open-source speech processing toolkit openSMILE [50], this multi-stream technique is implemented in an on-line version in the final SEMAINE system (http://semaine-project.eu/) [43], a multimodal conversational agent.
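A minimal sketch of how such a second observation stream can enter the HMM emission score, assuming the common weighted-product combination of independent streams in the log domain. The shapes, the discrete modelling of the phoneme-prediction stream, and the stream weights are illustrative assumptions, not details taken from the text.

```python
import numpy as np

# Hypothetical inputs (illustrative, not from the text):
# log_b_acoustic: (T, J) log-likelihoods of the continuous acoustic stream
#                 for each frame t and HMM state j
# phoneme_pred:   (T,)   frame-wise BLSTM phoneme predictions (integer IDs)
# log_b_discrete: (J, P) per-state log-probabilities of each predicted
#                 phoneme symbol, estimated on training alignments

def multistream_log_emission(log_b_acoustic, phoneme_pred, log_b_discrete,
                             gammas=(1.0, 1.0)):
    """Weighted product of independent stream likelihoods, in the log domain:
    log b_j(o_t) = gamma_1 * log b_j1(x_t) + gamma_2 * log b_j2(q_t)."""
    g1, g2 = gammas
    # Second stream: look up, per frame, the score each state assigns to the
    # phoneme symbol the BLSTM predicted at that frame -> shape (T, J)
    log_b_phoneme = log_b_discrete[:, phoneme_pred].T
    return g1 * log_b_acoustic + g2 * log_b_phoneme
```

The stream weights gamma would typically be tuned on development data; setting the second weight to zero recovers the single-stream acoustic model.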
10.2.1 The COSINE Corpus
All experiments presented in Sect. 10.2.2 are speaker-independent. They were car-
ried out using the 'COnversational Speech In Noisy Environments' (COSINE)