10.2 Linguistics: Spontaneous Speech
Let us now consider spontaneous speech recognition in a real-life setting, as in [28], drawing on the related studies published in [11, 28-42]. In doing so, we move from isolated-word to speaker-independent continuous speech recognition. At the same time, we consider a setting with real-life noise conditions rather than additive noise. This shift is motivated, among other reasons, by the fact that ASR is increasingly applied in highly naturalistic human-machine interaction, such as with conversational agents [36, 43] or robots. This requires robust recognition of spontaneous, conversational, and thus often disfluent speech. Several strategies to cope with these challenges have been proposed [10, 19]. Most of these concentrate on improving the signal-processing front-end or the computational-intelligence back-end of HMM-based ASR systems. There are, however, also strategies that combine the HMM principle with MLPs or RNNs [33, 44, 45]. Roughly, these can be categorised into hybrid approaches, which apply neural networks to generate state posteriors for HMMs, and Tandem approaches, which use a neural network's output as features that are input to the HMM.
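To make the distinction concrete, the following sketch contrasts the two schemes in the log domain. It is a minimal illustration rather than an implementation from the cited works: the array names, shapes, and the scaled-likelihood conversion via division by state priors are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical inputs (names and shapes are illustrative, not from the text):
# acoustic_feats: (T, D) frame-wise acoustic features, e.g. MFCCs
# posteriors:     (T, Q) frame-wise NN posteriors p(q | o_t) over states/phonemes
# priors:         (Q,)   priors p(q), e.g. estimated from training alignments

def hybrid_scores(posteriors, priors, eps=1e-10):
    """Hybrid approach: convert NN state posteriors into scaled likelihoods
    p(o_t | q) ~ p(q | o_t) / p(q), used directly as HMM emission scores."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def tandem_features(acoustic_feats, posteriors, eps=1e-10):
    """Tandem approach: treat the (log-)posteriors as additional features and
    append them to the acoustic feature vector fed to a conventional HMM."""
    return np.hstack([acoustic_feats, np.log(posteriors + eps)])
```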
Given co-articulation effects in human speech, modelling of temporal context is essential. For this reason, the LSTM networks introduced earlier appear to be a promising alternative to standard feed-forward networks or RNNs. While temporal context is usually modelled at a higher level by context-dependent acoustic models, such as triphone models, and by language models, only a very limited and inflexible amount of context is modelled at the feature level: e.g., first- and second-order regression coefficients of LLDs are added to the feature vector, or a fixed number of successive feature frames are 'stacked', as sketched below. Only a few exceptions have aimed at modelling more context, e.g., [46].
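As a concrete illustration of this limited feature-level context, the sketch below computes first-order regression coefficients over a symmetric window and stacks successive frames. It is a minimal sketch assuming NumPy; the function names, the window width, and the edge padding are illustrative choices, not prescribed by the text.

```python
import numpy as np

def delta(feats, width=2):
    """First-order regression coefficients (deltas) over +/- `width` frames,
    following the standard regression formula; edges are padded by repetition."""
    T, D = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(theta ** 2 for theta in range(1, width + 1))
    d = np.zeros_like(feats)
    for theta in range(1, width + 1):
        d += theta * (padded[width + theta:width + theta + T]
                      - padded[width - theta:width - theta + T])
    return d / denom

def stack_frames(feats, context=4):
    """Stack +/- `context` successive frames into one long feature vector."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Typical augmentation: LLDs plus first- and second-order regression coefficients
# feats_aug = np.hstack([feats, delta(feats), delta(delta(feats))])
```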
Recently, BLSTM networks were shown to be superior to the triphone principle [47], and the application of BLSTM phoneme prediction has led to significant performance gains in phoneme classification and keyword spotting [30, 36, 48]. Building on the Tandem technique proposed in [35], which uses BLSTM phoneme predictions as additional feature vector components, this section introduces a multi-stream BLSTM-HMM architecture. This architecture models the BLSTM phoneme estimate as a second, independent stream of HMM observations, allowing for more accurate modelling of the observed phoneme predictions. Experiments demonstrating this effect are based on the COSINE corpus [49], which contains noisy conversational speech. Using the open-source speech processing toolkit openSMILE [50], this multi-stream technique is implemented in an on-line version in the final SEMAINE system (http://semaine-project.eu/) [43], a multimodal conversational agent.
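A minimal sketch of how such a second observation stream can enter the HMM emission score, assuming the common weighted-product combination of independent streams in the log domain. The shapes, the discrete modelling of the phoneme-prediction stream, and the stream weights are illustrative assumptions, not details taken from the text.

```python
import numpy as np

# Hypothetical inputs (illustrative, not from the text):
# log_b_acoustic: (T, J) log-likelihoods of the continuous acoustic stream
#                 for each frame t and HMM state j
# phoneme_pred:   (T,)   frame-wise BLSTM phoneme predictions (integer IDs)
# log_b_discrete: (J, P) per-state log-probabilities of each predicted
#                 phoneme symbol, estimated on training alignments

def multistream_log_emission(log_b_acoustic, phoneme_pred, log_b_discrete,
                             gammas=(1.0, 1.0)):
    """Weighted product of independent stream likelihoods, in the log domain:
    log b_j(o_t) = gamma_1 * log b_j1(x_t) + gamma_2 * log b_j2(q_t)."""
    g1, g2 = gammas
    # Second stream: look up, per frame, the score each state assigns to the
    # phoneme symbol the BLSTM predicted at that frame -> shape (T, J)
    log_b_phoneme = log_b_discrete[:, phoneme_pred].T
    return g1 * log_b_acoustic + g2 * log_b_phoneme
```

The stream weights gamma would typically be tuned on development data; setting the second weight to zero recovers the single-stream acoustic model.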
10.2.1 The COSINE Corpus
All experiments presented in Sect. 10.2.2 are speaker-independent. They were car-
ried out using the 'COnversational Speech In Noisy Environments' (COSINE)