Future experiments should include the design of bottleneck [40, 45] BLSTM networks. Further, the principle of Connectionist Temporal Classification (CTC) [33] could be exploited to let the networks perform the phoneme alignment themselves and thus improve the accuracy of the phoneme targets (which in the presented system were obtained by an HMM-based recogniser).
10.3 Non-Linguistics: Vocalisations
Apart from speech, i.e., linguistic entities, spoken language also contains non-linguistic vocalisations; their computational assessment will now be shown.
Discrimination of speech and non-linguistic vocalisations such as laughter or sighs plays an important role in speech recognition systems dealing with spontaneous speech, such as dialogue systems, call centre loops, or automatic transcription of meetings. In contrast to read speech, which conveys only the information contained in the spoken words and sentences, spontaneous speech contains considerably more such extra-linguistic information, as exemplified by the COSINE corpus introduced in the previous section. To avoid confusing non-linguistic information with linguistic information, and to enable higher-level natural language understanding, it is vital for an ASR engine to spot non-linguistic vocalisations and determine their type [52-59].
Several approaches have been proposed, in particular for the detection of filled pauses [60] and laughter [61-63]. In this section, we extend this to four different types of non-linguistic vocalisations, namely laughter, breathing, hesitation (e.g., “uhm”), and non-verbal consent (e.g., “aha”), and discriminate them from speech.
Furthermore, it will be shown that features generated by NMF can increase classification accuracy for this task when compared to traditional acoustic features such as MFCCs. To this end, a supervised NMF variant is suggested with pre-computed component spectra from instances of speech and non-linguistic vocalisations. This makes it possible to measure which spectra contribute most to the signal, based on the activations of these components. Previous work on NMF-based ASR uses NMF for speech enhancement applied during pre-processing. In contrast, it is now proposed to use NMF as a data-based feature extractor as introduced in Sect. 8.3. For sound classification, such an approach has been described in [64]; for non-linguistic vocalisation classification, however, this technique was first proposed in [12]. Experimental results are based on the TUM AVIC database [65] (cf. Sect. 5.3.1).
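The core of the supervised variant is to keep the pre-computed component spectra fixed and estimate only the activations for an incoming signal. The following is a minimal sketch of that step, assuming a Euclidean cost with multiplicative updates; the function and variable names are illustrative, not the book's own implementation:

```python
import numpy as np

def nmf_activations(V, W, n_iter=1000, eps=1e-9):
    """Estimate activations H such that V ~ W @ H, with W held fixed.

    V : (n_bins, n_frames) non-negative magnitude spectrogram
    W : (n_bins, n_components) pre-computed component spectra
        (e.g., learnt from speech and non-linguistic vocalisations)

    Uses the Lee-Seung multiplicative update for the Euclidean cost,
    restricted to H since the dictionary W is not adapted.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H
```

The resulting activations, e.g., summed per class of component spectra, can then serve as features indicating which class contributes most to the signal.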
10.3.1 Methodology
The input signal is transformed to the frequency domain. An STFT is applied with a Hamming window, a 25 ms window size, and a 10 ms frame rate. From the resulting
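Such a front-end can be sketched as follows; this is a minimal NumPy illustration in which the 16 kHz sampling rate and the function name are assumptions, not taken from the text:

```python
import numpy as np

def stft_frames(x, sr=16000, win_ms=25, hop_ms=10):
    """Magnitude STFT with a Hamming window (25 ms frames, 10 ms hop)."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    w = np.hamming(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * w
                       for i in range(n_frames)])
    # One magnitude spectrum per frame: (n_frames, win // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))
```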
 