Digital Signal Processing Reference
The use of PLP features as speech representation leads to a relative error reduction
of 18.6 % (averaged over all evaluated noise conditions) when compared to
'conventional' MFCC. Furthermore, feature enhancement methods based on spectral
subtraction and normalisation, such as CMS, MVN, USS or HEQ, were able to partly remove
the effects of stationary coloured noise.
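
To make the normalisation idea concrete, the following is a minimal sketch of CMS and MVN applied to one utterance. It assumes the features are given as a list of frames, each a list of cepstral coefficients, and that statistics are computed per utterance; the function names cms and mvn and the eps floor are illustrative choices and are not taken from the evaluated systems.

```python
from statistics import fmean, pstdev


def cms(features):
    """Cepstral mean subtraction: subtract the per-coefficient mean
    computed over all frames of the utterance."""
    per_coeff = list(zip(*features))          # one tuple per coefficient
    means = [fmean(c) for c in per_coeff]
    return [[x - m for x, m in zip(frame, means)] for frame in features]


def mvn(features, eps=1e-10):
    """Mean and variance normalisation: zero mean and unit variance
    per coefficient. eps (an assumed floor) avoids division by zero."""
    per_coeff = list(zip(*features))
    means = [fmean(c) for c in per_coeff]
    stds = [max(pstdev(c), eps) for c in per_coeff]
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in features]


# Toy utterance: two frames with two coefficients each (made-up values).
frames = [[1.0, 10.0], [3.0, 14.0]]
print(cms(frames))
print(mvn(frames))
```

Both functions need only first- and second-order statistics of the utterance, which is consistent with their low computational cost compared with model-based enhancement.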
As a further approach to enhancing speech features, a global SLDM was used. It
aims to capture the dynamics of speech, enabling model-based speech enhancement
through joint speech and noise modelling, and it prevailed for all car noise types. In fact,
the method reached the best WA for the noisy isolated digit recognition task. The
usage of HCRF as an alternative model architecture did not outperform HMM. However,
embedding an SLDS into a SAR-HMM (modelling the raw signal in the time
domain) led to the best WA in the case of speech corrupted with additive WGN. Using
noisy training data to build AMs could also improve noise robustness. Mismatched
conditions training, which uses training sequences disturbed by a noise type different
from that in the test phase, outperformed training on clean data with a relative error
reduction of 54.5 %. This shows that multi-condition training is a promising direc-
tion. Further, computational complexity and possible fields of application have to be
considered when designing a robust speech recogniser. In this respect, AFE and USS
are more complex than feature normalisation techniques such as CMS or MVN, but
they are still suited for real-time applications. HEQ and SLDM feature enhancement
achieved better recognition rates, though at the cost of increased computational
complexity. Speech modelling in the time domain, as in the AR-SLDS, requires the most
computational resources and is therefore currently not suited for most real-life
applications. For stationary noises, the SLDM seems the most promising technique.
Yet, it relies on accurate voice activity detection.
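
The relative error reductions quoted above can be reproduced from two accuracy figures. The sketch below assumes that the error rate is taken as 100 % minus the WA, which is the usual convention for isolated word tasks; the function name and the example accuracies are hypothetical and are not results from the experiments.

```python
def relative_error_reduction(wa_baseline, wa_improved):
    """Relative error reduction in percent, given two word accuracies
    in percent. Assumes error rate = 100 - WA."""
    err_baseline = 100.0 - wa_baseline
    err_improved = 100.0 - wa_improved
    if err_baseline <= 0.0:
        raise ValueError("baseline has no errors left to reduce")
    return 100.0 * (err_baseline - err_improved) / err_baseline


# Hypothetical accuracies: the error falls from 20 % to 10 %,
# i.e. half of the baseline errors are removed.
print(relative_error_reduction(80.0, 90.0))
```

Because the measure is relative to the baseline error, the same absolute gain in WA yields a larger relative reduction when the baseline is already accurate.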
Future research effort could be spent on increasing the suitability of promising
concepts like SLDM feature enhancement by including discrete state transition
probabilities. An alternative would be to find the optimum compromise between an
increase of the history parameter and the computational complexity. Furthermore,
the AR-SLDS concept could be optimised for coloured noise when applying AR
speech modelling in this context. Further improvements might be achieved by com-
bining the different denoising concepts which were applied in this section.
In a continuous ASR task, as will be considered next, the parameters of a global
SLDM as well as the cumulative histogram for the HEQ method could be estimated
more precisely, because the observation sequences are longer than in the isolated
digit or spelling recognition experiments considered so far. ASR in noisy environments
remains challenging. However, as shown in this section, effort spent on finding accurate
techniques for auditory modelling, feature enhancement, speech modelling, and
model adaptation can considerably reduce the performance gap between automatic
speech recognition and human perception.