The Tandem approach consists of a hierarchical structure in which the output of a first-level classifier based on neural networks is postprocessed and then sent to a second-level classifier based on HMM/GMM. The classifier at the first level consists of an MLP that takes as input a window of consecutive cepstral features and estimates phoneme posterior probabilities. The MLP outputs are then postprocessed by a logarithm function, or simply by removing the output softmax nonlinearity. Given the skewed distribution at the output of this postprocessing, the features are then "gaussianized" by a Karhunen-Loève (KL) linear transformation, or by more complex transformations such as HLDA [Zhu 04]. This transformation, together with a possible dimensionality reduction, makes the features more suitable for modeling by Gaussian mixtures. Finally, the outputs of the linear transformation are used as observation features in an HMM/GMM classifier.
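To make the pipeline above concrete, the following is a minimal sketch in Python (NumPy only); the function name tandem_features and all variable names are illustrative, and the KL transform is realized here as plain PCA on the log-posteriors, HLDA being a more elaborate alternative:

```python
# Minimal sketch of the Tandem feature pipeline described above.
# "mlp_posteriors" stands for the per-frame phoneme posterior
# estimates produced by the first-level MLP (names are illustrative).
import numpy as np

def tandem_features(mlp_posteriors, n_components=None, eps=1e-10):
    """Turn MLP phoneme posteriors into Gaussian-friendly features.

    mlp_posteriors: (T, K) array of per-frame posterior probabilities.
    n_components:   optional reduced dimensionality after the KL transform.
    """
    # 1. Postprocessing: a logarithm spreads out the skewed,
    #    probability-valued outputs of the softmax layer.
    x = np.log(mlp_posteriors + eps)

    # 2. "Gaussianization" by a Karhunen-Loeve (PCA) linear transform:
    #    decorrelate the log-posteriors using the eigenvectors of
    #    their covariance matrix.
    x = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # strongest components first
    basis = eigvecs[:, order]
    if n_components is not None:           # optional dimensionality reduction
        basis = basis[:, :n_components]

    # 3. The transformed features serve as HMM/GMM observations.
    return x @ basis

# Toy usage: 100 frames of posteriors over 40 phoneme classes.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=100)
obs = tandem_features(post, n_components=26)
print(obs.shape)   # (100, 26)
```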
This method has shown promising results in word recognition [Hermansky 00] and phoneme recognition [Fosler-Lussier 08] tasks in comparison to standard HMM/GMM systems based on cepstral features. Additionally, evaluations of the Tandem technique have been carried out in [Sivadas 02], where the MLP is replaced by a hierarchical MLP structure. However, a drawback of the Tandem technique is pointed out in [Aradilla 08]: part of the discriminative information at the output of the MLP can be lost during the postprocessing step.
Other attempts to combine discriminative and generative models are explored in [Pinto 08a]. In contrast to the Tandem technique, likelihoods estimated by an HMM/GMM based on cepstral features are used as input to an MLP that estimates phoneme posterior probabilities. This system is evaluated on a phoneme recognition task based on the hybrid HMM/ANN approach. Additionally, in the same work, more complex hierarchical structures are evaluated, aiming to combine likelihoods estimated by GMMs with posterior probabilities estimated by MLPs.
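As a rough illustration of this reversed combination, the sketch below (again NumPy only, with hypothetical names, and simplifying each phoneme's GMM to a single diagonal-covariance Gaussian) computes the per-frame log-likelihood vector that would replace cepstral features as the input of the second-level MLP:

```python
# Illustrative sketch only: one diagonal-covariance Gaussian per
# phoneme stands in for a full GMM, and all names are hypothetical.
import numpy as np

def loglik_features(frames, means, variances):
    """frames: (T, D) cepstral features; means/variances: (K, D) per phoneme.

    Returns a (T, K) matrix of Gaussian log-likelihoods, one column
    per phoneme class, to be fed to the second-level MLP.
    """
    diff = frames[:, None, :] - means[None, :, :]    # (T, K, D)
    ll = -0.5 * (np.log(2 * np.pi * variances)[None]
                 + diff ** 2 / variances[None])      # per-dimension terms
    return ll.sum(axis=-1)                           # (T, K)
```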
3.5.2 Conditional Random Fields
Recently, Conditional Random Fields (CRFs) have gained interest in the field of ASR, given their successful adaptation as statistical models for the speech recognition process. In fact, it is shown in [Gunawardana 05] that CRFs can be seen as a generalization of HMMs in which the transition probabilities are estimated by feature functions that depend on the entire observation sequence and on the time instance of the transition. Further theoretical advantages of CRFs over HMMs include a discriminative training criterion, the absence of independence assumptions about consecutive observed features, and the capacity to estimate negative evidence [Gunawardana 05, Lafferty 01, Fosler-Lussier 08].
A Conditional Random Field estimates the posterior probability of a label sequence $q_{1:T}$ given the entire observation sequence $o_{1:T}$, i.e., $p(q_{1:T} \mid o_{1:T})$.
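For concreteness, the standard linear-chain form of this posterior from [Lafferty 01] can be written as follows (the feature functions $f_k$, their weights $\lambda_k$, and the normalizer $Z$ follow the conventional notation rather than anything defined in this section):

\[
p(q_{1:T} \mid o_{1:T}) = \frac{1}{Z(o_{1:T})} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(q_{t-1}, q_t, o_{1:T}, t) \right),
\]

where the partition function $Z(o_{1:T})$ sums the same exponential over all possible label sequences. Note that each feature function receives the entire observation sequence $o_{1:T}$ and the time instance $t$, which is precisely the dependence structure contrasted with HMM transition probabilities above.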