Table 18.3 Official feature sets of Interspeech Challenges

Feature set        IS-2010   IS-2011   IS-2012   IS-2013
# of LLDs          38        59        64        59
# of functionals   21        39        40        48
# of features      1,582     4,368     6,124     6,373

measure, RASTA-style filtered auditory spectra, and statistical spectral descriptors
(such as flux, entropy, and variance) have been introduced. Additional functionals, such
as quadratic regression coefficients, linear predictive coefficients, and peak distances,
allowed a better estimation of the temporal variability. The IS-2012 feature set consists of
6,124 audio features, which were computed from 64 LLDs and 40 functionals. A few
LLDs were added, including the logarithmic harmonics-to-noise ratio, spectral
harmonicity, and psychoacoustic spectral sharpness. Functionals related
to the local extrema, such as the statistics of inter-maxima distances, were
introduced. Functionals deemed uninformative were removed to limit the number of audio
features. The IS-2013 feature set consists of 6,373 audio features, computed from
59 LLDs and 48 functionals. A total of 724 audio features were removed from the
IS-2012 feature set, and 972 were added. New functionals related to the
local extrema, such as the modeling of inter-maxima, were introduced.
Table 18.3 summarizes the main characteristics of the feature sets used. The first
three feature sets were used for the detection of overlap, and the last feature set was
the official feature set for the detection of conflict.
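
The way a feature set is built from LLDs and functionals can be illustrated with a minimal sketch. It is not the official challenge extractor: the four functionals below (mean, standard deviation, linear regression slope, mean distance between local maxima) are only a small illustrative subset, and the function names and the example contours are hypothetical. The sketch applies every functional to every LLD contour, which the official sets need not do.

```python
from statistics import mean, pstdev


def local_maxima(contour):
    """Return the indices of strict local maxima in a 1-D contour."""
    return [i for i in range(1, len(contour) - 1)
            if contour[i - 1] < contour[i] > contour[i + 1]]


def linear_slope(contour):
    """Least-squares slope of the contour against the frame index.

    Assumes the contour has at least two frames.
    """
    n = len(contour)
    t_mean = (n - 1) / 2
    y_mean = mean(contour)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(contour))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def apply_functionals(contour):
    """Summarize one frame-level LLD contour with a few functionals."""
    peaks = local_maxima(contour)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return {
        "mean": mean(contour),
        "stddev": pstdev(contour),
        "slope": linear_slope(contour),
        # Mean inter-maxima distance in frames; 0.0 when fewer than two peaks.
        "mean_peak_distance": mean(gaps) if gaps else 0.0,
    }


def extract_features(llds):
    """Map {lld_name: contour} to a flat {lld_functional: value} dict."""
    features = {}
    for name, contour in llds.items():
        for func_name, value in apply_functionals(contour).items():
            features[f"{name}_{func_name}"] = value
    return features


# Hypothetical contours: two LLDs, seven and five frames long.
example = extract_features({
    "energy": [0, 1, 0, 1, 0, 1, 0],
    "f0": [1, 2, 3, 4, 5],
})
```

With two LLDs and four functionals, the sketch yields eight features per segment, independent of the segment length. This fixed-length summary is what makes such feature sets usable with standard classifiers.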
18.4 Interruption Detection
In the statistics analyzed in Sect. 18.2, the conflict level was shown
to be highly correlated with the mean number of interruptions (cf. Fig. 18.2), the
mean duration of overlap (cf. Fig. 18.3), and the percentage of overlap duration
(cf. Fig. 18.4). Detecting segments of overlap is a difficult problem without
individual microphones (Yamamoto et al. 2005). The main difficulty stems from the
nonstationary characteristics of the speech signal. An alternative approach is the
use of a microphone array (Quinlan and Asano 2007). In this case, estimating
the number of signal sources allows the detection of segments that contain more than
one source of speech. Another approach, applied to improve speaker
diarization, is speech segmentation with a three-class hidden Markov
model (Boakye et al. 2008), with the three classes corresponding to nonspeech,
speech, and overlapping speech. Mel-frequency cepstral coefficients (MFCC), root
mean square (RMS) energy, and linear predictive coding (LPC) residual energy
features were used, and they provided a precision of 66 % and a recall of
26 %. In our approach, we have chosen to develop a multi-resolution framework to
estimate the overlap duration percentage. This approach is based on the fusion of
various overlap detectors, in which each detector is estimated on segments of a
fixed, chosen duration.
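
The idea of fusing detectors that operate at different segment durations can be sketched as follows. The text above does not specify the fusion rule, so this sketch assumes a weighted average of frame-level scores followed by a fixed threshold. The function names, the segment lengths, the scores, and the 0.5 threshold are all hypothetical and serve only as an illustration.

```python
def segment_scores_to_frames(scores, segment_len, n_frames):
    """Expand one score per fixed-length segment to one score per frame."""
    frames = []
    for score in scores:
        frames.extend([score] * segment_len)
    return frames[:n_frames]


def fuse_detectors(detector_outputs, n_frames, weights=None):
    """Weighted average of frame-level scores from several detectors.

    detector_outputs: list of (segment_len_in_frames, per_segment_scores),
    one entry per detector, each detector using its own segment duration.
    """
    if weights is None:
        weights = [1.0] * len(detector_outputs)
    total = sum(weights)
    fused = [0.0] * n_frames
    for w, (segment_len, scores) in zip(weights, detector_outputs):
        frames = segment_scores_to_frames(scores, segment_len, n_frames)
        for i, s in enumerate(frames):
            fused[i] += w * s / total
    return fused


def overlap_percentage(fused, threshold=0.5):
    """Percentage of frames whose fused score reaches the threshold."""
    flagged = sum(1 for s in fused if s >= threshold)
    return 100.0 * flagged / len(fused)


# Two hypothetical detectors over an 8-frame recording:
# one scores 2-frame segments, the other scores 4-frame segments.
fine = (2, [0.0, 1.0, 1.0, 0.0])
coarse = (4, [0.2, 0.8])
fused_scores = fuse_detectors([fine, coarse], n_frames=8)
estimate = overlap_percentage(fused_scores)
```

In this toy case the short-segment detector localizes the overlap while the long-segment detector smooths the decision, and four of the eight frames are flagged. A learned fusion (for example, a regression on the detector outputs) could replace the plain average without changing the overall structure.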
 