is referred to as the IS-2013 set, was used as a representation of the utterances, and the complexity parameter of the SVM classifier was optimized using the unweighted average recall (UAR) on the Development set.
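The passage above tunes the SVM complexity parameter against UAR (unweighted average recall) on the Development set; UAR averages per-class recall without weighting by class frequency, so it is not inflated by a majority class. A minimal sketch of the metric itself (the helper name `unweighted_average_recall` is illustrative, not from the chapter or any particular library):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """UAR: recall computed per class, then averaged with equal class weight.
    Illustrative helper; equivalent to macro-averaged recall."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Imbalanced toy example: 4 "conflict" utterances, 2 "neutral".
y_true = ["conflict"] * 4 + ["neutral"] * 2
y_pred = ["conflict", "conflict", "conflict", "neutral", "neutral", "neutral"]
# recall(conflict) = 3/4, recall(neutral) = 2/2, so UAR = (0.75 + 1.0) / 2
print(unweighted_average_recall(y_true, y_pred))  # 0.875
```

With accuracy, a classifier that always predicts the majority class would score 4/6 here; its UAR would be only 0.5, which is why the challenges use UAR for model selection.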
18.3.1 Audio Feature Sets
In this section, we describe the audio feature sets that we used for analyzing speech segments. This representation (Vogt and André 2005; Schuller et al. 2008) is a relatively new paradigm for speech analysis. It contrasts with the standard paradigm (a sequence of observation vectors): regardless of its duration, a speech utterance is represented by a single large set of features, which is termed an audio feature set. The feature set is based on several low-level descriptors (LLDs) that are
computed from short overlapping windows of the audio signal. These LLDs com-
prise the loudness, the harmonics-to-noise ratio, the zero-crossing rate, the spectral
and prosodic coefficients, the formant positions and bandwidths, the duration of
voiced/unvoiced speech segments, and features derived from the long-term average
spectrum such as band energies, roll-off, and centroid as well as voice quality
features such as jitter and shimmer. Various global statistical functions (functionals)
are computed on these LLDs to obtain feature vectors of equal size for each speech
utterance. The sequence of LLDs that are associated with speech utterances can
have different lengths, depending on the duration; the use of functionals allows us to
obtain one feature vector per speech utterance, with a constant number of elements.
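The mapping from a variable-length LLD sequence to a fixed-length vector can be sketched as follows. The functionals shown (mean, standard deviation, quartiles 1-3, linear regression slope) are a small illustrative subset of those used in the challenge sets, and `apply_functionals` is a hypothetical helper, not part of openSMILE:

```python
import statistics

def apply_functionals(lld_sequence):
    """Map one LLD contour of arbitrary length to a fixed set of statistics.
    Illustrative subset: mean, population std, quartiles 1-3, regression slope."""
    n = len(lld_sequence)
    mean_x = statistics.fmean(lld_sequence)
    q1, q2, q3 = statistics.quantiles(lld_sequence, n=4)
    # Slope of the least-squares line over the frame index (temporal variability).
    mean_t = (n - 1) / 2
    slope = sum((t - mean_t) * (x - mean_x)
                for t, x in enumerate(lld_sequence))
    slope /= sum((t - mean_t) ** 2 for t in range(n))
    return [mean_x, statistics.pstdev(lld_sequence), q1, q2, q3, slope]

# Two utterances of different durations yield vectors of identical length,
# so no time warping between sequences is needed.
short = [0.2, 0.5, 0.4]
long_ = [0.1, 0.3, 0.2, 0.6, 0.5, 0.4, 0.7]
assert len(apply_functionals(short)) == len(apply_functionals(long_)) == 6
```

In a full pipeline, this step is repeated for every LLD contour (pitch, loudness, MFCCs, ...) and the results are concatenated, which is how the challenge sets reach their dimensionalities from a few dozen LLDs and a few dozen functionals.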
This avoids expensive alignment procedures, such as dynamic programming algorithms for time warping between sequences of different lengths. Some functionals estimate the distribution of the LLD values (e.g., mean, standard deviation, quartiles 1-3), while others capture their temporal variability (e.g., number of peaks, linear regression slope). The four audio feature sets that we used for our experiments are the
set of features that was provided by the organizers of the Interspeech 2010 (IS-2010) Paralinguistic Challenge (Schuller et al. 2010), the set of features for the Interspeech 2011 (IS-2011) Speaker State Challenge (Schuller et al. 2011), the set of features for the Interspeech 2012 (IS-2012) Speaker Trait Challenge (Schuller et al. 2012), and the set of features for the Interspeech 2013 (IS-2013) Conflict Sub-Challenge (Schuller et al. 2013). All of the features were extracted using the open-source openSMILE feature extraction toolkit (Eyben et al. 2010). The IS-2010
feature set consists of 1,582 audio features, which were computed from 38 LLDs
and 21 functionals. The spectral features include loudness, mel-frequency cepstral
coefficients, mel-frequency band energy, and line spectral pair frequencies. The
prosodic and voice quality features comprise the pitch frequency and envelope, jitter,
and shimmer. Functionals such as the mean, standard deviation, kurtosis, skewness,
minimum and maximum value, relative position, linear regression coefficients, and
quartile and percentile coefficients were applied to the LLDs. The IS-2011 feature
set consists of 4,368 audio features, which were computed from 59 LLDs and
39 functionals. Additional LLDs, such as the auditory spectrum-derived loudness