classifiers on the two classes (N and O) on the Development set. Three audio feature
sets were compared: IS-2010, IS-2011, and IS-2012 (cf. Sect. 18.3.1). For all
two-class classifiers, the overlaps were more difficult to detect than the Non-Ovs, and
IS-2010 was the best feature set, with a UAR of slightly over 80 % for the
three detectors. For the architecture of the conflict detector, we chose to use only the
best two-class classifiers {N, O}_1, {N, O}_2, and {N, O}_5 with the IS-2010 audio
feature set. Our assumption is that only the best overlap classifiers are relevant for
the detection of conflict.
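The UAR figures quoted in this section can be illustrated with a minimal Python sketch. It assumes that UAR denotes the unweighted average recall, i.e. the mean of the per-class recalls, so that a rare class such as O weighs as much as the frequent class N. The function name and the toy labels are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return (UAR, per-class recalls) for two equal-length label sequences.

    UAR is the plain mean of the per-class recalls, so each class counts
    equally regardless of how many segments it contains.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled segment is required")

    total = defaultdict(int)    # number of reference segments per class
    correct = defaultdict(int)  # number of correctly classified segments per class
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    recalls = {label: correct[label] / total[label] for label in total}
    uar = sum(recalls.values()) / len(recalls)
    return uar, recalls


# Toy example (illustrative labels only): 8 non-overlap and 2 overlap segments.
reference = ["N"] * 8 + ["O"] * 2
predicted = ["N"] * 8 + ["N", "O"]
uar, per_class = unweighted_average_recall(reference, predicted)
```

In this toy case the plain accuracy is 90 %, while the UAR is 75 %, because the missed overlap segment halves the recall of the minority class. This is why UAR is a more informative figure than accuracy when one class dominates.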
18.4.3 Three-Class {N, L, H} Classifiers
Previous studies have presented different typologies of overlaps: overlap and backchannel
with overlap (Gravano and Hirschberg 2011), and competitive and collaborative
overlaps (Oertel et al. 2012). A backchannel indicates that the speaker producing it
follows and understands the other speaker. Backchannels are generally words,
onomatopoeias, or other sounds produced in the background (Clancy et al. 1996).
Collaborative and competitive interruptions are both manifested by speech overlap, but
only overlap from a competitive interruption can be related to a conflict
(Kurtić et al. 2012). In competitive overlaps, the incoming speaker attempts to forcefully
take over the turn. In collaborative overlaps, the incoming speaker assists the current
speaker in his or her speech. We chose to build classes of LLC-Ovs and HLC-Ovs
on the hypothesis that they would be acoustically separable and useful for
conflict detection. This choice is supported by the observation that some of the
LLC-Ovs of the Train set were backchannels with overlap and/or collaborative overlaps.
Using relabeling, three three-class SVM classifiers ({N, L, H}_1, {N, L, H}_2,
and {N, L, H}_5) were trained on the Train set. Each SVM classifies a segment
of a given duration (1, 2, or 5 s) as H, L, or N. To account for the imbalanced
class distribution, the over-represented category (N) was down-sampled by a given
factor: 8 for the {N, L, H}_1 detector, 6 for the {N, L, H}_2 detector, and 3 for
the {N, L, H}_5 detector. We investigated the effects of different feature sets on
the accuracy rate of the overlapping speech detection.
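The down-sampling step described above can be sketched as follows. The text gives only the factors (8, 6, and 3), not the selection procedure, so the uniform random subsampling used here is an assumption, as are the function name, the seed, and the toy class counts.

```python
import random


def downsample_class(samples, labels, target="N", factor=8, seed=0):
    """Keep about 1/factor of the items labelled `target` and all other items.

    Assumption: items of the over-represented class are dropped uniformly at
    random; the source does not state how they were selected.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    target_idx = [i for i, lab in enumerate(labels) if lab == target]
    keep_n = max(1, len(target_idx) // factor) if target_idx else 0
    kept_target = set(rng.sample(target_idx, keep_n))

    # Preserve the original order of the retained items.
    keep = [i for i, lab in enumerate(labels)
            if lab != target or i in kept_target]
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data (illustrative counts only): 80 N, 5 L, and 10 H segments.
toy_labels = ["N"] * 80 + ["L"] * 5 + ["H"] * 10
toy_samples = list(range(len(toy_labels)))
bal_samples, bal_labels = downsample_class(toy_samples, toy_labels,
                                           target="N", factor=8)
```

With a factor of 8, the 80 toy N segments are reduced to 10, while the L and H segments are left untouched. The shorter windows receive larger factors in the text, presumably because short segments produce many more N examples relative to overlap examples.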
Table 18.5 gives the accuracy rates of the three-class classifiers on the Development
set. Three audio feature sets were compared: IS-2010, IS-2011, and IS-2012.
IS-2010 was the best feature set for {N, L, H}_1, with a UAR of 61.1 %. IS-2011
was the best feature set for {N, L, H}_2, with a UAR of 61.3 %. IS-2010 was the best
feature set for {N, L, H}_5, with a UAR of 63.5 %. The LLC-Ovs are more difficult
to detect than the HLC-Ovs. Furthermore, the detection rate of the LLC-Ovs appears
to decrease as the analyzed segment gets shorter: 44.7 % for {N, L, H}_5
(5 s), 35.9 % for {N, L, H}_2 (2 s), and 31.7 % for {N, L, H}_1 (1 s). A possible
explanation is that the {N, L, H}_5 detector allows a better estimation of the
overlap durations than the other detectors and, consequently, a better discrimination
between the LLC-Ovs and the HLC-Ovs. Indeed, the LLC-Ovs are shorter on average
than the HLC-Ovs (1.98 s vs. 2.75 s). For the architecture of the conflict detector,