One of the earliest works on the automatic detection of “hot spots” in multi-party
conversations was presented by Wrede and Shriberg (2003). They investigated
whether involvement can be judged reliably by human listeners and found that
despite the subjective nature of the task, raters show significant agreement in
distinguishing involved from non-involved utterances. Furthermore, they found
temporal trajectories of fundamental frequency (F0) and energy to be reliable
acoustic cues for involvement.
Bousmalis et al. (2009) presented a survey of audio-visual cues of agreement
and disagreement. The most relevant features were found to be visual cues, such as
head gestures and facial and hand actions. However, auditory cues were also shown
to be important, including sighing and throat clearing as well as utterance lengths,
interruptions, delays, and pauses in speech.
A semi-automatic approach aimed at detecting conflict in conversations was
proposed by Pesarin et al. (2012). In that study, the authors adopted a generative
statistical technique based on Markov chains, capable of identifying turn-organization
regularities associated with conflict.
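To illustrate the general idea of modelling turn organization with a Markov chain, the following is a minimal sketch and not the model of Pesarin et al. It assumes each conversation has already been reduced to a sequence of discrete turn states; the state labels, the function names fit_transitions and log_likelihood, and the add-one smoothing are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

Transitions = Dict[str, Dict[str, float]]


def fit_transitions(sequences: List[Sequence[str]], smoothing: float = 1.0) -> Transitions:
    """Estimate first-order transition probabilities with additive smoothing."""
    if smoothing <= 0:
        raise ValueError("smoothing must be positive so every row is well defined")
    states = sorted({state for seq in sequences for state in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each observed (previous state, next state) pair.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs: Transitions = {}
    for prev in states:
        total = sum(counts[prev].values()) + smoothing * len(states)
        probs[prev] = {nxt: (counts[prev][nxt] + smoothing) / total for nxt in states}
    return probs


def log_likelihood(seq: Sequence[str], probs: Transitions) -> float:
    """Log-probability of the transitions in seq under the fitted chain."""
    return sum(math.log(probs[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))


# Hypothetical toy data: "A" and "B" stand for turns held by two speakers.
model = fit_transitions([["A", "B", "A", "B"]])
```

One possible use, again as an assumption rather than the published procedure, is to fit one chain on conversations labelled as conflictual and another on the rest, and then compare the log-likelihood of a new turn sequence under both.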
Kim et al. (2012) instead used a carefully selected set of features and employed
three different types of regression models, namely Bayesian Linear Regression,
Gaussian Processes for Regression, and Support Vector Regression, for both manual
and automatic diarization. They later extended their work to include the detection of
conflict escalation during the course of a conversation (Kim et al. 2012).
The Conflict Sub-Challenge of the Interspeech 2013 Computational Paralinguistics
Challenge (Schuller et al. 2012) introduced a benchmark data set to allow for the
objective comparison of approaches to the detection of conflict and the prediction
of conflict level. Based on this data set, Grèzes et al. (2013) studied the effect of
overlap on the automatic detection of conflict. They found the overlap ratio, i.e.
the ratio of overlapping speech to non-overlapping speech, to be the single best
feature for predicting the conflict level; using the predicted overlap ratio improved
the detection performance even more. They further investigated the effect of conflict
escalation or de-escalation on the prediction, but found that this did not lead to any
improvements.
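The overlap ratio defined above can be computed from a diarization output as in the following minimal sketch. It assumes the input is a list of (start, end) speech segments in seconds, and that segments belonging to the same speaker do not overlap each other; the function name overlap_ratio and the input format are assumptions for this example, not taken from Grèzes et al.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of one speech segment, in seconds


def overlap_ratio(segments: List[Segment]) -> float:
    """Return overlapping speech time divided by non-overlapping speech time."""
    events = []
    for start, end in segments:
        if end <= start:
            raise ValueError("segment end must be after segment start")
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    # At equal times, -1 sorts before +1, so touching segments do not overlap.
    events.sort()

    overlapped = 0.0  # time with two or more active speakers
    single = 0.0      # time with exactly one active speaker
    active = 0
    prev_time = None
    for time, delta in events:
        if prev_time is not None and time > prev_time:
            span = time - prev_time
            if active >= 2:
                overlapped += span
            elif active == 1:
                single += span
        active += delta
        prev_time = time

    if single == 0.0:
        raise ValueError("no non-overlapping speech; the ratio is undefined")
    return overlapped / single


# Toy example: 1 s of overlap (3 s to 4 s) against 7 s of single-speaker speech.
ratio = overlap_ratio([(0.0, 4.0), (3.0, 6.0), (8.0, 10.0)])
```

Silence contributes to neither term, so the ratio depends only on how speech time is split between overlapping and single-speaker regions.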
Given the importance of overlapping speech, a number of studies have presented
approaches for robustly estimating overlapping speech segments in multi-party
conversations.
Yamamoto et al. (2006) employed Support Vector Machines (SVMs) and Support
Vector Regression using microphone arrays in order to estimate the number of
sound sources. Boakye et al. (2011) applied a feature analysis technique called
Discriminant Capability Analysis, achieving almost oracle performance on their
database. Zelenák and Hernando (2011) successfully adopted a set of prosody-based
long-term features as a complement to short-term spectral parameters, followed by
a feature selection process according to a minimal-redundancy-maximal-relevance
(mRMR) criterion.
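The mRMR criterion mentioned above can be sketched as a greedy selection that, at each step, picks the feature with the highest mutual information with the target minus its mean mutual information with the features already chosen. The sketch below uses this difference form and assumes the features have already been discretized; the function names and the toy data are hypothetical, and this is not the setup used by Zelenák and Hernando.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def mrmr(features: Dict[str, Sequence[Hashable]], target: Sequence[Hashable], k: int) -> List[str]:
    """Greedily select up to k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    remaining = list(features)  # keeps insertion order, so ties go to the earlier feature
    selected: List[str] = []

    def score(name: str) -> float:
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" is fully redundant with "a", while "b" adds new information.
toy = {"a": [0, 0, 1, 1], "a_copy": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
chosen = mrmr(toy, target=[0, 1, 2, 3], k=2)
```

In the toy data all three features are equally relevant on their own, but the duplicate is penalized by its redundancy with the first pick, so the second slot goes to the complementary feature.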
A completely different approach was followed by Geiger et al. (2012), using
a combination of features derived from convolutive non-negative sparse coding
within a conventional Hidden Markov Model (HMM) system. This reduced the