et al., 2012; Jokinen et al., in print). Research into these areas not only increases our understanding of the particular phenomena, but also improves algorithms and techniques for their automatic recognition and analysis. This work concerns the development of principles and methods for creating and coding multimodal resources, as well as for processing and managing such resources in new technological settings.
Recently, several challenges have been introduced to foster research and development on multimodality, e.g. the ICMI-2012 Grand Challenges (http://www.acm.org/icmi/2012/index.php?id=challenges) and special challenge sessions at forthcoming speech and multimodal conferences on topics such as paralinguistics, emotion, autism, engagement, gestures and datasets. The popularity of such events appears to stem from the shared tasks, whose performance can be objectively measured, and from the shared annotated data, which allow the algorithms to be compared and evaluated whilst acknowledging the complexity of multimodal problems.
Manual annotation is expensive and time-consuming, and new technology has also boosted automatic analysis of the data. Speech technology can be used to annotate spoken dialogues, whilst image processing techniques can be used for face and hand gesture recognition on video files (Jongejan, 2012; Toivio and Jokinen, 2012). Exciting new possibilities are available with the help of motion capture devices such as the Kinect; cf. the experiments described in Csapo et al. (2012).
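As a rough illustration of such automatic pre-annotation (a minimal Python sketch, not the pipeline of the cited studies; the video file name and the tier label are invented for the example), off-the-shelf face detection can turn a video file into time-stamped annotation segments:

    # Sketch: detect frontal faces per frame with OpenCV's Haar cascade and
    # emit time-stamped segments during which a face is visible. Illustrative
    # only; a real pipeline would add smoothing and hand-gesture detectors.
    import cv2

    def detect_face_segments(video_path, min_gap_s=0.3):
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS unknown
        segments, start, last_hit, frame_idx = [], None, None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t = frame_idx / fps
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                             minNeighbors=5)
            if len(faces) > 0:
                if start is None:
                    start = t        # open a new face-visible segment
                last_hit = t
            elif start is not None and t - last_hit > min_gap_s:
                segments.append((start, last_hit))  # close after a gap
                start = None
            frame_idx += 1
        if start is not None:
            segments.append((start, last_hit))
        cap.release()
        return segments  # list of (onset, offset) in seconds

    if __name__ == "__main__":
        for onset, offset in detect_face_segments("dialogue.mp4"):
            print(f"face-visible\t{onset:.2f}\t{offset:.2f}")

The output segments can then be imported into an annotation tool as a pre-filled tier for manual correction, which is typically far cheaper than annotating from scratch.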
An interesting question concerns the definition of cross-modal annotation categories and the minimal units suitable for anchoring correlations. A commonly used spoken correlate for gestures is the word, but when studying, e.g., speech and gesture synchrony, this seems to be too big a unit: the emphasis of a gesture stroke (Kendon, 2004) seems to co-occur with vocal stress, which lands on a part of the intonation phrase corresponding to a syllable or a mora rather than to a whole word. When designing natural communication for intelligent agents, such cross-modal timing phenomena become relevant, as a delay in the expected synchrony may lead to confusion or a total misunderstanding of the intended message.
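The timing question can be made concrete with a small sketch (the tier contents below are hypothetical example data, not results from any cited study): given gesture-stroke apex times and stressed-syllable onsets from two annotation tiers, the signed lag between each apex and its nearest stressed onset shows whether the two streams align at syllable rather than word granularity.

    # Sketch: pair each gesture-stroke apex with the nearest
    # stressed-syllable onset and report the signed lag.
    stroke_apexes = [1.42, 3.87, 6.10]             # s, from a gesture tier
    stressed_syllable_onsets = [1.38, 3.95, 6.30]  # s, from a prosody tier

    def synchrony_lags(apexes, onsets):
        """Signed lag from each apex to its nearest stressed onset
        (positive: the stroke apex follows the syllable onset)."""
        return [min((apex - onset for onset in onsets), key=abs)
                for apex in apexes]

    for apex, lag in zip(stroke_apexes,
                         synchrony_lags(stroke_apexes,
                                        stressed_syllable_onsets)):
        print(f"stroke at {apex:.2f}s: lag {lag*1000:+.0f} ms")

Lags on the order of tens of milliseconds would be invisible at the word level but are exactly the scale at which perceived naturalness of an intelligent agent can break down.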
REFERENCES
Abdel Hady, M. and F. Schwenker. 2010. Combining Committee-Based Semi-Supervised Learning and Active Learning. Journal of Computer Science and Technology, 25(4): 681-698.