For multimodal video analysis, either the ELAN (Elan) or Anvil (Kipp, 2001)
annotation tools are commonly used. The growth of large corpora of
annotated multimodal data has raised questions about creating, coding,
processing, and managing more extensive multimodal resources, notably in
the context of European collaboration projects.
For instance, the European Telematics project MATE (Multilevel
Annotation Tools Engineering) aimed to facilitate the use and reuse of
spoken language resources, coding schemes and tools, and produced
the NITE workbench (Bernsen et al., 2002) that addresses theoretical
issues. Dybkjaer et al. (2002) provided an overview of the tools and
standards for multimodal annotation.
2.5 Inter-coder agreement
A number of methodological recommendations have been put
forward for validating the data and ensuring coherent and reliable
annotations. Mutual agreement between the annotators on the assigned
categories is one of the standard measures, and much attention has been
devoted to it (Cavicchio and Poesio, 2009; Rietveld and van Hout, 1993).
It is important to distinguish percentage agreement (how many times the
annotators are observed to assign the same category to the annotation
elements) from agreement that takes the expected agreement into account
(the probability that the annotators agree by chance). Agreement beyond
chance can be measured by Cohen's kappa coefficient κ, calculated as
follows:
κ = (P(A) - P(E)) / (1 - P(E))
where P(A) is the proportion of times the coders agree and P(E) is the
proportion of times they can be expected to agree by chance. The value of
κ is 1 in the case of total agreement and zero when the observed agreement
is no better than what would be expected by chance. According to Rietveld
and van Hout (1993), κ-values above 0.8 show almost perfect agreement,
those between 0.6 and 0.8 substantial agreement, those between 0.4 and 0.6
moderate agreement, those between 0.2 and 0.4 fair agreement, and those
below 0.2 only slight agreement beyond chance. Generally, a value above
0.6 is considered satisfactory.
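To make the calculation concrete, the following Python sketch computes κ for two coders; the function name cohens_kappa, the gesture category labels, and the toy data are illustrative assumptions rather than material from the text. As is standard for Cohen's kappa, P(E) is estimated from the product of each coder's marginal category proportions.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same annotation elements."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # P(A): observed proportion of elements assigned the same category.
    p_a = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # P(E): chance agreement, summed over categories from each coder's
    # marginal category proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two coders labelling ten gesture units.
a = ["beat", "deictic", "beat", "iconic", "beat",
     "beat", "deictic", "iconic", "beat", "beat"]
b = ["beat", "deictic", "beat", "beat", "beat",
     "beat", "deictic", "iconic", "beat", "iconic"]
print(round(cohens_kappa(a, b), 2))  # 0.64: substantial agreement
```

Here the coders agree on 8 of 10 elements (P(A) = 0.8), chance agreement is P(E) = 0.44, and κ = 0.36/0.56 ≈ 0.64, i.e. substantial agreement on the scale above.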
However, κ can often be very low whilst percentage agreement is still
quite high, and it has been argued that κ may therefore not be a suitable
statistic for assessing annotator reliability (see Cavicchio and Poesio,
2009, for discussion). For instance, if one of the coders has a strong
preference for a particular category, the likelihood of the coders agreeing
on that category by chance is increased and, consequently, the overall
agreement measured by κ is reduced.
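A short numerical sketch (again with hypothetical category labels) illustrates the effect: when one coder assigns nearly every element to a single category, a raw agreement of 90% can coincide with a κ of zero.

```python
# Coder B labels every one of 20 elements "beat"; coder A deviates twice.
a = ["beat"] * 18 + ["iconic", "iconic"]
b = ["beat"] * 20
p_a = sum(x == y for x, y in zip(a, b)) / 20        # 0.9 observed agreement
p_e = (18 / 20) * (20 / 20) + (2 / 20) * (0 / 20)   # 0.9 expected by chance
print(p_a, p_e, (p_a - p_e) / (1 - p_e))            # kappa = 0.0
```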