We also observed some tags that are non-standard English words, although we
filtered most out as not relevant to this analysis. One example is 'complexsystems'
(spelled as one word), which was kept as such, although the tags 'complex' and
'system' taken individually are also present in the set. Perhaps unsurprisingly, the
similarity computed between the tags 'complexsystems' and 'complex' is one of the
strongest between any tag pair in this set. One implication of this finding is that tag
distances could be used to find tags, such as 'complexsystems', that are minor
syntactic variants of better-known tags but that cannot be detected simply
by morphological stemming.
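To illustrate this point, the following minimal Python sketch contrasts a distributional similarity with a stemmer. Everything in it is an assumption made for illustration: the co-occurrence counts are invented, cosine similarity is assumed as the measure (it may differ from the measure used to obtain the results above), and naive_stem is a hypothetical toy suffix stripper, not a real stemming library.

```python
from math import sqrt

# Hypothetical co-occurrence counts (tag -> {co-occurring tag: count}).
# These numbers are invented for illustration only.
cooccurrence = {
    "complexsystems": {"networks": 8, "emergence": 6, "chaos": 3},
    "complex": {"networks": 9, "emergence": 5, "chaos": 4},
    "biology": {"genetics": 7, "networks": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def naive_stem(tag):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ies", "es", "s"):
        if tag.endswith(suffix) and len(tag) > len(suffix) + 2:
            return tag[: -len(suffix)]
    return tag


# The two variants share almost the same co-occurrence profile ...
sim_variant = cosine(cooccurrence["complexsystems"], cooccurrence["complex"])
# ... whereas an unrelated tag does not.
sim_other = cosine(cooccurrence["complexsystems"], cooccurrence["biology"])
# Stemming alone does not map the two variants onto the same form.
same_stem = naive_stem("complexsystems") == naive_stem("complex")
```

With these invented counts, the variant pair scores far higher than the unrelated pair, while the toy stemmer leaves 'complexsystems' and 'complex' as distinct forms.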
5.5 Identifying Tag Vocabularies in Folksonomies Using Community Detection Algorithms
The previous sections analyzed the temporal dynamics of distribution convergence
and stabilization in collaborative tagging as well as some latent information
structures, like tag correlation (or folksonomy) graphs, that can be created from
these tag distributions. In this section, we look at how these folksonomy graphs
could be used to identify shared tag vocabularies.
The problem considered in this section can be summarized as: given a heterogeneous
set of tags (which can be represented as a folksonomy graph), how can
we partition this set into subsets of related tags? We call this problem a vocabulary
identification problem. It is important to note that we use the term 'vocabulary'
only in a restricted sense, i.e. as a collection of related terms, relevant to a specific
domain. For instance, a list of tropical diseases, a list of electronic
components in a given electronic device, and a list of specialized
terms connected to a given scientific subfield would all be vocabularies in our
definition. We acknowledge that structural information is difficult to extract only
from tags given the simple structure of folksonomies. Nevertheless, our approach
could still prove useful in such applications: for example, one could construct the
set of related terms as a first rough step and then a human expert (or, perhaps,
another [semi]-automated method) could be used to add more detail to the extracted
vocabulary set.
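One way to make the quality of such a partition concrete is to score it with a modularity measure. The sketch below is only an illustration under stated assumptions: the tags and edge weights are invented, and the score is the standard Newman-Girvan modularity for weighted undirected graphs, which may differ in detail from the criterion used in this chapter. The function name modularity and the partition dictionaries are hypothetical.

```python
from collections import defaultdict

# Hypothetical weighted folksonomy graph: (tag_a, tag_b) -> similarity weight.
# Tags and weights are invented for illustration.
edges = {
    ("malaria", "dengue"): 0.9,
    ("malaria", "cholera"): 0.8,
    ("dengue", "cholera"): 0.7,
    ("resistor", "capacitor"): 0.9,
    ("resistor", "diode"): 0.8,
    ("capacitor", "diode"): 0.7,
    ("cholera", "diode"): 0.1,  # weak link between the two domains
}


def modularity(edges, partition):
    """Newman-Girvan modularity Q of a partition (tag -> community label).

    Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2,
    where m is the total edge weight, L_c the weight of edges inside c,
    and d_c the summed weighted degree of the tags in c.
    """
    total = sum(edges.values())
    internal = defaultdict(float)   # L_c
    strength = defaultdict(float)   # d_c
    for (a, b), w in edges.items():
        strength[partition[a]] += w
        strength[partition[b]] += w
        if partition[a] == partition[b]:
            internal[partition[a]] += w
    return sum(
        internal[c] / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )


# Candidate partitions of the same tag set.
by_domain = {"malaria": 0, "dengue": 0, "cholera": 0,
             "resistor": 1, "capacitor": 1, "diode": 1}
mixed = {"malaria": 0, "resistor": 0, "cholera": 0,
         "dengue": 1, "capacitor": 1, "diode": 1}
single = {tag: 0 for tag in by_domain}

q_domain = modularity(edges, by_domain)
q_mixed = modularity(edges, mixed)
q_single = modularity(edges, single)
```

On this toy graph the partition that follows the two domains scores higher than the mixed one, and placing all tags in a single community scores zero, which matches the intuition that a good vocabulary partition groups strongly related tags together.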
Note that the complexity-related disciplines data set (already introduced in
Sect. 5.4) is a useful tool to examine this question, since the initial set of tags is
heterogeneous (complexity science is, by its very nature, an interdisciplinary field),
but there are natural divisions into sub-fields, based on different criteria. This allows
easier intuitive interpretation of the obtained results (besides the mathematical
modularity criteria described below). The technique we will use in our approach is
based on the so-called 'community detection' algorithms, developed in the context
of complex systems and network analysis theory (Newman 2004). Such techniques
have been well studied at a formal level and have been used to study large-scale
networks in a variety of fields from social analysis (e.g. analysis of co-citation