just 6. The basic reason is that no relationship between the term 'complexity' and
the other 40 terms can be inferred from the query log data. These relationships
either do not appear in the query logs at all or are statistically too weak (based on
only a few instances).
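To make concrete what 'inferring a relationship from the query logs' involves, the following is a minimal sketch, not the actual pipeline used here, of how term-term relationships could be extracted from raw queries: two terms are treated as related when they co-occur in the same query, and pairs supported by only a handful of queries are discarded as statistically too weak. The tokenization, the vocabulary, and the minimum support threshold are all illustrative assumptions.

from collections import Counter
from itertools import combinations

def cooccurrence_counts(queries, vocabulary):
    """Count how often each pair of vocabulary terms co-occurs in the same query."""
    vocab = {t.lower() for t in vocabulary}
    counts = Counter()
    for query in queries:
        # Keep only the terms of interest; sorting makes each pair canonical.
        terms = sorted(set(query.lower().split()) & vocab)
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts

def related_term_pairs(queries, vocabulary, min_support=5):
    """Discard pairs supported by fewer than min_support queries (an arbitrary, illustrative threshold)."""
    counts = cooccurrence_counts(queries, vocabulary)
    return {pair: n for pair, n in counts.items() if n >= min_support}

Under such a scheme, a term that appears in only a few queries cannot accumulate enough co-occurrences to pass even a modest support threshold, which is the sparseness effect discussed next.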
It is important to emphasize here that this result is not an artifact of the cosine
similarity measure we use. Even if we use another, more sophisticated distance measure
between keywords, such as those suggested in the previous literature (Cilibrasi
and Vitanyi 2007), we obtain very similar results. The fundamental reason for the
sparseness of the resulting graph is that the query log data itself does not contain
enough relevant information about complexity-related disciplines. For example,
among the 101,000,000 queries, the term 'complexity' appears exactly 138 times,
and a term such as 'networks' 1,074 times. Important terms such as 'cognition' or
'semantics' are even less common, appearing only 47 and 26 times, respectively,
among more than 100 million queries. Therefore, it is fair to conclude that the
query log data, while very large in size, is quite poor in useful information about
the domain of complexity-related sciences. As a caveat, we do note that more common
terms, such as 'community' (78,862 times), 'information' (36,520 times), 'art' (over
52,000), or even 'agent' (about 7,000), appear considerably more frequently, but these words
are in general language use and are not restricted to the scientific domain.
Consequently, these higher frequencies do not prove very useful for identifying
the relationship of these terms to complexity science, which was our initial target
question.
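For reference, the cosine similarity mentioned above is the standard one; assuming each term i is represented by a vector v_i of co-occurrence counts with the other terms in the query log (the exact vector construction is not restated here), it reads

\[
\cos(v_i, v_j) \;=\; \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}
\;=\; \frac{\sum_k v_{ik}\, v_{jk}}{\sqrt{\sum_k v_{ik}^{2}}\,\sqrt{\sum_k v_{jk}^{2}}}.
\]

When a term such as 'complexity' appears only 138 times in over 100 million queries, its co-occurrence vector is almost entirely zero, so its similarity to every other term is negligible, which is why switching to a different distance measure does not change the picture.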
Turning our attention to the second graph in Fig. 5.17 and the partition in
Fig. 5.18, we can see that query logs can also produce good results, although
they are somewhat different from those obtained from tagging. For example, if
we compare the partition obtained in Fig. 5.13 (resulting from tagging data) with
the one in Fig. 5.18 (from query log data), we see that
tagging produces a more precise partition of the disciplines into scientific sub-fields.
For instance, it is clear from Fig. 5.13 that cluster 1 corresponds to mathematics,
optimization and computation, cluster 2 to markets and economics, cluster 5 to
biology and genetics, cluster 4 to disciplines closely related to complexity science,
and so forth. The partition obtained from query log data in Fig. 5.18, while still
very reasonable, perhaps reflects how a general user, rather than a specialist, would
classify the disciplines: organization is related to information, systems and
community (cluster 2), research is either qualitative or quantitative (cluster 6), and
the like. There are also some counter-intuitive associations, such as putting biology
and markets in the same cluster (cluster 1). Note that the clustering (or modularity)
coefficient Q is higher in Fig. 5.18 than in Fig. 5.13, but this is only because there are
fewer inter-connections between terms in the query log data in general, and thus fewer
edges to 'cut' for the clustering algorithm.
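For completeness, the modularity coefficient Q referred to here is presumably the standard Newman-Girvan modularity (the text does not restate the definition, so this is an assumption):

\[
Q \;=\; \sum_{c} \left[ \frac{e_c}{m} \;-\; \left( \frac{d_c}{2m} \right)^{2} \right],
\]

where e_c is the number of edges inside cluster c, d_c is the total degree of the nodes in c, and m is the total number of edges in the graph. Because Q rewards partitions that cut few of the existing edges, a sparser graph, such as the one derived from the query log data, can reach a higher Q even when the resulting partition is not semantically better.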