Fig. 5.16 Correlation graph from Microsoft queries, showing only correlations to the term
“complexity”
5.6.1 Data Set and Methodology Employed
The data set we used consists of 101,000,000 organic search queries from
Microsoft's search engine Live.com, collected during a 3-month interval in 2006.
Based on this set of queries, we computed the bilateral correlation between all
pairs from the set of complexity-related terms considered in Sects. 5.4 and 5.5
above. The terms are, however, no longer treated as tags, but as search keywords
(see footnote 9). The correlation between any two keywords $T_i$ and $T_j$ is
computed using the cosine distance formula in (5.9) from Sect. 5.4 above.
However, here $N(T_i, T_j)$ represents the number of queries in which the
keywords $T_i$ and $T_j$ appear together, while $N(T_i)$ and $N(T_j)$ are the
numbers of queries in which $T_i$ and $T_j$, respectively, appear in total
(irrespective of other terms in the query), out of the 101 million queries in
the data set.
The rest of the analysis closely mirrors the steps described in Sects. 5.4 and
5.5, but with the learning parameters optimized to best fit this data set, in
order to give both methods a fair chance in the comparison. More specifically,
the Pajek visualizations of the keyword graphs in Figs. 5.16 and 5.17 were also
built using a spring-embedder algorithm based on the Kamada-Kawai distance,
while Fig. 5.18 shows the keyword vocabulary partition that maximizes the
modularity coefficient Q in the new setting, considering the top 200 edges. The
graph pictures use a different color scheme to make clear that they result from
entirely different data sets: Figs. 5.11 and 5.12 from del.icio.us collaborative
tagging data, and Figs. 5.16 and 5.17 from Microsoft's Live.com query logs.
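The keyword correlation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. Since formula (5.9) is not reproduced in this section, the sketch assumes the usual cosine form, N(T_i, T_j) / sqrt(N(T_i) N(T_j)). It further assumes that a keyword "appears" in a query when it matches one of the query's whitespace-separated, lowercased tokens. The function names `keyword_correlations` and `top_edges` and the toy queries are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def keyword_correlations(queries, keywords):
    """Return {(t_i, t_j): cosine correlation} for keyword pairs that co-occur."""
    keywords = set(keywords)
    single = Counter()  # N(T_i): number of queries containing T_i
    joint = Counter()   # N(T_i, T_j): number of queries containing both
    for query in queries:
        # Assumption: whitespace tokenization, case-insensitive matching.
        present = sorted(keywords & set(query.lower().split()))
        single.update(present)
        joint.update(combinations(present, 2))
    # Assumed cosine form: N(T_i, T_j) / sqrt(N(T_i) * N(T_j)).
    return {
        pair: n_ij / math.sqrt(single[pair[0]] * single[pair[1]])
        for pair, n_ij in joint.items()
    }


def top_edges(correlations, k):
    """Keep the k most strongly correlated keyword pairs (graph edges)."""
    return sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy query log, invented purely for illustration.
queries = [
    "complexity theory",
    "chaos and complexity",
    "chaos theory",
    "complexity",
    "weather forecast",
]
corr = keyword_correlations(queries, ["complexity", "chaos", "theory"])
strongest = top_edges(corr, 1)
```

On the real data set, `top_edges(corr, 200)` would correspond to restricting the graph to its 200 strongest edges before partitioning. Each pair is stored once, with its two keywords in alphabetical order, so the correlation is treated as symmetric.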
9 We acknowledge this method has some drawbacks, as a few terms in the complexity-related set,
such as 'powerlaw' and 'complexsystems' (spelled as one word) or 'alife' (for 'artificial life') are
natural to use as tags, but not very natural as search keywords. However, since there are only three
such non-word tags, they do not significantly affect our analysis.