5.6 Comparing Tags to Search Keywords
While these applications of tagging distributions have shown promise, a reasonable
question is how well they compare to a benchmark that does not use tagging
distributions. In other words, is the notion of a Fregean sense inherently limited
to tags explicitly created in tagging systems? The most compelling other setting in
which natural language terms are
attached to URIs is that of search engines. One can consider the query terms of a
user in a search engine as the implicit tagging of a resource, as is done in what has
been termed 'query flow graphs' (Poblete and Baeza-Yates 2008). Thus, the main
difference between search engine terms and tags is that in search engines natural
language terms are used pre-discovery to find a resource, while tags are
terms attached to a resource post-hoc. Regardless, this also means that the Fregean
notion of a sense does not have to be confined to the collective tags attached to
a resource, but can include search terms as well. However, as the data for the
stabilization of search terms is not publicly available, unlike that of tagging systems, for the
time being we will have to compare tagging to search terms using the more limited
correlation graph techniques.
The idea of approximating semantics by using search engine data has, in fact,
been proposed before, and is usually found in existing literature under the name
of “Google distance.” Cilibrasi and Vitanyi (2007) were the first to introduce the
concept of “Google distance” from an information-theoretic standpoint, while other
researchers (Gligorov et al. 2008) have recently proposed using it for tasks such as
approximate ontology matching. It is fair to assume (although we have no way of
knowing this with certainty) that current search engines and related applications,
such as Google Sets, also use text or query log mining techniques (as opposed to
collaborative tagging) to solve similar problems.
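
For reference, the normalized Google distance of Cilibrasi and Vitanyi (2007) can be computed from hit counts alone. The following is a minimal sketch, not code from this work: the function name and argument names are illustrative, and the counts would have to come from whatever search index is at hand.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance from raw hit counts.

    fx, fy : number of resources returned for each term alone
    fxy    : number of resources returned for both terms together
    n      : total number of resources indexed (assumed > min(fx, fy))
    """
    # Terms that never occur, or never co-occur, are treated as
    # infinitely far apart (an illustrative convention).
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

A distance of 0 means the two terms always appear together; larger values indicate terms that rarely share resources.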
There are two ways of comparing terms (in this case, keywords) using a search
engine. One method would be to compare the number of resources that are retrieved
using each of the keywords and their combinations. Another method is to use the
query log data itself, where the co-occurrence of terms in the same queries relative
to their individual frequencies is the indicator of semantic distance. We employ this
latter method as it is more amenable to comparison with our work on tagging. Here
the query terms play the role of tags: instead of basing our folksonomy graphs and
vocabulary extraction on tags, we use query terms. In
general, query log data is considered proprietary and much more difficult to obtain
than tagging data. We were fortunate to have access to a large-scale set of
query log data, obtained through two separate proposals awarded under Microsoft's “Beyond
Search” awards. In the following we describe our methodology and empirical
results.
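
As a rough illustration of the query log method described above, the sketch below counts how often terms appear on their own and together in the same query, then normalizes the co-occurrence count by the individual frequencies. The cosine-style normalization and the function name are assumptions made for illustration; the text above does not fix a particular formula.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_similarity(queries):
    """Score term pairs by co-occurrence within the same query.

    queries: iterable of raw query strings, one per log entry.
    Returns a dict mapping alphabetically sorted (term_a, term_b)
    pairs to a similarity in (0, 1].
    """
    term_freq = Counter()  # number of queries containing each term
    pair_freq = Counter()  # number of queries containing both terms
    for query in queries:
        # Each term counts once per query; sorting gives stable pair keys.
        terms = sorted(set(query.lower().split()))
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))

    # Assumed normalization: co-occurrences divided by the geometric
    # mean of the two individual frequencies.
    return {
        (a, b): n_ab / math.sqrt(term_freq[a] * term_freq[b])
        for (a, b), n_ab in pair_freq.items()
    }


log = ["python tutorial", "python snake", "python tutorial video"]
scores = cooccurrence_similarity(log)
```

The resulting pair scores can serve as edge weights in a graph over query terms, in the same way tag co-occurrence is used for folksonomy graphs.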