The Semantics of Tagging - Social Semantics: The Search for Meaning on the Web

5.6 Comparing Tags to Search Keywords

While these applications of tagging distributions have shown promise, one can
reasonably ask how well they compare to some benchmark that does not use tagging
distributions. In other words, is the notion of a Fregean sense inherently limited
to tags explicitly created in tagging systems? The most compelling other setting in
which natural language terms are attached to URIs is that of search engines. One can
consider the query terms of a user in a search engine as the implicit tagging of a
resource, as is done in what has been termed 'query flow graphs' (Poblete and
Baeza-Yates 2008). Thus, the main difference between search engine terms and tags is
that in search engines natural language terms are used to find a resource before it
has been discovered, while tags are terms attached to a resource post hoc.
Regardless, this also means that the Fregean notion of a sense does not have to be
confined to the collective tags attached to a resource, but can include search terms
as well. However, as the data for the stabilization of search terms is not publicly
available, unlike that of tagging systems, for the time being we will have to
compare tagging to search terms using the more limited correlation graph techniques.

The idea of approximating semantics by using search engine data has, in fact, been
proposed before, and is usually found in the existing literature under the name of
“Google distance.” Cilibrasi and Vitanyi (2007) were the first to introduce the
concept of “Google distance” from an information-theoretic standpoint, while other
researchers (Gligorov et al. 2008) have recently proposed using it for tasks such as
approximate ontology matching. It is fair to assume (although we have no way of
knowing this with certainty) that current search engines and related applications,
such as Google Sets, also use text or query log mining techniques (as opposed to
collaborative tagging) to solve similar problems.
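
As a rough illustration of the "Google distance" idea, the following minimal sketch computes the normalized Google distance of Cilibrasi and Vitanyi (2007) from hit counts. The function name, the parameter names, and the example counts are assumptions made for illustration; no real search engine is queried, and the counts would have to be supplied by the reader.

```python
import math


def normalized_google_distance(f_x, f_y, f_xy, n_pages):
    """Normalized Google distance computed from hit counts.

    f_x, f_y : number of pages containing term x, term y (hypothetical inputs)
    f_xy     : number of pages containing both terms
    n_pages  : assumed total number of indexed pages
    """
    if f_x <= 0 or f_y <= 0:
        raise ValueError("each term needs at least one hit")
    if n_pages <= min(f_x, f_y):
        raise ValueError("n_pages must exceed the individual hit counts")
    if f_xy <= 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))


# Illustrative counts only: two terms that always appear together
# have distance zero.
print(normalized_google_distance(1000, 1000, 1000, 10**6))
```

Smaller values indicate terms that tend to appear on the same pages; the distance grows as the joint hit count shrinks relative to the individual counts.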

There are two ways of comparing terms (in this case, keywords) using a search
engine. One method would be to compare the number of resources that are retrieved
using each of the keywords and their combinations. Another method is to use the
query log data itself, where the co-occurrence of the terms in the same queries
versus their individual frequency is the indicator of semantic distance. We employ
this latter method as it is more amenable to comparison with our work on tagging. In
this method the query terms play the role of tags: instead of basing our folksonomy
graphs and vocabulary extraction on tags, we base them on query terms. In general,
query log data is considered proprietary and is much more difficult to obtain than
tagging data. We were fortunate to have access to a large-scale set of query log
data, made available through two separate proposals that received Microsoft's
“Beyond Search” awards. In the following we describe our methodology and empirical
results.
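
To make the query-log approach concrete, here is a minimal sketch of comparing terms by their co-occurrence in the same queries relative to their individual frequencies. It is not the methodology of this study: the cosine-style normalization, the whitespace tokenization, the function name `term_similarities`, and the toy queries are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def term_similarities(queries):
    """Score term pairs by query co-occurrence, normalized by term frequency.

    Returns a dict mapping an alphabetically ordered (term_a, term_b) pair to
    cooccurrence(a, b) / sqrt(freq(a) * freq(b)), an assumed cosine-style
    measure. Pairs that never co-occur are absent from the result.
    """
    term_freq = Counter()  # number of queries containing each term
    pair_freq = Counter()  # number of queries containing both terms of a pair
    for query in queries:
        # Each term counts once per query; sorting fixes the pair order.
        terms = sorted(set(query.lower().split()))
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return {
        pair: count / math.sqrt(term_freq[pair[0]] * term_freq[pair[1]])
        for pair, count in pair_freq.items()
    }


# Toy query log, invented for illustration.
log = ["jaguar car", "jaguar cat", "jaguar car price"]
for pair, score in sorted(term_similarities(log).items()):
    print(pair, round(score, 3))
```

Higher scores mean the two terms appear together in a larger share of the queries that contain either of them, so a distance can be taken as one minus the score. The resulting weighted pairs can be read as the edges of a term graph, with query terms in place of tags.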
