The Semantics of Tagging - Social Semantics: The Search for Meaning on the Web

Information Technology Reference

In-Depth Information

factor Q and in the number of partitions, as the number of edges filtered decreases

(from left to right, in our figure). The filtering decision represents, in fact, a trade-

off. Having too many edges in the graph may stop us from finding a partition with

a reasonable modularity, due to the high volume of noise represented by weaker

edges. However, keeping only a small proportion of the strongest edges (e.g. 100

or 50 for a 50-tag graph, in our example), may also have disadvantages, since we

risk throwing away useful information. While a high modularity partition can be

obtained this way, the graph may become too fragmented: arguably, dividing 50

tags into 10 or 15 vocabularies may not be a very useful.

Note that it is difficult to establish a general rule for what a 'good' or universally

'correct' partition should be in this setting. For example, even the trivial partition

that assigns each tag to its own individual cluster cannot be rejected as “wrong” but

such a trivial partition would not be considered a useful result for most purposes. In

this paper we generally report the partitions found to have the highest modularity for

the setting. However, for many applications, having a partition with a certain number

of clusters, or some average cluster size, may be more desirable. The clustering

algorithm proposed here (Algorithm 1 ) can be easily modified to account for such

desiderata, by changing the stop criteria in line 9.

Figure 5.13 shows the solution with the highest modularity Q for a graph with

200 edges, in which seven clusters are identified. This partition assigns tags related

to mathematics and computer science to Cluster 1, tags related to social science and

phenomena to Cluster 2, complexity-related topics to Cluster 4 etc., while “art” is

assigned to its own individual cluster. This matches quite well our intuition, and its

modularity Q

=

0

.

34 is above (albeit close) to the theoretical relevance threshold

of 0.3.

5.5.5.1

Eliminating Tags from Resulting Partitions to Improve Modularity

The analysis in the previous section shows that community detection algorithms

were able to produce useful partitions, with above-relevance modularity. Still,

there are a few general-meaning tags that would fit well into any of the subsets

resulting after the partition. These tags generally reduce the Q modularity measure

significantly, since they increase the inter-cluster edges. Therefore, we hypothesized

that the modularity of the resulting partitions could be greatly improved by removing

just a few tags from the set under consideration. In order to test this hypothesis, we

tested another greedy tag elimination algorithm, formally defined as Algorithm 2 .

Result graphs are shown in Fig. 5.15 , while in Fig. 5.13 we show the top 5 tags that,

if eliminated, would increase modularity Q from 0.34 to 0.43.

As seen in Fig. 5.2 , for this data set only 5-6 tags need to be eliminated

as eliminating more does not lead to a further increases in Q . In the example

in Fig. 5.13 , we see which these are, in order of elimination: theory, science,

research, simulation, networks. In fact, these tags, that are marked for elimination

automatically by Algorithm 2 , are exactly those that are the most general in meaning

and would fit well into any of the subsets.

Social Semantics: The Search for Meaning on the Web

Search WWH ::

Custom Search

Home