Information Technology Reference
In-Depth Information
factor Q and in the number of partitions, as the number of edges filtered decreases
(from left to right, in our figure). The filtering decision represents, in fact, a trade-
off. Having too many edges in the graph may stop us from finding a partition with
a reasonable modularity, due to the high volume of noise represented by weaker
edges. However, keeping only a small proportion of the strongest edges (e.g. 100
or 50 for a 50-tag graph, in our example), may also have disadvantages, since we
risk throwing away useful information. While a high modularity partition can be
obtained this way, the graph may become too fragmented: arguably, dividing 50
tags into 10 or 15 vocabularies may not be a very useful.
Note that it is difficult to establish a general rule for what a 'good' or universally
'correct' partition should be in this setting. For example, even the trivial partition
that assigns each tag to its own individual cluster cannot be rejected as “wrong” but
such a trivial partition would not be considered a useful result for most purposes. In
this paper we generally report the partitions found to have the highest modularity for
the setting. However, for many applications, having a partition with a certain number
of clusters, or some average cluster size, may be more desirable. The clustering
algorithm proposed here (Algorithm 1 ) can be easily modified to account for such
desiderata, by changing the stop criteria in line 9.
Figure 5.13 shows the solution with the highest modularity Q for a graph with
200 edges, in which seven clusters are identified. This partition assigns tags related
to mathematics and computer science to Cluster 1, tags related to social science and
phenomena to Cluster 2, complexity-related topics to Cluster 4 etc., while “art” is
assigned to its own individual cluster. This matches quite well our intuition, and its
modularity Q
=
0
.
34 is above (albeit close) to the theoretical relevance threshold
of 0.3.
5.5.5.1
Eliminating Tags from Resulting Partitions to Improve Modularity
The analysis in the previous section shows that community detection algorithms
were able to produce useful partitions, with above-relevance modularity. Still,
there are a few general-meaning tags that would fit well into any of the subsets
resulting after the partition. These tags generally reduce the Q modularity measure
significantly, since they increase the inter-cluster edges. Therefore, we hypothesized
that the modularity of the resulting partitions could be greatly improved by removing
just a few tags from the set under consideration. In order to test this hypothesis, we
tested another greedy tag elimination algorithm, formally defined as Algorithm 2 .
Result graphs are shown in Fig. 5.15 , while in Fig. 5.13 we show the top 5 tags that,
if eliminated, would increase modularity Q from 0.34 to 0.43.
As seen in Fig. 5.2 , for this data set only 5-6 tags need to be eliminated
as eliminating more does not lead to a further increases in Q . In the example
in Fig. 5.13 , we see which these are, in order of elimination: theory, science,
research, simulation, networks. In fact, these tags, that are marked for elimination
automatically by Algorithm 2 , are exactly those that are the most general in meaning
and would fit well into any of the subsets.
Search WWH ::




Custom Search