Database Reference
In-Depth Information
communities { A, B, C } and { D, E, F, G }. However, we could also view { D, E, F } and { D,
F, G } as two subcommunities of { D, E, F, G }; these two subcommunities overlap in two of
their members, and thus could never be identified by a pure clustering algorithm. Finally,
we could consider each pair of individuals that are connected by an edge as a community
of size 2, although such communities are uninteresting.
Figure 10.3 Repeat of Fig. 10.1
The problem with hierarchical clustering of a graph like that of Fig. 10.3 is that at some
point we are likely to chose to combine B and D , even though they surely belong in differ-
ent clusters. The reason we are likely to combine B and D is that D , and any cluster con-
taining it, is as close to B and any cluster containing it, as A and C are to B . There is even a
1/9 probability that the first thing we do is to combine B and D into one cluster.
There are things we can do to reduce the probability of error. We can run hierarchical
clustering several times and pick the run that gives the most coherent clusters. We can use
a more sophisticated method for measuring the distance between clusters of more than one
node, as discussed in Section 7.2.3 . But no matter what we do, in a large graph with many
communities there is a significant chance that in the initial phases we shall use some edges
that connect two nodes that do not belong together in any large community.
Now consider a point-assignment approach to clustering social networks. Again, the fact
that all edges are at the same distance will introduce a number of random factors that will
lead to some nodes being assigned to the wrong cluster. An example should illustrate the
point.
EXAMPLE 10.4 Suppose we try a k -means approach to clustering Fig. 10.3 . As we want two
clusters, we pick k = 2. If we pick two starting nodes at random, they might both be in the
same cluster. If, as suggested in Section 7.3.2 , we start with one randomly chosen node and
then pick another as far away as possible, we don't do much better; we could thereby pick
any pair of nodes not connected by an edge, e.g., E and G in Fig. 10.3 .
However, suppose we do get two suitable starting nodes, such as B and F . We shall then
assign A and C to the cluster of B and assign E and G to the cluster of F . But D is as close
to B as it is to F , so it could go either way, even though it is “obvious” that D belongs with
F .
If the decision about where to place D is deferred until we have assigned some other
nodes to the clusters, then we shall probably make the right decision. For instance, if we
Search WWH ::




Custom Search