Mining Social-Network Graphs - Mining of Massive Datasets

Database Reference

In-Depth Information

✦ The Girvan-Newman Algorithm : The Girvan-Newman Algorithm is an efficient technique for computing the

betweenness of edges. A breadth-first search from each node is performed, and a sequence of labeling steps com-

putes the share of paths from the root to each other node that go through each of the edges. The shares for an edge

that are computed for each root are summed to get the betweenness.

✦ Communities and Complete Bipartite Graphs : A complete bipartite graph has two groups of nodes, all possible

edges between pairs of nodes chosen one from each group, and no edges between nodes of the same group. Any suf-

ficiently dense community (a set of nodes with many edges among them) will have a large complete bipartite graph.

✦ Finding Complete Bipartite Graphs : We can find complete bipartite graphs by the same techniques we used for find-

ing frequent itemsets. Nodes of the graph can be thought of both as the items and as the baskets. The basket corres-

ponding to a node is the set of adjacent nodes, thought of as items. A complete bipartite graph with node groups of

size t and s can be thought of as finding frequent itemsets of size t with support s .

✦ Graph Partitioning : One way to find communities is to partition a graph repeatedly into pieces of roughly similar

sizes. A cut is a partition of the nodes of the graph into two sets, and its size is the number of edges that have one

end in each set. The volume of a set of nodes is the number of edges with at least one end in that set.

✦ Normalized Cuts : We can normalize the size of a cut by taking the ratio of the size of the cut and the volume of each

of the two sets formed by the cut. Then add these two ratios to get the normalized cut value. Normalized cuts with a

low sum are good, in the sense that they tend to divide the nodes into two roughly equal parts, and have a relatively

small size.

✦ Adjacency Matrices : These matrices describe a graph. The entry in row i and column j is 1 if there is an edge between

nodes i and j , and 0 otherwise.

✦ Degree Matrices : The degree matrix for a graph has d in the i th diagonal entry if d is the degree of the i th node. Off

the diagonal, all entries are 0.

✦ Laplacian Matrices : The Laplacian matrix for a graph is its degree matrix minus its adjacency matrix. That is, the

entry in row i and column i of the Laplacian matrix is the degree of the i th node of the graph, and the entry in row i

and column j , for i ≠ j , is −1 if there is an edge between nodes i and j , and 0 otherwise.

✦ Spectral Method for Partitioning Graphs : The lowest eigenvalue for any Laplacian matrix is 0, and its corresponding

eigenvector consists of all 1's. The eigenvectors corresponding to small eigenvalues can be used to guide a partition

of the graph into two parts of similar size with a small cut value. For one example, putting the nodes with a positive

component in the eigenvector with the second-smallest eigenvalue into one set and those with a negative component

into the other is usually good.

✦ Overlapping Communities : Typically, individuals are members of several communities. In graphs describing social

networks, it is normal for the probability that two individuals are friends to rise as the number of communities of

which both are members grows.

✦ The Affiliation-Graph Model : An appropriate model for membership in communities is to assume that for each com-

munity there is a probability that because of this community two members become friends (have an edge in the

social network graph). Thus, the probability that two nodes have an edge is 1 minus the product of the probabilities

that none of the communities of which both are members cause there to be an edge between them. We then find

the assignment of nodes to communities and the values of those probabilities that best describes the observed social

graph.

✦ Maximum-Likelihood Estimation : An important modeling technique, useful for modeling communities as well as

many other things, is to compute, as a function of all choices of parameter values that the model allows, the prob-

ability that the observed data would be generated. The values that yield the highest probability are assumed to be

correct, and called the maximum-likelihood estimate (MLE).

✦ Use of Gradient Descent : If we know membership in communities, we can find the MLE by gradient descent or

other methods. However, we cannot find the best membership in communities by gradient descent, because mem-

bership is discrete, not continuous.

✦ Improved Community Modeling by Strength of Membership : We can formulate the problem of finding the MLE of

communities in a social graph by assuming individuals have a strength of membership in each community, possibly

0 if they are not a member. If we define the probability of an edge between two nodes to be a function of their

Mining of Massive Datasets

Search WWH ::

Custom Search

Home