Link Analysis - Mining of Massive Datasets

Databases Reference

In-Depth Information

Example 5.13 : A typical department at a university maintains a Web page

listing all the courses offered by the department, with links to a page for each

course, telling about the course - the instructor, the text, an outline of the

course content, and so on. If you want to know about a certain course, you

need the page for that course; the departmental course list will not do. The

course page is an authority for that course. However, if you want to find out

what courses the department is offering, it is not helpful to search for each

courses' page; you need the page with the course list first. This page is a hub

for information about courses.

2

Just as PageRank uses the recursive definition of importance that “a page

is important if important pages link to it,” HITS uses a mutually recursive

definition of two concepts: “a page is a good hub if it links to good authorities,

and a page is a good authority if it is linked to by good hubs.”

5.5.2

Formalizing Hubbiness and Authority

To formalize the above intuition, we shall assign two scores to each Web page.

One score represents the hubbiness of a page - that is, the degree to which it

is a good hub, and the second score represents the degree to which the page

is a good authority. Assuming that pages are enumerated, we represent these

scores by vectors h and a. The ith component of h gives the hubbiness of the

ith page, and the ith component of a gives the authority of the same page.

While importance is divided among the successors of a page, as expressed by

the transition matrix of the Web, the normal way to describe the computation

of hubbiness and authority is to add the authority of successors to estimate

hubbiness and to add hubbiness of predecessors to estimate authority. If that

is all we did, then the hubbiness and authority values would typically grow

beyond bounds. Thus, we normally scale the values of the vectors h and a so

that the largest component is 1. An alternative is to scale so that the sum of

components is 1.

To describe the iterative computation of h and a formally, we use the link

matrix of the Web, L. If we have n pages, then L is an n×n matrix, and L ij = 1

if there is a link from page i to page j, and L ij = 0 if not. We shall also have

need for L T , the transpose of L. That is, L ij = 1 if there is a link from page

j to page i, and L ij = 0 otherwise. Notice that L T is similar to the matrix M

that we used for PageRank, but where L T has 1, M has a fraction - 1 divided

by the number of out-links from the page represented by that column.

Example 5.14 : For a running example, we shall use the Web of Fig. 5.4,

which we reproduce here as Fig. 5.18. An important observation is that dead

ends or spider traps do not prevent the HITS iteration from converging to a

meaningful pair of vectors. Thus, we can work with Fig. 5.18 directly, with

no “taxation” or alteration of the graph needed.

The link matrix L and its

transpose are shown in Fig. 5.19.

2

Mining of Massive Datasets

Search WWH ::

Custom Search

Home