Weaving the World Wide Web - The Computing Universe: A Journey through a Revolution - page 235

Information Technology Reference

In-Depth Information

hypertext links from page to page. When he or she arrives at a web page

with multiple links, the surfer randomly chooses one of the links. We can

now attempt to calculate the importance of each page by counting how many

times the random surfer visits each site. To avoid the problem of buckets -

cycles where the surfer can go around forever - Page and Brin introduced a

“teleportation” probability. At each page the surfer visits, there is a chance

that he or she does not follow one of the links from the page but instead

jumps to an entirely new web page chosen at random. The surfer then uses

this page as a new starting point for continued surfing. In addition, to avoid

the problem of the surfer getting stuck in a so-called dangling node , a page

with no outgoing web links, Page and Brin also allowed the surfer to jump to

a random new page from such nodes. We can simulate the action of the ran-

dom surfer on a network of web links and find out which pages the surfer vis-

its most often. The importance score calculated in this way takes into account

both the number of hyperlinks pointing to a web page and the authority of

the linking web page. The method works in the presence of link cycles and

dangling nodes and is able to deliver meaningful and stable values for each

page's importance.

Figure 11.22a shows a network of eleven web pages and their associ-

ated links. A computer simulation of the random surfer algorithm gives the

PageRank score shown in Figure 11.22b . In practice, the computation of the

PageRank score is not done by performing a computer simulation of the surf-

er's travels. Instead, the problem can be formulated as solving a mathematical

problem involving sparse matrices . A matrix is an array of numbers or symbols

arranged in rows and columns, and sparse just means that the matrix has many

elements that are zero. The matrices concerned are huge, with many billions

of rows, but fortunately fast computational methods exist that can be used to

find the PageRank scores.

Armed with the PageRank algorithm, by early 1997 Page had developed

a prototype search engine that he called “BackRub” because it analyzed the

incoming or “back” links to web pages to calculate a page's importance. While

other searches relied on the content and structure indexes, BackRub added the

dimension of importance to rank pages in order of likely relevance. By the end

of the year, Page and his officemate at Stanford, Sean Anderson, were brain-

storming for a new name for the search engine. They decided on the name

Google, thinking the word google meant a very large number. Page registered

Google.com the same evening only to find out next day that the word they

were thinking of was spelled googol , meaning the number 1 followed by one

hundred zeros. With their search engine now being used by Stanford students

and faculty, Page and Brin needed to beg and borrow as many computers as

they could to keep up with both the growth of users and that of the web. Page's

dorm room became the first Google data center, crammed with inexpensive

PCs ( Fig. 11.23 ). Today Google has data centers around the world that run hun-

dreds of thousands of servers ( Fig. 11.24 ).

Having demonstrated the superiority of a search engine using the PageRank

algorithm to calculate a web page's importance, Page and Brin tried to inter-

est AltaVista in buying the rights to their system. Paul Flaherty, one of the

originators of the AltaVista search engine, was impressed with their link-based

approach to page ranking. Nevertheless, Flaherty came back to them saying

L

M

H

I

G

E

D

F

B

A

C

(a)

H(1.62)

I(1.62)

G(1.62)

L(1.62)

M(1.62)

E(8.08)

D(3.91)

F(3.91)

A(3.28)

B(38.42)

C(34.31)

(b)

Fig. 11.22. (a) An example of a set of web

pages - nodes - showing the hypertext

links connecting the different pages with

the direction of each link marked by an

arrow. Web page A is known as a dan-

gling node because this has no outgoing

links. Pages B and C are said to form a

bucket , a reachable strongly connected

pair of pages with no outgoing links to

the rest of the pages of the graph. (b)

The importance of each page can be

calculated using the PageRank algorithm.

For this graph we have calculated the

importance of each page using the ran-

dom surfer method although in practice

other methods are used. The scores have

been normalized to add up to 100. Page

C receives just one link but this is from

the most important page. Page C ends

up being more important than page

E - which receives more links but all but

one from pages G, H, M, L, and I, which

all have no incoming links and therefore

have the minimum importance.

Next Page

The Computing Universe: A Journey through a Revolution

Search WWH ::

Custom Search

Home