Other companies took a different approach by providing web search engines.
These companies attempted to supply an index to the content of web pages at
the different sites. A search engine works just like the index in a book, helping
the reader look up a particular topic. By 1998 the leading search engine,
with more than 50 percent market share, was AltaVista. Computer scientist
Paul Flaherty at Digital Equipment Corporation's (DEC's) Network Systems
Laboratory in Palo Alto had the idea of DEC building a web index. He recruited
colleagues Louis Monier and Michael Burrows to write the software for what
became the AltaVista search engine.
To create indexes for individual web pages, a search engine must first
search out and capture these web pages (Fig. 11.19). This search is done with
a web crawler, a piece of software that follows hypertext links to discover new
web pages. The crawler sends out “spiders,” which are given explicit instruc-
tions on where to start crawling and what strategy to use in following links to
visit new pages.
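The crawling strategy described above can be sketched as a breadth-first traversal of the link graph. The sketch below is illustrative only: it substitutes a small in-memory dictionary of pages and links for the real HTTP fetching and HTML parsing a production crawler would do, and all the URLs in it are made up.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
# A real crawler would fetch each page over HTTP and parse out its links.
LINK_GRAPH = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: start from seed URLs and follow links to new pages."""
    frontier = deque(seeds)   # pages waiting to be visited
    visited = []              # pages in the order we fetched them
    seen = set(seeds)         # avoid queuing the same URL twice
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.example/"]))
# Each reachable page is visited exactly once, nearest pages first.
```

The `seen` set is what keeps the spiders from revisiting pages; the `frontier` queue embodies the "strategy to use in following links" (here, breadth-first; a depth-first crawler would use a stack instead).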
The web pages returned by the spiders now need to be indexed. The indexing
software takes each new page, extracts key information from it, and stores a
compressed description of the page in one or more indexes. The first type of
index is called the content index. This directory stores information about the
different words on the page in a structure known as an "inverted file," which is
similar to the index in the back of a book. Next to each term being indexed, the
inverted file keeps information, such as the page numbers on which the term
appears. We can now do single-word queries to find the relevant web pages. Of
course, to efficiently handle more complex queries, we need to store more than
just the page number for each word. We can add extra information, such as the
number of times a word appears on a page, its location on the web page, and
so on. A key advance made by AltaVista was also to store information about the
HTML structure of the web page. By looking at the HTML tags on the page, we
can identify whether the word being queried appears in the title, in the body
of the page, or in the anchor text, the specific word or words used to represent
the hypertext link. All of this indexed information is combined to deliver an
overall “content score” for each web page to determine the most relevant page
in answer to a query. It was this combination of content and structure informa-
tion that made AltaVista the leading search engine by 1998.
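The ideas in this paragraph can be condensed into a small sketch: an inverted file that records, for each word, which pages it appears on and in which HTML fields, plus a scoring function that weights those fields to produce a "content score." The pages, field names, and weights below are invented for illustration; they are not AltaVista's actual data structures or values.

```python
from collections import defaultdict

# Toy pages: each has a title, body text, and anchor text pointing at it.
PAGES = {
    "p1": {"title": "alta vista search", "body": "fast web search engine", "anchor": "search"},
    "p2": {"title": "cooking tips", "body": "search for recipes online", "anchor": "recipes"},
}

# Illustrative weights: a hit in the title or anchor text counts for more
# than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def build_index(pages):
    """Inverted file: word -> {page -> {field -> occurrence count}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for page, fields in pages.items():
        for field, text in fields.items():
            for word in text.split():
                index[word][page][field] += 1
    return index

def content_score(index, word):
    """Combine per-field counts into one weighted content score per page."""
    scores = {}
    for page, fields in index.get(word, {}).items():
        scores[page] = sum(FIELD_WEIGHTS[f] * n for f, n in fields.items())
    # Most relevant page first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_index(PAGES)
print(content_score(index, "search"))
# p1 scores highest: "search" appears in its title, body, and anchor text.
```

Storing per-field counts rather than a bare list of pages is exactly the extra information the text describes: it lets the query module distinguish a word in a page's title from the same word buried in the body.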
Modern search engines use more than just the content and structure to
determine the best websites to return in answer to a user's query. It was the
development of the PageRank algorithm by two Stanford graduate students,
[Fig. 11.19. Basic structure of a search engine, showing the query-independent elements that respond to a user's query: a web crawler feeding a page repository; an indexing module building the content, structure, and special-purpose indexes; and the query and ranking modules.]
B.11.9. David Filo and Jerry Yang founded the search company Yahoo! Inc.
as Stanford graduate students. In 1994 they started compiling a directory of
websites and extended the portal with a range of online services. In 1996
the company went public and became one of the landmark successes of the
dot-com era. After the dot-com crash in 2000, the company suffered significant
losses but Yahoo remains one of the household names of the Internet age,
delivering online services to millions of customers.