Other companies took a different approach by providing web search engines.
These companies attempted to supply an index to the content of web pages at
the different sites. A search engine works just like the index in a book, helping
the reader look up a particular topic. By 1998 the leading search engine,
with more than 50 percent market share, was AltaVista. Computer scientist
Paul Flaherty at Digital Equipment Corporation's (DEC's) Network Systems
Laboratory in Palo Alto had the idea of DEC building a web index. He recruited
colleagues Louis Monier and Michael Burrows to write the software for what
became the AltaVista search engine.
To create indexes for individual web pages, a search engine must first
search out and capture these web pages (Fig. 11.19). This search is done with
a web crawler, a piece of software that follows hypertext links to discover new
web pages. The crawler sends out “spiders,” which are given explicit instruc-
tions on where to start crawling and what strategy to use in following links to
visit new pages.
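The crawling strategy described above can be sketched as a breadth-first traversal of the link graph. The sketch below is illustrative only: it substitutes a small in-memory dictionary of pages and links for the real HTTP fetching and HTML parsing a production crawler would do, and all the URLs in it are made up.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
# A real crawler would fetch each page over HTTP and parse out its links.
LINK_GRAPH = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: start from seed URLs and follow links to new pages."""
    frontier = deque(seeds)   # pages waiting to be visited
    visited = []              # pages in the order we fetched them
    seen = set(seeds)         # avoid queuing the same URL twice
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.example/"]))
# Each reachable page is visited exactly once, nearest pages first.
```

The `seen` set is what keeps the spiders from revisiting pages; the `frontier` queue embodies the "strategy to use in following links" (here, breadth-first; a depth-first crawler would use a stack instead).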
The web pages returned by the spiders now need to be indexed. The indexing
software takes each new page, extracts key information from it, and stores a
compressed description of the page in one or more indexes. The first type of
index is called the content index. This directory stores information about the
different words on the page in a structure known as an "inverted file," which is
similar to the index in the back of a book. Next to each term being indexed, the
inverted file keeps information, such as the page numbers on which the term
appears. We can now do single-word queries to find the relevant web pages. Of
course, to efficiently handle more complex queries, we need to store more than
just the page number for each word. We can add extra information, such as the
number of times a word appears on a page, its location on the web page, and
so on. A key advance made by AltaVista was also to store information about the
HTML structure of the web page. By looking at the HTML tags on the page, we
can identify whether the word being queried appears in the title, in the body
of the page, or in the anchor text, the specific word or words used to represent
the hypertext link. All of this indexed information is combined to deliver an
overall “content score” for each web page to determine the most relevant page
in answer to a query. It was this combination of content and structure informa-
tion that made AltaVista the leading search engine by 1998.
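The ideas in this paragraph can be condensed into a small sketch: an inverted file that records, for each word, which pages it appears on and in which HTML fields, plus a scoring function that weights those fields to produce a "content score." The pages, field names, and weights below are invented for illustration; they are not AltaVista's actual data structures or values.

```python
from collections import defaultdict

# Toy pages: each has a title, body text, and anchor text pointing at it.
PAGES = {
    "p1": {"title": "alta vista search", "body": "fast web search engine", "anchor": "search"},
    "p2": {"title": "cooking tips", "body": "search for recipes online", "anchor": "recipes"},
}

# Illustrative weights: a hit in the title or anchor text counts for more
# than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def build_index(pages):
    """Inverted file: word -> {page -> {field -> occurrence count}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for page, fields in pages.items():
        for field, text in fields.items():
            for word in text.split():
                index[word][page][field] += 1
    return index

def content_score(index, word):
    """Combine per-field counts into one weighted content score per page."""
    scores = {}
    for page, fields in index.get(word, {}).items():
        scores[page] = sum(FIELD_WEIGHTS[f] * n for f, n in fields.items())
    # Most relevant page first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_index(PAGES)
print(content_score(index, "search"))
# p1 scores highest: "search" appears in its title, body, and anchor text.
```

Storing per-field counts rather than a bare list of pages is exactly the extra information the text describes: it lets the query module distinguish a word in a page's title from the same word buried in the body.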
Modern search engines use more than just the content and structure to
determine the best websites to return in answer to a user's query. It was the
development of the PageRank algorithm by two Stanford graduate students,
[Fig. 11.19. Basic structure of a search engine, showing the query-independent elements that respond to a user's query: a web crawler feeding a page repository; an indexing module building the content, structure, and special-purpose indexes; and the query and ranking modules.]
B.11.9. David Filo and Jerry Yang founded the search company Yahoo! Inc.
as Stanford graduate students. In 1994 they started compiling a directory of
websites and extended the portal with a range of online services. In 1996
the company went public and became one of the landmark successes of the
dot-com era. After the dot-com crash in 2000, the company suffered significant
losses but Yahoo remains one of the household names of the Internet age,
delivering online services to millions of customers.