World Wide Web Search Technologies

INTRODUCTION

The World Wide Web now holds more than six billion pages covering almost all daily issues. The Web’s fast growing size and lack of structural style present a new challenge for information retrieval (Lawrence & Giles, 1999a). Traditional search techniques are based on users typing in search keywords which the search services can then use to locate the desired Web pages. However, this approach normally retrieves too many documents, of which only a small fraction are relevant to the users’ needs. Furthermore, the most relevant documents do not necessarily appear at the top of the query output list. Numerous search technologies have been applied to Web search engines; however, the dominant search methods have yet to be identified. This article provides an overview of the existing technologies for Web search engines and classifies them into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content-based multimedia searches, and vi) others. At the end of this article, a comparative study of major commercial and experimental search engines is presented, and some future research directions for Web search engines are suggested. Related Web search technology review can also be found in Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan (2001) and Lawrence and Giles (1999b).

Requirements of Web Search Engines

It is first necessary to examine what kind of features a Web search engine is expected to have in order to conduct effective and efficient Web searches. The requirements for a Web search engine are listed below in order of importance:
1. Effective and efficient location and ranking of Web documents;
2. Thorough Web coverage;
3. Up-to-date Web information;
4. Unbiased access to Web pages;
5. An easy-to-use user interface which also allows users to compose any reasonable query;
6. Expressive and useful search results; and
7. A system that adapts well to user queries.


BACKGROUND

Two different approaches are applied to Web search services: genuine search engines and directories. The difference lies in how listings are compiled.
• Search engines, such as Google, create their listings automatically.
• A directory, such as Yahoo!, depends on humans for its listings.
Some search engines, known as hybrid search engines, maintain an associated directory. Figure 1 shows the system structure of a typical search engine. Search engines traditionally consist of three components: i) the crawler, ii) the indexing software, and iii) the search and ranking software:
• A crawler is a program that automatically scans various Web sites and collects Web documents from them. Two search algorithms, breadth-first searches and depth-first searches, are widely used by crawlers to traverse the Web.
• Automatic indexing is the process of algorithmically examining information items to build a data structure that can be quickly searched. Traditional search engines utilize the following information, provided by HTML scripts, to locate the desired Web pages: i) content, ii) descriptions, iii) hyperlink, iv) hyperlink text, v) keywords, vi) page title, vii) text with a different font, and viii) the first sentence. • Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. A user enters a keyword or keywords, along with Boolean modifiers such as “and”, “or”, or “not”, into a search engine, which then scans indexed Web pages for the keywords. To determine in which order to display pages to the user, the engine uses an algorithm to rank pages that contain the keywords.

Figure 1. System structure of a Web search engine

 System structure of a Web search engine

SEARCH ENGINE TECHNOLOGIES

This section examines the existing technologies for Web search engines and classifies them into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content-based multimedia searches, and vi) others.

Hyperlink Exploration

Links can be tremendously important sources of information for indexers; the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page being pointed to. This approach is based on identifying two important types of Web pages for a given topic:
• Authorities, which provide the best source of information on the topic, and
• Hubs, which provide collections of links to authorities.
Authorities and hubs are either given top ranking in the search results or used to find related Web pages. A simple method to update a non-negative authority with a weight xp and a non-negative hub with a weight yp is given by Chakrabarti et al. (1999). If a page is pointed to by many good hubs, its authority weight is updated by using the following formula:
tmp156-9_thumb
where the notation q p indicates that q links to p. Similarly, if a page points to many good authorities, its hub weight is updated via
tmp156-10_thumb

Figure 2. Expanding the root set into a base set

Expanding the root set into a base set
Unfortunately, applying the above formulas to the entire Web to find authorities and hubs is impracticable. Ideally, the formulas are applied to a small collection S of pages which contain plenty of relevant documents. The concepts of a root set and a base set have been proposed by (Kleinberg, 1999) to find S . The root set is usually constructed by collecting the t highest-ranked pages for the query from a search engine such as Google or Yahoo!. However, the root set may not contain most of the strongest authorities. A base set is therefore built by including any page pointed to by a page in the root set and any page that points to a page in the root set. Figure 2 shows an example of a root set and a base set. The previously mentioned formulas can then be applied to a much smaller set, the base set, instead of the entire Web.

Information Retrieval (IR)

IR techniques are widely used in Web document searches. Among them, relevance feedback and data clustering are two of the most popular techniques used by search engines:
• An initial query is usually a wild guess. Retrieved query results are then used to help construct a more precise query or modify the database indexes (Chang & Hsu, 1999). Two relevance feedback methods, query modification and indexing modification, can be used to improve the search. • Data clustering is used to improve the search results by dividing the whole data set into data clusters. Each data cluster contains objects of high similarity, and clusters are produced that group documents relevant to the user’s query separately from irrelevant ones. For example, the formula below gives a similarity measure:
tmp156-12_thumb
where weightik is the weight assigned to termk in a document D. (Baeza-Yates, 1992).
Relevance feedback has not so far been applied to any commercial products because it requires some interaction with users, who normally prefer to use a keyword-only interface, whereas data clustering has achieved more success since it does not require any interaction with users to achieve acceptable results.

Metasearches

Metasearch engines (Dreilinger & Howe, 1997) conduct a search using several other search engines simultaneously and then present the results in some sort of integrated format. Figure 3 shows the system structure of a metasearch engine, which consists of three major components:

Figure 3. System structure of a metasearch engine

System structure of a metasearch engine
• Dispatch: Determines to which search engines a specific query is sent. The selection is usually based on network and local computational resources, as well as the long-term performance of search engines on specific query terms.
• Interface: Adapts the user’s query format to match the format of a particular search engine, which varies from engine to engine.
• Display: Raw results from the selected search engines are integrated for display to the user. Each search engine also produces different raw results from other search engines and these must be combined to give a uniform format for ease-of-use.

SQL Approaches

Structured Query Language (SQL) approaches view the World Wide Web as a huge database where each record matches a Web page, and use SQL-like languages to support effective and flexible query processing. A typical SQL-like language syntax (Konopnicki & Shmueli, 1998) is
Query:=select AttributeList from DomainSpecifications [ where SearchConditions ];
A query example is given in the following to show the use of the language.
SQL Example. Find pages in the World Wide Web Consortium (W3C) site where the pages have fewer than 2000 bytes.
select url from http://www.w3c.org/ where bytes < 2000; url is a page’s URL and each page has attributes such as bytes, keywords, and text.
Various SQL-like languages have been proposed for Web search engines. The methods introduced previously treat the Web as a graph of discrete objects; another object-oriented approach (Arocena & Mendelzon, 1998) considers the Web as a graph of structured objects. However, neither approach has achieved much success because of its complicated syntax, especially for the latter method.

Content-Based Multimedia Searches

Only a few multimedia search engines are available currently, most of which use name or keyword matching where the keywords are entered by Web reviewers rather than using automatic indexing. The low number of content-based multimedia search engines is mainly due to the difficulty of automated multimedia indexing. Numerous multimedia indexing methods have been suggested in the literature (Yoshitaka & Ichikawa, 1999), yet most do not meet the efficiency requirements of Web multimedia searches, where users expect both a prompt response and the search of a huge volume of Web multimedia data. A few content-based image and video search engines are available online (Lew, 2000). However, a de facto Web image or video search engine is still out of reach because the system’s key component image or video collection and indexing is either not yet fully automated or not practicable. Similarly, effective Web audio search engines have yet to be constructed since audio information retrieval (Foote, 1999) is considered to be one of the most difficult challenges for multimedia retrieval.

Others

Apart from the above major search techniques, some ad hoc methods worth mentioning include:
• Work aimed at making the components needed for Web searches more efficient and effective, such as better ranking algorithms and more efficient crawlers. Zhang and Dong (2000) propose a ranking algorithm based on a Markov model, which synthesizes the relevance, authority, integrativity, and novelty of each Web resource, and can be computed efficiently through solving a group of linear equations.
• Various enhanced crawlers can be found in the literature (Aggarwal, Al-Garawi, & Yu, 2001). Some crawlers are extensible, personally customized, relocatable, scalable, and Web-site-specific (Heydon & Najork, 1999).
• Artificial Intelligence (AI) can also be used to collect and recommend Web pages. The Webnaut system (Nick & Themis, 2001) learns the user’s interests and can adapt as his or her interests change over time. The learning process is driven by user feedback to an intelligent agent’s filtered selections.

MAJOR SEARCH ENGINES

Some of the currently available major commercial search engines are listed in Table 1, although many table entries are incomplete as some of the information is classified as confidential due to business considerations (Sullivan, 2003). Most search services are backed up by or are cooperating with several other services. This is because an independent or stand-alone service contains less information and thus tends to lose its users. In the table, the column Backup gives the major backup information provider, and most unfilled methods use keyword matching to locate the desired documents. Most search engines on the list not only provide Web search services but also act as portals, which are Web home bases from which users can access a variety of services, including searches, e-commerce, chat rooms, news, and so forth. Table 2 lists some major experimental search engines, which use advanced search technologies not yet implemented by the commercial search engines. The list in Table 2 is a snapshot of the current situation; the list is highly volatile either because a successful experimental search engine is usually commercialized in a short time or because a prototype system is normally removed after its founders leave the organization. The two tables list major general-purpose search engines; special-purpose search engines including specialty searches, regional searches, kid searches, and so forth, are not considered in this article. They use much smaller databases and therefore give more precise and limited search results.

Table 1. Major commercial Web search engines

No. Name URL Type Backup Method
1 AOL Search http://search.aol.com/ Hybrid SE Open Directory
2 AltaVista http://www.altavista.com/ SE LookSmart
3 Ask Jeeves http://www.askjeeves.com/ AS natural language
4 Direct Hit http://www.directhit.com/ SE HotBot hyperlink
5 Excite http://www.excite.com/ SE LookSmart
6 FAST Search http://www.alltheweb.com/ scalability
7 Google http://www.google.com/ SE hyperlink
8 HotBot http://www.hotbot.com/ Hybrid SE Direct Hit
9 Inktomi http://www.inktomi.com/ SE
10 LookSmart http://www.looksmart.com/ Directory Inktomi reviewers
11 Lycos http://www.lycos.com/ Directory Open Directory
12 MSN Search http://search.msn.com/ Directory LookSmart
13 Netscape Search http://search.netscape. com/ SE Open Directory
14 Open Directory http://dmoz.org/ Directory volunteers
15 Yahoo! http://www.yahoo.com/ Directory Google reviewers

FUTURE TRENDS
Users of search engines often submit ambiguous queries. Ambiguous queries can be categorized into four types: i) disorderly, ii) incomplete, iii) incorrect, and iv) superfluous queries. Below are examples of perfect and ambiguous queries and the ranked search results from Infoseek, at http://www.infoseek.com/, for the topic, Intelligent multimedia information retrieval, edited by Mark T. Maybury.
Table 2. Major experimental Web search engines

No. Name URL Method
1 Clever http://www.almaden.ibm.com/cs/k53/clever.html hyperlink
2 Grouper http://longinus.cs.washington.edu/grouper2.html clustering
3 ImageRover http://www.cs.bu.edu/groups/ivc/ImageRover/Home.html image
4 Inquirus http://www.neci.nj.nec.com/homepages/lawrence/inquirus.html metasearch
5 Mercator http://www.ctr.columbia.edu/metaseek/ image
6 MetaSEEk http://www.research.compaq.com/SRC/mercator/ crawler
7 W3QS http://www.cs.technion.ac.il/~konop/w3qs.html SQL
8 WebOQL http://www.cs.tornoto.edu/~gus/weboql/ Object SQL

• Perfect query: Intelligent multimedia information retrieval
1. Intelligent multimedia information retrieval
• Disorderly query: Multimedia information intelligent retrieval
1. Artificial intelligence, fuzzy logic and neural networks
2. Intelligent access to information: research in natural language, information retrieval, com puter vision, multimedia and database
3. Multimedia color PC notebooks
4. Intelligent multimedia information retrieval
• Incomplete query: Multimedia information retrieval
1. Abstract Stein Mulleller Thiel 95
2. Corpora Oct 1998 to -: Corpora: TWLT 14: lan guage technology in multimedia information
3. 3 2.1 Introduction to the workplan
6. Intelligent multimedia information retrieval
• Incorrect query: Intelligent multi-media information retrieval
1. Artificial intelligence research laboratory at Iowa State University
2. Vasant Honavar’s home in cyberspace
3. CIIR multi-media indexing
31. Intelligent multimedia information retrieval
• Superfluous query: Intelligent multimedia infor mation retrieval systems
1. Research in multimedia and multimodal parsing and generation
2. Intelligent multimedia information retrieval
This example shows that even a slight variation in the query produces significant differences among the search results. Users tend to submit ambiguous queries to search engines, most of which use the technology of keyword matching to look for the desired pages. The ambiguity creates undesired search results if keyword matching is used.

CONCLUSION

In less than a decade, the World Wide Web has become one of the three major media, with the other two being print and television. Searching for Web pages is both one of the most common tasks performed on the Web and one of the most frustrating and problematic. This article gave an overview of the current technologies for Web search engines with an emphasis on non-traditional approaches and classified the technologies into six categories. However, apart from the traditional keyword matching techniques, no one method dominates Web search engine technologies. The major reason for this is that the amount of information posted on the World Wide Web is huge and the page formats vary widely.
Another problem with search engines is the skewed search results because of the corporate payments to search engines to boost their rankings. Most search engines list non-commercial pages for free, whereas they charge commercial pages for listing. Take the Google search engine as an example. Other than waiting for their pages being crawled and listed on the Google, the Web content providers may submit their pages via “Add your URL to Google” at http://www.google.com/addurl.html. On the other hand, they may advertise with the Google by using the following two methods:
• AdWords: With AdWords customers create their own ads, choose keywords to tell the Google where to show their ads and pay only when someone clicks on them.
• AdSense: It is for the Web content providers who want to make more revenue from advertising but do not want to serve untargeted ads to their users. AdSense solves this dilemma by delivering text-based AdWords ads that are relevant to what readers see on the pages.
Though the search results are distorted by corporate payments, it is necessary for the search engines to survive and improve their services. Also, the users can normally tell the differences between the paid contents and the ranked search results because they are usually at the different places on the result pages.

KEY TERMS

Authority: A Web site that provides the best source of information on a specific topic.
Automatic Indexing: A process that algorithmically examines information items to build a data structure that can be quickly searched.
Crawler/Spider: A program that automatically scans various Web sites and collects Web documents from them. It follows the links on a site to find other relevant pages and is usually used to feed pages to search engines.
Directory: Other than search engines, directories provide another approach of Web searches. A directory is a subject guide, typically organized by major topics and subtopics, which are created based on the submissions from either Webmasters or editors who have reviewed the pages.
Hub: A Web site that provides collections of links to authorities.
Hyperlinks: A hyperlink is a selectable connection from one word, phrase, picture, or information object to another. By clicking on a hyperlink, a Web user can move easily from one Web page to another page. The most common form of hyperlinks is the highlighted word, phrase, or picture.
Metasearch Engine: It conducts a search using several other search engines simultaneously and then presents the results in some sort of integrated format. This lets users see at a glance which particular search engine returned the best results for a query without having to search each one individually.
Search Engine: A software system such as Google and Alta Vista that searches documents on the World Wide Web and USENET newsgroups for specified keywords and returns a list of the documents which are relevant to the keywords.
Structured Query Language (SQL): It is a standard interactive and programming language for accessing and manipulating a database. Its commands include selection, insertion, update, deletion, finding out the location of data, and so forth.
Uniform Resource Locator (URL): It is the address of an object accessible on the Internet. The object could be an HTML document, a text file, an image file, a program such as a common gateway interface application, and so forth. They are mainly used in HTML documents to specify the target of a hyperlink.

Next post:

Previous post: