Web Search via Learning from Relevance Feedback

INTRODUCTION

Recently, three general approaches have been taken to increase Web search accuracy and performance. One is the development of meta-search engines (e.g., MetaCrawler, www.metacrawler.com) that forward user queries to multiple search engines at the same time in order to increase the coverage and hope to include what the user wants in a short list of top-ranked results. Another approach is the development of topic-specific search engines that are specialized in particular topics. These topics range from vacation guides (www.vocations.com) to kids’ health (www.kidshealth.com). The third approach is to use some group or personal profiles to personalize the Web search. Examples of such efforts include GroupLens (Konstan et al., 1997).
Those meta-search engines suffer to a certain extent the inherited problem of information overflow that it is difficult for users to pin down specific information for which they are searching. Specialized search engines typically contain much more accurate and narrowly focused information. However, it is not easy for a novice user to know where and which specialized engine to use. Most personalized Web search projects reported so far involve collecting users’ behavior at a centralized server or a proxy server. While it is effective for the purpose of e-commerce, where vendors can collectively learn consumer behaviors, this approach does present the privacy problem.
The clustering, user profiling and other advanced techniques used by those search engines and other projects such as Bollacker et al. (1998) are static in the sense that they are built before the search begins. They cannot be changed dynamically during the real-time search process. Thus they do not reflect the changing interests of the user at different time, at different location or on different subjects. Intelligent Web search systems that dynamically learn the users’ information needs in realtime must be built to advance the state of the art in Web search. Machine learning techniques can be used to improve Web search, because machine learning algorithms are able to adjust the search process dynamically so as to satisfy users’ information needs.


BACKGROUND

There have been great research efforts on applications of machine learning to automatic extraction, clustering and classification of information from the Web. Some earlier research includes WebWatcher (Armstrong et al., 1995), which interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information; Syskill and Webert (Pazzani et al., 1996), a system that uses a Bayesian classifier to learn about interesting Web pages for the user; and NewsWeeder (Lang, 1995), a news-filtering system that allows the users to rate each news article being read and learns a user profile based on those ratings. Some research is aimed at providing adaptive Web service through learning. For example, Ahoy! The Homepage Finder (Shakes et al., 1997) performs dynamic reference shifting; and Adaptive Web Sites (Perkowitz & Etzioni, 2000) automatically improves their organization and presentation based on user access data.
A series of work in Chen et al. (1999, 2000, 2001, 2002) and Meng and Chen (2004) study intelligent Web search as an adaptive learning process, where the search engine acts as a learner and the user as a teacher. The user sends a query to the engine, and the engine uses the query to search the index database and returns a list of URLs that are ranked according to a ranking function. Then the user provides the engine relevance feedback, and the engine uses the feedback to improve its next search and returns a refined list of URLs. The learning (or search) process ends when the engine finds the desired documents for the user. Conceptually, a query entered by the user can be understood as the logical expression of the collection of the documents wanted by the user. A list of URLs returned by the engine can be interpreted as an approximation to the collection of the desired documents.

ADAPTIVE LEARNING FOR WEB SEARCH

Let X denote the set of all index keywords for the whole Web (or, practically, a portion of the whole Web). Given any Web document d, let 1(d) denote the set of all index keywords in X that are used to index d with non-zero values. Then, the following two properties hold. (1) The size of 1(d) is substantially smaller than the size of X. Practically, 1(d) can be bounded by a constant. The rationale behind this is that in the simplest case only a few of the keywords in d are needed to index it. (2) For any search process related to the search query q, let D(q) denote the collection of all the documents that match q; then the set of index keywords relevant to q, denoted by F(q), is F(q) d D{q) 1 (d). Although the size of F(q) varies from different queries, it is still substantially smaller than the size ofX, and might be bounded by a few hundred or a few thousand in practice.
Definition 1 - Given any search query q, F(q), which is given in the previous paragraph, is defined as the set of dynamic features relevant to the search query q.
Definition 2 - Given any search query q, the dynamic vector space V(q) relevant to q is defined as the vector space that is constructed with all the documents in D(q) such that each of those documents is indexed by the dynamic features in F(q).
Adaptive learning for intelligent Web search as studied in Chen et al. (1999, 2000, 2001, 2002) and Meng and Chen (2004) is formulated as follows. Let S be a Web search system. For any query q, S first finds the set of documents D(q) that match the query q. It finds D(q) with the help of a general-purpose search strategy through searching its internal database, or through external search engine such as AltaVista (www.altavista.com) when no matches are found within its internal database. It then finds the set of dynamic features F(q), and later constructs the dynamic vector space V(q). Once D(q), F(q) and V(q) have been found, S starts its adaptive learning process with the help of the learning algorithm that is to be presented in the following subsections. More precisely, let F(q) {K1,.., Kn} such that each K denotes a dynamic feature (i.e., an index keyword). S maintains a common weight vector w (w1,… , wn) for dynamic features in F(q). The components of w have non-negative real values. The learning algorithm uses w to extract and learn the most relevant features and to classify documents in D(q) as relevant or irrelevant.
Practically efficient adaptive learning algorithms such as TW2 have been developed in (Chen et al. (1999, 2000,

Figure 1. Algorithm TW2

Algorithm TW2
2001, 2002) and Meng and Chen (2004). For example, for each query q entered by the user, algorithm TW2 uses a common weight vector w and a real-valued threshold q to classify documents in D(q). Let a > 1 be the promotion and demotion factor. TW2 classifies documents whose
tmp128-12_thumb
and all others as irrelevant. When the user responds with a document that may or may not contradict to the current classification, TW2 updates the weights through promotion or demotion. The detailed description of TW2 is given in Figure 1.
In contrast to the linear lower bounds proved for Rocchio’s similarity-based relevance feedback algorithm (Chen & Zhu, 2002), TW2 has a very small mistake bounds for learning any collection of documents represented by a disjunction of a small number of relevant features. As pointed out in Chen et al. (2002) that to learn a collection of documents represented by a disjunction of at most k relevant features (or index keywords) over the n-dimen-sional Boolean vector space, TW2 makes at most
tmp128-13_thumb
number of dynamic features occurring in the learning process.
Other multiplicative adaptive algorithms for user preference retrieval were designed in Chen (2001) and Meng and Chen (2004). It was shown that those two algorithms have substantially better performances than Rocchio’s similarity-based relevance feedback and the gradient descent procedure (Wong et al., 1988).

SYSTEM IMPLEMENTATION

Several experimental intelligent Web search systems have been implemented and analyzed in Chen et al. (1999, 2000, 2001, 2002) Meng and Chen (2004). An exemplary architecture of those systems is shown in Figure 2. In general, such a system has a graphic user interface to allow the user to enter his/her query and to specify the number of the top matched document URLs to be returned. It may or may not maintain an internal index database of documents. It has a meta-search module to query a set of general-purpose search engines such as AltaVista whenever needed. When the user enters a query and starts a search process, it first searches its internal index database. If no relevant documents can be found within its index database then it receives a list of top matched documents externally with the help of its meta-search module.

Figure 2. System architecture

System architecture
At each iteration, the system provides the user a short list of documents so that the user can judge whether a document is relevant or not. After the user finishes judging documents as relevance feedback to the system, the query/feedback parser parses out the feedback and sends it to the adaptive learning module for the underlying learning algorithms to learn the user’s preference. At the end of the current iteration of learning, the system re-ranks the documents with its document ranker and displays a short list of documents to the user. At each iteration, the dispatcher parses query or relevance feedback information from the interface and decides which of the following components should be invoked to continue the search process: adaptive learning module, or index database, meta search module, or other module if there is one. The document parser and indexer can parse and index documents returned by the meta search module, and send those documents to the index database or document ranker. The other module can have additional functions for clustering, profiling, personalizing or feature extracting, to enhance the capability of the adaptive learning module.
The experimental systems in Chen et al. (1999, 2000, 2001, 2002) and Meng and Chen (2004) are evaluated with performance metrics of relative recall, precision, response time, and/or ranking accuracy. For example, relative recall and precision are defined as follows. For any query q, the relative recall and precision are
tmp128-15_thumb
where R is the total number of relevant documents among the set of the retrieved documents, and Rm is the number of relevant documents ranked among the top m positions in the final search result of the search engine. Empirical analytical results in Chen et al. (1999, 2000, 2001, 2002) and Meng and Chen (2004) all show that adaptive learning algorithms significantly increase Web search performance and accuracy in comparison to general-purpose search engines such as AltaVista.

FUTURE TRENDS

With an overwhelming amount of information available on the Web, using an adaptive learning approach to refine Web search results will become the natural next step of Web search. Users can interact with search engines to refine their queries to narrow down the collection of search results. In this refinement process, the search engine is learning the user’s information needs to provide more accurate search results. During this interactive process, the search engines can also pick up other information automatically or manually regarding this particular search from the user. Such information may include past search history, and user preferences, among others. With the capability of interacting with users, the search engines will become more intelligent and a true assistant in serving users’ information needs.

CONCLUSION

This article presents an overview of research on intelligent Web search through adaptive learning from relevance feedback (Chen et al., 1999, 2000, 2001, 2002; Meng & Chen, 2004).
The authors believe that other techniques such as Web mining and clustering shall be used to further enhance the performance of the adaptive learning algorithms. Adaptive learning approach to intelligence Web search requires the user to interact with the search engine. Through such interaction the user provides relevance feedback to the adaptive learning algorithm used by the search engine to improve search results. Adaptive learning algorithms cannot work well without users’ interaction of relevance feedback. On other hand, Web mining and clustering can uncover hidden patterns of user access behaviors and associations of these patterns with groups of users from large Web access logs and user profiles. These patterns and associations can be used to predict user interests or information needs in order to help the search engine to filter out undesired Web pages and to recommend desired ones. Web mining and clustering are essentially retrospective techniques, and cannot adaptively interact with the user. Once incorporated with an adaptive learning algorithm, these techniques can help the learning algorithm to build the initial search result for the user before the interaction starts. They can also help the adaptive learning algorithm to include, or exclude, some of the Web pages, when a new search result is computed with respect to the user’s recent relevance feedback.

KEY TERMS

Adaptive Learning: An algorithm that can learn a hidden concept from interactive feedback provided by a user or the underlying environment.
Document Ranking: A function that scores documents in a collection according to their relevance to a given query so that the more relevant a function is, the higher score it has.
Intelligent Web Search: A Web search system that learns a user’s information preference.
Precision: Precision is the percentage of relevant document in a given query in the set of documents that are returned by an information retrieval system.
Recall: Recall is the percentage of relevant documents in a given query in a collection that are returned by an information retrieval system.
Relevance Feedback: The information that is judged by a user and provided to an information system.
Vector Space: A model that represents every document in a collection with an n-dimensional vector for some constant n.

Next post:

Previous post: