Automatic Categorization of Web Database Query Results - Advanced Database Query Systems


Figure 2. Tree generated by the C4.5-Categorization method

to create the navigational tree. But the created category tree (Figure 2) has two drawbacks: (i) the tuples

under the intermediate nodes cannot be explored by the users, i.e., users can only access the tuples under

the leaf nodes but cannot examine the tuples in the intermediate nodes; (ii) the cost of visiting the tuples

of an intermediate node is not considered if the user chooses to explore the tuples of that node.

User preferences are often difficult to obtain because users do not want to spend extra effort specifying

their preferences. There are thus two major challenges in addressing the diversity of user preferences:

(i) how to summarize different kinds of user preferences from the behavior of all users already in the

system, and (ii) how to categorize or rank the query results according to the specific user preferences.

Query history has been widely applied to infer the preferences of all users in the system (Agrawal,

Chaudhuri, Das & Gionis, 2003; Chaudhuri, Das, Hristidis & Weikum, 2004; Chakrabarti, Chaudhuri

& Hwang, 2004; Das, Hristidis, Kapoor & Sudarshan, 2006).

In this chapter, we present techniques to automatically categorize the results of user queries on Web

databases in order to reduce information overload. We propose a two-step approach to address both

challenges for the categorization case. The first step analyzes the query history of all users already in the

system offline and then generates a set of clusters over the data. Each cluster corresponds to one type of

user preference and is associated with the probability that users are interested in that cluster. We assume

that an individual user's preference can be represented as a subset of these clusters. When a specific user

submits a query, the second step first computes the similarity between the query and the representative

queries in the query clusters, and from this infers the data clusters the user may be interested in.

Next, the set of data clusters generated in the first step is intersected with the query answers,

and then a labeled hierarchical category structure is generated automatically based on the contents of the

tuples in the answer set. Consequently, a category tree is automatically constructed over these intersected

clusters on the fly. This tree is finally presented to the user.
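
The online step described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the QueryCluster record, the Jaccard similarity over query conditions, the 0.3 threshold, and the home-listing sample data are all assumptions made for illustration. The sketch only shows how clusters produced offline might be matched against an incoming query and then intersected with that query's answer set; building and labeling the category tree is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryCluster:
    """Hypothetical output of the offline step: one type of user preference."""
    representative: frozenset  # conditions of the cluster's representative query
    probability: float         # probability that users are interested in the cluster
    tuple_ids: frozenset       # ids of the tuples in the associated data cluster


def jaccard(a, b):
    """Jaccard similarity between two sets of query conditions (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevant_clusters(query, clusters, threshold=0.3):
    """Keep the clusters whose representative query is similar enough to the query."""
    return [c for c in clusters if jaccard(query, c.representative) >= threshold]


def intersect_with_answers(clusters, answer_ids):
    """Restrict each selected data cluster to the tuples in the query's answer set."""
    parts = []
    for c in clusters:
        overlap = c.tuple_ids & answer_ids
        if overlap:  # drop clusters that share no tuple with the answers
            parts.append((c, overlap))
    # Most probable preference type first.
    parts.sort(key=lambda pair: pair[0].probability, reverse=True)
    return parts


# Hypothetical clusters, as if produced offline from the query history.
clusters = [
    QueryCluster(frozenset({"city=Seattle", "price<300k"}), 0.6, frozenset({1, 2, 3})),
    QueryCluster(frozenset({"city=Austin", "bedrooms>=4"}), 0.4, frozenset({4, 5})),
]

query = frozenset({"city=Seattle", "bedrooms>=3"})  # conditions of the incoming query
answers = frozenset({2, 3, 4})                      # tuple ids returned by the query

selected = relevant_clusters(query, clusters)
for cluster, tuples in intersect_with_answers(selected, answers):
    print(sorted(cluster.representative), cluster.probability, sorted(tuples))
```

In this sketch only the first cluster shares a condition with the query, so only its tuples that also appear in the answer set would be passed on to tree construction. A set-based similarity is used purely for brevity; any measure over query conditions could take its place.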

This chapter presents a domain-independent approach to addressing the information overload problem.

The contributions are summarized as follows:
