Automatic Categorization of Web Database Query Results - Advanced Database Query Systems

Database Reference

In-Depth Information

Figure 1. Tree generated by the Greedy method

example, when a user submits a query to MSN House&Home Web site to search for a house located in

Seattle with a price between $200,000 and $300,000, 1,256 tuples are returned. Information overload

makes it hard for the user to separate the interesting items from the uninteresting ones, and thereby lead

to a huge wastage of user's time and effort. In such a situation, the user would pose a broad query in the

beginning to avoid exclusion of potentially interesting results, and then iteratively refine their queries

until a few answers matching their preferences are returned. However, this iterative procedure is time-

consuming and many users will give up before they reach the final stage.

In order to resolve the problem of “information overload”, two types of solutions have been proposed.

The first type categorizes the query results into a category tree (Chakrabarti, Chaudhuri & Hwang, 2004;

Chen & Li, 2007), and second type ranks the results (Agrawal, Chaudhuri, Das & Gionis, 2003; Agrawal,

Rantzau &Terzi, 2006; Bruno, Gravano & Marian, 2002; Chaudhuri, Das, Hristidis & Weikum, 2004;

Das, Hristidis, Kapoor & Sudarshan, 2006). The success of both approaches depends on the utilization

of user preferences. But these approaches always assume that all users have the same user preferences,

but in real life different users often have different preferences. Let us look at the following example.

Example 1. Consider a real estate searching Web site. Figure 1 and Figure 2 respectively show a fraction

of category trees generated by using the methods of Greedy (Chakrabarti, Chaudhuri & Hwang, 2004)

and C4.5-Categorization (Chen & Li, 2007) over 214 houses returned by a query with the condition

“Price between 250000 and 350000 ∧ City = Seattle”. Each of tree nodes specifies the range or equality

conditions on an attribute, and the number in the parentless is the number of tuples satisfying all condi-

tions from the root to the current node. Users can use this tree to select the houses they are interested in.

Consider three users U 1 , U 2 , and U 3 . Assume that U 1 prefers houses with large square, U 2 prefers

houses with water views, and U 3 prefers both water views and Burien living area. The Greedy method

assumed that all users have the same preferences. As a result, attributes “Livingarea” and “Schooldistrict”

are placed at the first two levels of the tree because more users are concerned with “Livingarea” and

“Schooldistrict” than other attributes. However, there may be some users (such as U 2 and U 3 ) who want

to first visit the large square and water view houses. Then they have to visit many nodes if they go along

with the tree built in Figure 1. Considering the diversity of user preferences and the cost of both visiting

intermediate nodes and leaf nodes, the C4.5-Categorization method took advantage of C4.5 algorithm

Search WWH ::

Custom Search

Home