CATEGORY TREE CONSTRUCTION
This section proposes the category tree construction algorithm. Section 5.1 gives an overview of the algorithm, and Section 5.2 presents a novel partitioning criterion that considers the cost of visiting both intermediate and leaf nodes.
Algorithm Overview
A category tree is very similar to a decision tree, and there are many well-known decision tree construction algorithms such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman, Friedman & Stone, 1984). However, existing decision tree construction algorithms aim at minimizing the impurity of the data (Quinlan, 1993), represented by measures such as information gain. Our goal is instead to minimize the category cost, which includes both the cost of visiting intermediate tree nodes (plus the cost of visiting the tuples in an intermediate node, if the user explores it) and the cost of visiting the tuples stored in the leaf nodes.
To build the category tree, we make use of some ideas from the solution presented by Chen and Li (Chen & Li, 2007) and propose improved algorithms. The problems our algorithm has to resolve include (i) eliminating a subset of relatively unattractive attributes without considering any of their partitions, and (ii) for every attribute retained, obtaining a good partition efficiently instead of enumerating all possible partitions. Finally, we construct the category tree by choosing the attribute and the partition with the least cost.
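The greedy selection step above can be sketched as follows. This is a minimal illustration, not the paper's method: `candidate_partitions` and `category_cost` are hypothetical stand-ins for the partitioning criterion and cost model of Section 5.2 (here a toy surrogate cost that merely favours balanced binary splits).

```python
def candidate_partitions(values):
    # Toy candidate set: binary splits "v < pivot" on each distinct value.
    pivots = sorted(set(values))[1:]
    return [lambda v, p=p: v < p for p in pivots]

def category_cost(tuples, attr, split):
    # Toy surrogate cost: sum of squared group sizes, which favours
    # balanced splits. The real criterion (Section 5.2) also charges
    # for visiting intermediate nodes, not only leaf tuples.
    left = sum(1 for t in tuples if split(t[attr]))
    right = len(tuples) - left
    return left * left + right * right

def best_split(tuples, attributes):
    """Pick the (attribute, partition) pair with the least cost."""
    best = None
    for attr in attributes:
        for split in candidate_partitions([t[attr] for t in tuples]):
            c = category_cost(tuples, attr, split)
            if best is None or c < best[0]:
                best = (c, attr, split)
    return best

# Hypothetical real-estate tuples for illustration.
houses = [{"Price": 100, "SqFt": 900},
          {"Price": 250, "SqFt": 1500},
          {"Price": 400, "SqFt": 2100},
          {"Price": 550, "SqFt": 2600}]
cost, attr, _ = best_split(houses, ["Price", "SqFt"])
```

Constructing the full tree would apply `best_split` recursively to each resulting partition until the cost of a leaf is smaller than the cost of splitting it further.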
Reducing the Choices of Categorizing Attribute
Since the presence of a selection condition on an attribute in the query history reflects the user's interest in that attribute, attributes that occur infrequently in the query history can be omitted when constructing the category tree. Let N(A) be the number of queries in the query history that contain a selection condition on attribute A, and let N be the total number of queries in the query history. We eliminate uninteresting attributes as follows: if an attribute A occurs in less than a fraction x of the queries in the query history, i.e., N(A)/N < x, we eliminate A. The threshold x needs to be specified by the system's domain expert. For example, for the real estate search application with x = 0.4, only 7 attributes, namely Price, SqFt, Livinarea, View, Neighborhood, Schooldistrict, and Bedrooms, are retained from among the 25 attributes of the MSN House&Home dataset. The algorithm for eliminating uninteresting attributes is shown in Algorithm 5.
Algorithm 5. Eliminating uninteresting attributes
Input: pruned query history H, threshold x
Output: A_R = {A_1, …, A_k}, the set of attributes retained after eliminating
the uninteresting attributes in H
1. For each attribute A_i of the attribute set A_H appearing in H, compute
N(A_i)/N, and eliminate A_i if N(A_i)/N < x.
2. Output A_R = {A_1, …, A_k}.
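Algorithm 5 amounts to a single frequency-counting pass over the query history. A minimal sketch, assuming each history entry is represented simply as the set of attributes its selection conditions mention (the toy attribute names below are illustrative, not from the dataset):

```python
from collections import Counter

def eliminate_uninteresting(query_history, threshold):
    """Retain attributes whose selection-condition frequency N(A)/N
    in the query history is at least the threshold x (Algorithm 5)."""
    n = len(query_history)  # N: total number of queries
    counts = Counter()      # counts[A] = N(A)
    for query_attrs in query_history:
        for a in set(query_attrs):  # count each attribute once per query
            counts[a] += 1
    return {a for a, c in counts.items() if c / n >= threshold}

# Toy history: each entry is the attribute set one query conditions on.
history = [
    {"Price", "Bedrooms"},
    {"Price", "SqFt"},
    {"Price"},
    {"Garage"},  # 1/4 = 0.25 of queries, below x = 0.4
]
print(eliminate_uninteresting(history, 0.4))  # → {'Price'}
```

With x = 0.4, only Price (3/4 = 0.75 of queries) survives; Bedrooms, SqFt, and Garage each appear in 0.25 of the queries and are eliminated.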