Automatic Categorization of Web Database Query Results - Advanced Database Query Systems

Database Reference

In-Depth Information

Figure 6. Algorithms' performance for the queries clustering

is the number of vector elements in <attribute, value> pairs set of Δ (note that, each query in the query

history is translated into vector representations), m is the number of input queries, and l is the number

of true underlying clusters. We set '0' and '1' at random on the n elements and we then generate l ran-

dom queries by sampling at random the space of all possible permutations of n elements. These initial

queries form the centers around which we build each one of the clusters. The task of the algorithms is

to rediscover the clustering model used for the data generation. Given a cluster center, each query from

the same cluster is generated by adding to the center a specified amount of noise of a specific type. We

consider two types of noise: swaps and shifts . The swap means that '0' and '1' elements from the initial

order are picked and their positions in the order are exchanged. For the shifts we pick a random element

and we move it to a new position, either earlier (or later) in the order. All elements that are between the

new and the old positions of the element are shifted on position down (or up). The amount of noise is

the number of swaps or shifts we make.

We experiment with datasets generated for the following parameters: n = 300, m = 600, l = {4, 8, 16,

32}, noise = {2, 4, 8,…, 128} for swaps. Figure 6 shows the performance of the algorithms as a function

of the amount of noise. The y axis is the ratio: F ( A )/ F ( INP ), for A = {Greedy, Greedy-Refine}, where

the Greedy- Refine algorithm is proposed in this chapter (Algorithm 3), while the Greedy algorithm

is proposed in [2]. We compare them here since they all aim at solving the same problem (clustering

problem). The F ( A ) is the total cost of the solution provided by algorithm A when the distance showed

in Equation (3) is used as a distance measure between queries. The F ( INP ) corresponds to the cost of

the clustering structure (Equation 4) used in the data generation process.

From Figure 6 we can see that: Greedy- Refine algorithm performs greatly better than Greedy algo-

rithm. The reason is that: the Greedy- Refine is executed on the queries which were arranged according

Search WWH ::

Custom Search

Home