Efficient Clustering for Orders - Mining Complex Data

Information Technology Reference

In-Depth Information

d ρ ( O 1 ,O 2 )=1

−

ρ ( O 1 ,O 2 ) .

(15.4)

Since the range of ρ is [

1 , 1], this dissimilarity ranges [0 , 2]. This dissimilarity

becomes 0 if the two orders are concordant.

−

15.3

Methods

Here, we describe exiting clustering methods and our new clustering method.

15.3.1

Hierarchical Clustering Methods

In the literature of psychometrics, questionnaire data obtained by a ranking method have

been processed by traditional clustering techniques [4]. First, for all pairs of orders in

S , the dissimilarities in Section 15.2.1 are calculated, and a dissimilarity matrix for S is

obtained. Next, this matrix can be clustered by standard hierarchical clustering methods,

such as the group average method. In these survey researches, the size of the processed

data set is rather small ( N< 1000 ,L ∗ < 10, L i < 10). Therefore, hierarchical

clustering methods could cluster order sets, even though the time complexity of these

methods is O ( N 2 log( N )) under non-Euclidean metric [7] and is costly. However, these

method cannot be applied to a large-scale data, due to their computational cost.

Additionally, when the number of objects, L ∗ , is large, it is hard for respondents

to sort all objects in X ∗ . Therefore, sample orders are generally incomplete, i.e.,

X ( O i )

X ∗ , the dissimilarities cannot be calculated because the dissimilarity mea-

sures are defined between two orders consisting of the same objects. One way to deal

with incomplete orders is to introduce the notion of an Incomplete Order Set (IOS) 1 [4],

which is defined as a set of all possible complete orders that are concordant with the

given incomplete order. Given the incomplete order O that consists of the object set X ,

an IOS is defined as

⊂

O i |

O i

is concordant with O, X ( O i )= X ∗ }

ios( O )=

{

This idea is not fit for large-scale data sets because the size of the set is ( L ∗ ! /L !),which

grows exponentially in accordance with L ∗ . Additionally, there are some difficulties in

defining the distances between the two sets of orders. One possible definition is to adopt

the arithmetic mean of the distances between orders in each of the two sets. However,

this is not distance because d (ios a , ios a ) may not be 0. Therefore, more complicated

distance, i.e., Hausdorff distance, has to be adopted.

Since the above IOS cannot be derived for a large-scale data set, we adopted the

following heuristics in this paper. In such cases, the dissimilarity between the orders is

determined based on the the objects included in both. Take, for example, the following

two orders:

O 1 = x 1

x 3

x 4

x 6 ,

O 2 = x 5

x 4

x 3

x 2

x 6 .

1 In [4], this notion is referred by the term incomplete ranking , but we have adopted IOS to insist

that this is a set of orders.

Mining Complex Data

Search WWH ::

Custom Search

Home