Efficient Clustering for Orders - Mining Complex Data

Information Technology Reference

In-Depth Information

Further, if all the sample orders are first converted into the expected rank vectors,

E[ r 1 |

O i ] ,..., E[ r L ∗ |

, then an original k -means algorithm is applied to these vec-

tors. One might suppose that this k -means is equivalent to our k -o'means-EBC, but

this is not the case. A k -means is different from this k -o'means-EBC in terms of the

derivation of centers; In the k -means case, the mean vectors of the expected ranks are

directly used as cluster centers; in a k -o'means case, these means are sorted and con-

verted to rank values. Therefore, in the k -means case, the centers that correspond to

the same central orders are simultaneously kept during clustering. For example, two

mean rank vectors

O i ]

1 . 2 , 1 . 5 , 4 . 0

and

1 , 5 , 10

, correspond to the same central order

x 1

x 3 , but these two vectors are not differentiated. On the other hand, in

a k -o'means-EBC algorithm, they are considered as equivalent, and thus we suppose

that the k -o'means-EBC algorithm can find the cluster structure reflecting the ordinal

similarities among data.

x 2

15.4

Experiments on Artificial Data

We applied the algorithms in Section 15.3 to two types of data: artificially generated

data and real questionnaire survey data. In the the former experiment, we examined

the characteristics of each algorithm. In the latter experiment of the next section, we

analyzed a questionnaire survey data on preferences in sushi.

15.4.1

Evaluation Criteria

The evaluation criteria for partitions was as follows. The same object set was divided

into two different partitions: a true partition π ∗ and an estimated one π . To measure the

difference of π from π ∗ , we adopted the ratio of information loss (RIL) [18], which

is also called the uncertainty coefficient in numerical taxonomy literature. The RIL is

the ratio of the information that is not acquired to the total information required for

estimating a correct partition. This criterion is defined based on the contingency table

for indicator functions [11]. The indicator function I (( x a ,x b ) ,π ) is 1 if an object pair

( x a ,x b ) are in the same cluster; otherwise, it is 0. The contingency table is a 2 × 2

matrix consisting of elements, a st , that are the number of object pairs satisfying the

condition I (( x a ,x b ) ,π ∗ )= s and I (( x a ,x b ) , π )= t , among all the possible object pairs.

RIL is defined as

s =0 t =0

a st

a ··

a ·t

a st

log 2

RIL =

(15.17)

s =0

a s·

a ··

a s·

log 2

where a ·t = s a st , a s· = t a st ,and a ·· = s,t a st . The range of the RIL is [0 , 1];

it becomes 0 if two partitions are identical.

15.4.2

Data Generation Process

Test data were generated in the following two steps: In the first step, we generated the

K orders to be used as central orders. One permutation (we called it a pivot ) consist-

ing of all objects in X ∗ was generated. The other K

−

1 centers were generated by

Mining Complex Data

Search WWH ::

Custom Search

Home