Efficient Clustering for Orders - Mining Complex Data

Information Technology Reference

In-Depth Information

where φ (

) is a normal cumulative

distribution function. Under the minimum square error criterion of this model [14], μ l ,

which is a linearly transformed image of μ l , is analytically derived as

) is a normal distribution density function, and Φ (

Φ − 1 Pr[ x l

x ] ,

μ l =

(15.7)

X C k |

x∈X C k

where X C k = O i ∈C k X i .Thevalueof μ l is derived for each object in X C k . Finally,

the central order O k can be derived by sorting according to the corresponding μ l .Be-

cause the resultant partition by k -o'means-TMSE is dependent on the initial cluster, this

algorithm is run multiple times, randomly changing the initial cluster; then, the partition

minimizing the following total error is selected:

d ( O i , O k ) .

(15.8)

C k ∈π

O i ∈C k

This k -o'means-TMSE could successfully find the cluster structure in a set of incom-

plete orders due to the following reason: Because the dissimilarity in Section 15.3.1

was measured between two orders, the precision of the dissimilarities was unstable. On

the other hand, in the case of k -o'means-TMSE, central orders are calculated based on

the

is generally much larger than two, and much more information

is available; thus, the central order can be stably calculated. The dissimilarity between

the central orders and each sample order can be stably measured, too, because all of

objects in a sample order always exist in the corresponding central order and so the full

information in the sample orders can be considered.

However, the k -o'means-TMSE is not so efficient in terms of time and memory com-

plexity. Time or memory complexity in N and K is linear, and these are efficient. How-

ever, complexity in terms of L ∗ is quadratic, and further, the constant factor is rather

large due to the calculation of the inverse function of a normal distribution. Due to this

inefficiency, this algorithm cannot be used if L ∗ is large. To overcome this inefficiency,

we propose a new method in the next section.

C k |

orders.

C k |

15.3.3

k -o'means-EBC (Expected Borda Count)

To improve efficiency in computation time and memory requirement, though we used

the k -o'means framework in Figure 15.1 and the dissimilarity measure d ρ of equa-

tion (15.4) in step 4 of Figure 15.1, we employed other types of derivation procedures

for the central orders.

Below, we describe this derivation method for a central order O k of a cluster C k

in step 3 of Figure 15.1. We call this the Expected Borda Count (EBC) method, and

our new clustering method is called a k -o'means-EBC algorithm. The Borda Count

method is used to derive central orders from complete orders; we modified this so as to

make it applicable to incomplete orders. The Borda Count method [15] was originally

developed for determining the order of candidates in an election from a set of ranking

votes. A set of complete orders, C k , is given. First, for each object x j in X ∗ ,thevote

count is calculated:

Mining Complex Data

Search WWH ::

Custom Search

Home