Information Technology Reference
In-Depth Information
Further, if all the sample orders are first converted into the expected rank vectors,
E[
r
1
|
O
i
]
,...,
E[
r
L
∗
|
, then an original
k
-means algorithm is applied to these vec-
tors. One might suppose that this
k
-means is equivalent to our
k
-o'means-EBC, but
this is not the case. A
k
-means is different from this
k
-o'means-EBC in terms of the
derivation of centers; In the
k
-means case, the mean vectors of the expected ranks are
directly used as cluster centers; in a
k
-o'means case, these means are sorted and con-
verted to rank values. Therefore, in the
k
-means case, the centers that correspond to
the same central orders are simultaneously kept during clustering. For example, two
mean rank vectors
O
i
]
1
.
2
,
1
.
5
,
4
.
0
and
1
,
5
,
10
, correspond to the same central order
x
1
x
3
, but these two vectors are not differentiated. On the other hand, in
a
k
-o'means-EBC algorithm, they are considered as equivalent, and thus we suppose
that the
k
-o'means-EBC algorithm can find the cluster structure reflecting the ordinal
similarities among data.
x
2
15.4
Experiments on Artificial Data
We applied the algorithms in Section 15.3 to two types of data: artificially generated
data and real questionnaire survey data. In the the former experiment, we examined
the characteristics of each algorithm. In the latter experiment of the next section, we
analyzed a questionnaire survey data on preferences in sushi.
15.4.1
Evaluation Criteria
The evaluation criteria for partitions was as follows. The same object set was divided
into two different partitions: a true partition
π
∗
and an estimated one
π
. To measure the
difference of
π
from
π
∗
, we adopted the
ratio of information loss
(RIL) [18], which
is also called the uncertainty coefficient in numerical taxonomy literature. The RIL is
the ratio of the information that is not acquired to the total information required for
estimating a correct partition. This criterion is defined based on the contingency table
for indicator functions [11]. The indicator function
I
((
x
a
,x
b
)
,π
) is 1 if an object pair
(
x
a
,x
b
) are in the same cluster; otherwise, it is 0. The contingency table is a 2
×
2
matrix consisting of elements,
a
st
, that are the number of object pairs satisfying the
condition
I
((
x
a
,x
b
)
,π
∗
)=
s
and
I
((
x
a
,x
b
)
, π
)=
t
, among all the possible object pairs.
RIL is defined as
s
=0
t
=0
a
st
a
··
a
·t
a
st
log
2
RIL =
,
(15.17)
s
=0
a
s·
a
··
a
··
a
s·
log
2
where
a
·t
=
s
a
st
,
a
s·
=
t
a
st
,and
a
··
=
s,t
a
st
. The range of the RIL is [0
,
1];
it becomes 0 if two partitions are identical.
15.4.2
Data Generation Process
Test data were generated in the following two steps: In the first step, we generated the
K
orders to be used as central orders. One permutation (we called it a
pivot
) consist-
ing of all objects in
X
∗
was generated. The other
K
−
1 centers were generated by