Efficient Clustering for Orders - Mining Complex Data

Information Technology Reference

In-Depth Information

In this case, respondents' sensation patterns are represented by orders, and analysis

techniques for orders are required.

Orders are also useful when the absolute level of observations cannot be calibrated.

For example, when analyzing DNA microarray data, in order that the same fluoresce

level represents the same level of gene expression, experimental conditions must be

calibrated. However, DNA databases may consist of data sampled under various condi-

tions. Even in such cases, the higher level of fluoresce surely corresponds to the higher

level of gene expression. Therefore, by treating the values in the microarray data as or-

dinal values, non-calibrated data would be processed. Fujibuchi et al. adopted such use

of orders in searching a gene expression database for similar cell types [2].

And clustering is the task of partitioning a sample set into clusters having the prop-

erties of internal cohesion and external isolation [3]. This method is a basic tool for

exploratory data analysis. Clustering methods for orders are useful for revealing the

group structure of data represented by orders such as those described above.

To cluster a set of orders, classical clustering has been mainly used [4, chapter 2].

In these studies, clustering methods were applied to ordinal data of a social survey,

sensory test, etc. These data sets have been small in size; the number of objects to be

sorted and the length of orders are at most ten, and the number of orders to be clustered

are at most thousands. This is because an scoring method has been used to acquire

responses for a large-scale survey. Responses can easily be collected by requesting for

respondents to mark on rating scales that are printed on paper questionnaire forms.

On the other hand, using printed questionnaire forms is not appropriate for ranking

method, because respondents must rewrite entire response orders when they want to

correct them. Therefore, in a ranking method, respondents generally reply by sorting

real objects. For example, respondents are requested to sort glasses of wine according

to their preference. However, it would be costly to prepare so many glasses. Due to this

reason, ranking method has been used for a small-scale survey, even if its advantage

to an scoring method is known as described above. But now, adoption of computer

interface clear this obstacle in using a ranking method. Respondents can sort virtual

objects instead of real objects. Further, methods to implicitly collect preference orders

have proposed [5, 6]. These technical progress has made it easier to collect the large

number of ordinal data.

We can now collect a large-scale data that consist of orders. However, current tech-

niques for clustering orders are not fully scalable. For example, to cluster a set of orders,

dissimilarities are first calculated for all pairs of orders, and agglomerative hierarchical

clustering techniques are applied. This approach is computationally inefficient, because

computational cost of agglomerative hierarchical clustering is O ( N 2 log( N ))) under

non-Euclidean metric [7], where N is the number of orders to be clustered. To alleviate

this inefficiency in terms of N , we proposed a k -means-type algorithm k -o'means in

our previous work [8]. The computational complexity was reduced to O ( N ) in terms of

the number of orders. Though this method successfully extracted a grouping structure in

a set of orders, it was not efficient yet, if the number of possible objects to be sorted was

large. In this paper, to alleviate this inefficiency, we propose a new method, k -o'means-

EBC. Note that EBC means Expected Borda Count , which is a classic method to

find an order so as to be as concordant as possible with a given set of orders. And

Mining Complex Data

Search WWH ::

Custom Search

Home