classify the samples. Sample $j$ is assigned to class $r$ if $c_{jr} = \max_{\xi}\{c_{j\xi}\}$, i.e.,
$$\hat{S}_r = \{\, a_j : c_{jr} > c_{j\xi},\ \xi \neq r \,\}. \qquad (13.4)$$
As before, the obtained classification $\hat{S}_r$ does not necessarily coincide with the classification $S_r$.
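For concreteness, the following sketch implements the assignment rule (13.4), assuming the usual setup in which the data matrix $A$ has features as rows and samples as columns and $c_{j\xi}$ is the average value of sample $j$ over the features placed in class $F_\xi$ (as in (13.2)); the function and variable names below are illustrative only.

```python
import numpy as np

def classify_samples(A, feature_classes, k):
    """Assign each sample (a column of A) to the class whose feature-block
    average is strictly largest, mirroring relation (13.4).

    A               : m x n data matrix (rows = features, columns = samples)
    feature_classes : length-m integer labels; feature_classes[i] is the class of feature i
    k               : number of classes
    Returns a length-n label array; -1 marks samples with no strict maximum.
    """
    feature_classes = np.asarray(feature_classes)
    m, n = A.shape
    # c[j, r] = average value of sample j over the features currently in class r
    c = np.stack([A[feature_classes == r, :].mean(axis=0) for r in range(k)], axis=1)

    labels = np.full(n, -1)
    for j in range(n):
        r = int(np.argmax(c[j]))
        # (13.4) requires a strict maximum: c_{jr} > c_{j xi} for every xi != r
        if all(c[j, r] > c[j, xi] for xi in range(k) if xi != r):
            labels[j] = r
    return labels
```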
Biclustering $\mathcal{B}$ is referred to as a consistent biclustering if relations (13.3) and (13.4) hold for all elements of the corresponding classes, where the matrices $C_S$ and $C_F$ are defined according to (13.1) and (13.2), respectively.
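As a sketch of how this definition can be checked in practice (again assuming the centroid conventions above; the helper below is illustrative and not part of the original formulation), a biclustering is consistent exactly when re-deriving the classes from the two centroid matrices reproduces the supplied classifications with strict inequalities:

```python
import numpy as np

def is_consistent(A, feature_classes, sample_classes, k):
    """Return True if the given biclustering is consistent: relations (13.3)
    and (13.4) must hold, with strict inequalities, for every feature and
    every sample under the supplied classifications.

    A               : m x n data matrix (rows = features, columns = samples)
    feature_classes : length-m integer labels in {0, ..., k-1}
    sample_classes  : length-n integer labels in {0, ..., k-1}
    """
    feature_classes = np.asarray(feature_classes)
    sample_classes = np.asarray(sample_classes)
    m, n = A.shape
    # Sample-side centroids: average of each sample over the features of each class.
    c_s = np.stack([A[feature_classes == r, :].mean(axis=0) for r in range(k)], axis=1)  # n x k
    # Feature-side centroids: average of each feature over the samples of each class.
    c_f = np.stack([A[:, sample_classes == r].mean(axis=1) for r in range(k)], axis=1)   # m x k

    samples_ok = all(
        all(c_s[j, sample_classes[j]] > c_s[j, xi]
            for xi in range(k) if xi != sample_classes[j])
        for j in range(n))
    features_ok = all(
        all(c_f[i, feature_classes[i]] > c_f[i, xi]
            for xi in range(k) if xi != feature_classes[i])
        for i in range(m))
    return samples_ok and features_ok
```

Checking both directions is what distinguishes a consistent biclustering from a one-sided nearest-centroid classification of samples or of features alone.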
A data set is biclustering-admitting if some consistent biclustering for it exists. Furthermore, the data set is called conditionally biclustering-admitting with respect to a given (partial) classification of some samples and/or features if there exists a consistent biclustering preserving the given (partial) classification.
Theorem 13.1. Let $\mathcal{B}$ be a consistent biclustering. Then there exist convex cones $P_1, P_2, \ldots, P_k \subseteq \mathbb{R}^m$ such that only samples from $S_r$ belong to the corresponding cone $P_r$, $r = 1, \ldots, k$. Similarly, there exist convex cones $Q_1, Q_2, \ldots, Q_k \subseteq \mathbb{R}^n$ such that only features from class $F_r$ belong to the corresponding cone $Q_r$, $r = 1, \ldots, k$.
See [3] for the proof of Theorem 13.1. It also follows from the proven conic
separability that convex hulls of classes do not intersect.
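For intuition, one natural choice of cones matching the statement (a sketch only; see [3] for the construction actually used in the proof) comes directly from relation (13.4). Writing $f_{i\xi}$ for the 0-1 indicator that feature $i$ belongs to class $F_\xi$ (the feature classification underlying $C_F$), set
$$P_r = \left\{ x \in \mathbb{R}^m : \frac{\sum_{i} f_{ir}\, x_i}{\sum_{i} f_{ir}} > \frac{\sum_{i} f_{i\xi}\, x_i}{\sum_{i} f_{i\xi}}\,,\ \ \forall \xi \neq r \right\}.$$
Each $P_r$ is cut out by homogeneous linear inequalities and is therefore a convex cone; for a consistent biclustering, relation (13.4) states precisely that every sample $a_j$ with $j \in S_r$ satisfies these inequalities, while samples of the other classes do not. The cones $Q_r$ for the features are obtained symmetrically from (13.3).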
By definition, a biclustering is consistent if $\hat{F}_r = F_r$ and $\hat{S}_r = S_r$. However,
a given data set might not have these properties. The features and/or samples in
the data set might not clearly belong to any of the classes and hence a consistent
biclustering might not be constructed. In such cases, one can remove a set of
features and/or samples from the data set so that there is a consistent biclustering
for the truncated data. Selection of a representative set of features that satisfies
certain properties is a widely used technique in data mining applications. This
feature selection process may incorporate various objective functions depending
on the desirable properties of the selected features, but one general choice is to
select the maximal possible number of features in order to lose as little of the information provided by the training set as possible.
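For illustration only, the following is a minimal sketch of such a truncation step in the same setup as above. It is a simple greedy heuristic, not the feature-selection formulation studied in [3]: features are discarded in order of increasing margin under relation (13.3) until relation (13.4) holds for every sample on the surviving features (removing features leaves the feature-side centroids of the remaining features unchanged, so only the sample-side relation needs to be re-checked).

```python
import numpy as np

def greedy_feature_selection(A, feature_classes, sample_classes, k):
    """Greedy heuristic: drop the weakest features (smallest margin under
    (13.3)) one at a time until relation (13.4) holds for every sample on
    the retained features. Returns the indices of the retained features,
    or an empty array if no consistent truncation is reached this way."""
    feature_classes = np.asarray(feature_classes)
    sample_classes = np.asarray(sample_classes)
    m, n = A.shape

    # Margin of feature i under (13.3): own-class centroid minus best competing class.
    c_f = np.stack([A[:, sample_classes == r].mean(axis=1) for r in range(k)], axis=1)
    own = c_f[np.arange(m), feature_classes]
    rivals = np.where(np.eye(k, dtype=bool)[feature_classes], -np.inf, c_f).max(axis=1)
    order = np.argsort(own - rivals)        # smallest-margin features are dropped first

    def samples_consistent(mask):
        # Every feature class must retain at least one feature to form centroids.
        if any(not np.any(mask & (feature_classes == r)) for r in range(k)):
            return False
        c_s = np.stack([A[mask & (feature_classes == r), :].mean(axis=0)
                        for r in range(k)], axis=1)
        return all(c_s[j, sample_classes[j]] > np.max(np.delete(c_s[j], sample_classes[j]))
                   for j in range(n))

    keep = np.ones(m, dtype=bool)
    for i in order:
        if samples_consistent(keep):
            break
        keep[i] = False
    return np.flatnonzero(keep) if samples_consistent(keep) else np.array([], dtype=int)
```

An exact treatment would instead maximize the number of retained features subject to the consistency constraints; the greedy pass above merely illustrates the idea of truncating the data.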
A problem with selecting the most representative features is the following.
Assume that there is a consistent biclustering for a given data set, and there is a
feature, $i$, such that the difference between the two largest values of $c_{ir}$ is negligible, i.e.,
$$\min_{\xi \neq r} \{\, c_{ir} - c_{i\xi} \,\} \leq \alpha,$$
where $\alpha$ is a small positive number. Although this particular feature is classified as a member of class $r$ (i.e., $a_i \in \hat{F}_r$), the corresponding relation (13.3) can be