Biology Reference
In-Depth Information
constraint for cluster size can unnecessarily increase the problem size. Hence, we
can further tighten the constraint for cluster size as a plausible strategy for further
expediting the clustering of even larger data sets.
Indexing the new cluster size ranges by l, for l = 1,.,d, we introduce a new
binary variable w jl , which equals one if cluster j belongs to size class l, and zero
if otherwise. The additional constraints are then formulated as follows:
d
l =1 w jl =1 ,
j
(16.9)
c j =1 w jl
n c
n c + ,
l
(16.10)
l =1 d l,min i =1 w ij l =1 d l,max ,
j (16.11)
The first set of constraints allows each cluster into only one size class. The
second constraint restricts the number of clusters allowable in each class. The
parameter ε is judiciously picked to allow a reasonable range over the number of
clusters in each class, for instance 10%. Finally, the third constraint bounds the
size of the clusters allowed in each class. These additional constraints also involve
the variable w ij but not z jk ; hence they are all included into the master problem.
16.4. Conclusion
In our study, we propose a novel clustering algorithm (EP GOS Clust) based on
a Mixed-Integer Nonlinear Programming (MINLP) formulation. We test our pro-
posed algorithm on a substantially large dataset of gene expression patterns from
the yeast Saccharomyces Cerevisiae, and show that our method compares favor-
ably (if not outperforms) with other clustering methods in identifying data points
that are the most similar to one another as well as identifying clusters that are the
most dissimilar to one another. We also show that the EP GOS Clust is capable
of uncovering tightly-correlated clusters. Given the nature of the test datasets, we
too show that the EP GOS Clust does well in uncovering clusters with good bi-
ological coherence. In addition, we demonstrate the utility of the pre-clustering
procedure and a methodology that works in concert with the algorithm itself to
predict the optimal number of clusters. For consistency, we repeated our study
on other DNA microarray datasets based on the various glucose signaling path-
ways in the yeast Saccharomyces Cerevisiae (other results not reported here) and
obtained similar result trends.
16.5. Computational Resources
All optimization formulations are written in GAMS (General Algebraic Modeling
System) [7] and solved using the commercial solver CPLEX 8.0.
GAMS is a
Search WWH ::




Custom Search