and assumed that the relevant genes were similarly and uniformly
expressed among samples of each type. A multivariate approach that compares
samples in a multi-gene space using genetic algorithms (GAs)
was proposed (Li et al., 2001a). Samples were classified based on the
class membership of their k-nearest neighbors (KNN) in the gene space.
The dimensionality (length) of the gene subset was arbitrarily set to
50. GAs were used to select hundreds and thousands of subsets of
50 genes that could potentially discriminate between two classes of
samples (tumor and normal tissues). The frequency with which each gene
was selected across this large collection of 50-gene subsets was then
analyzed statistically. The 50 most frequently selected genes were
used to predict the classes of 34 new samples. Although the performance of the GA
predictor with 50 genes was remarkable, only 29 of the 34 test samples were
correctly predicted with high confidence (Li et al., 2001b). To improve
the success rate of classification, more reliable and accurate algorithms
are needed.
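The two building blocks described above, scoring a gene subset by KNN classification and counting how often each gene is selected, can be sketched as follows. This is a minimal illustration and not the published GA/KNN procedure: the Euclidean distance, the majority vote, the leave-one-out scoring, and all function names are assumptions made for the example, and the GA search that proposes the subsets is omitted.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, genes, k=3):
    """Predict the class of `query` by majority vote among its k nearest
    training samples, measuring distance only over the chosen genes."""
    dists = []
    for row, label in zip(train_X, train_y):
        d = math.sqrt(sum((row[g] - query[g]) ** 2 for g in genes))
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of KNN restricted to `genes`.
    A search procedure could use this as the score of a gene subset."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], genes, k) == y[i]:
            correct += 1
    return correct / len(X)


def gene_frequencies(subsets):
    """Count how often each gene index appears across the selected subsets."""
    return Counter(g for subset in subsets for g in subset)


# Toy data (invented for illustration): gene 0 separates the classes,
# gene 1 does not.
toy_X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.2], [1.1, 0.9]]
toy_y = ["normal", "normal", "tumor", "tumor"]

print(loo_accuracy(toy_X, toy_y, genes=[0], k=1))
print(loo_accuracy(toy_X, toy_y, genes=[1], k=1))
print(gene_frequencies([[0, 3, 7], [0, 2, 7], [0, 1, 5]]).most_common(2))
```

In a full pipeline, the subsets passed to gene_frequencies would be the high-scoring subsets returned by the search, and the most frequent genes would form the final predictor.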
Many machine learning and data mining technologies have recently
been introduced into the field of microarray data analysis to process many
subsets of genes simultaneously (Anderle et al., 2003; Brown et al., 2000;
Ooi and Tan, 2003; Wren et al., 2004). However, no approach can feasibly
evaluate all possible subsets of genes in a
dataset consisting of several thousand genes. Even with a moderate
number of gene elements in a gene subset and a small number of choices
for each gene element, the number of possible gene combinations for the
gene subset increases rapidly. The true magnitude of the problem can be
seen by considering a scanning approach, which measures the objective
function value for every possible combination of genes. For example,
consider scanning for a 10-gene subset in the colon data with 2000
genes (2000 gene expression measurements per sample); the total number
of possible ordered selections is more than 10^30 (2000! divided
by 1990!, roughly 10^33; about 2.8 x 10^26 if gene order is ignored),
which would take years for even a supercomputer to complete.
Efficient algorithms are needed to sample from fewer subsets to find the
best-performing subsets (optimal or near-optimal solutions). The
problem is therefore one of optimization, and more specifically of
global optimization. In order to solve “hard” problems such as gene
selection, classification, and clustering, suitable optimization
algorithms must be used.
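The size of the exhaustive search in the colon-data example can be checked directly. The sketch below counts 10-gene selections from 2000 genes both with and without regard to order. The function name and the evaluation rate are hypothetical, chosen only to give a sense of scale.

```python
import math


def count_subsets(n_genes, subset_size):
    """Return (ordered, unordered) counts of gene selections."""
    ordered = math.perm(n_genes, subset_size)    # n! / (n - r)!
    unordered = math.comb(n_genes, subset_size)  # n! / (r! * (n - r)!)
    return ordered, unordered


ordered, unordered = count_subsets(2000, 10)
print(f"ordered selections: {ordered:.3e}")
print(f"unordered subsets:  {unordered:.3e}")

# Assumed throughput: one billion subset evaluations per second.
# This rate is an illustrative assumption, not a measured figure.
evals_per_second = 1e9
seconds_per_year = 365 * 24 * 3600
years = unordered / evals_per_second / seconds_per_year
print(f"years to scan every unordered subset: {years:.3e}")
```

Even when order is ignored, the count stays far beyond what exhaustive scanning can cover, which is why the search has to sample a small fraction of the possible subsets.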