Biology Reference
In-Depth Information
a seven-gene subset to classify the samples in the test dataset, the results
were very encouraging. All 17 samples were correctly classified
(data not shown here), which indicates that selecting the most frequently
appearing genes to form a subset for classification may offer significant
advantages, although these genes may show more biological significance.
The more important point is that the size of the gene subset can be
reduced greatly, from around 100 genes to around 10, by SDL.
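The frequency-based selection described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes the class predictor can be represented as a list of rules, each a list of gene IDs. The function names and the toy rules are hypothetical.

```python
from collections import Counter


def rank_genes_by_frequency(rules):
    """Count how many rules each gene appears in.

    Returns (gene_id, count) pairs sorted by descending count,
    with ties broken by ascending gene ID.
    """
    # set(rule) ensures a gene is counted once per rule
    counts = Counter(gene for rule in rules for gene in set(rule))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def top_k_subset(rules, k):
    """Return the k most frequently appearing gene IDs."""
    return [gene for gene, _ in rank_genes_by_frequency(rules)[:k]]


# Hypothetical toy predictor: each inner list is one rule's gene IDs.
# These rules are made up for illustration and are not data from the study.
toy_rules = [[249, 164], [249, 12], [164, 249, 7], [249], [12, 164]]

ranking = rank_genes_by_frequency(toy_rules)
subset = top_k_subset(toy_rules, 2)
print(ranking)
print(subset)
```

The reduced subset would then be passed to whatever classifier is being validated; how a rule votes on a test sample is left out here because the passage does not specify it.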
For comparison with the GA results (Li et al.,
2001a), further validation experiments were carried out. There are two
genes appearing most frequently in the colon class predictor: gene 249
(10 times) and gene 164 (8 times). A subset with only these two genes is
able to classify 16 out of 17 samples in the colon cancer test dataset, while
one sample remains unclassified. Although gene 164 (X57351) was
included in the 50 genes identified by a GA class predictor, the most
frequently selected gene 249 (M63391) in this research was not captured
before. The most frequently selected gene by GA was the human monocyte-
derived neutrophil-activating protein (MONAP). Previous studies have
demonstrated that the expression level of the MONAP gene, whose gene
ID is 1671 in this research, directly correlates with the progression of several
human cancers (Shi et al., 1999). Unfortunately, gene 1671 was
missed completely by the SDL method, which might reflect
fundamental differences between GA and SDL in how they
sample the search space.
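The overlap between the two methods' selections can be tabulated with simple set operations. The sketch below uses only the gene IDs named in this passage; the GA list is deliberately partial (the full 50-gene list is not reproduced here), and the function name is hypothetical.

```python
def compare_selections(selected_a, selected_b):
    """Split two gene selections into shared and method-specific IDs."""
    a, b = set(selected_a), set(selected_b)
    return {
        "shared": sorted(a & b),   # picked by both methods
        "only_a": sorted(a - b),   # picked only by the first method
        "only_b": sorted(b - a),   # picked only by the second method
    }


# Two most frequent genes in the SDL colon class predictor (from the text).
sdl_top = [249, 164]
# Assumption: only the GA-selected genes mentioned in the text,
# not the complete 50-gene GA list.
ga_mentioned = [164, 1671]

result = compare_selections(sdl_top, ga_mentioned)
print(result)
```

With the full gene lists from both methods, the same function would show how much of each selection is method-specific.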
For the leukemia data, 219 (shown in Table 5.4) out of 7129 genes in
the dataset were selected by SDL for constructing the class predictor.
Table 5.13 lists the genes appearing more than once in the leukemia
class predictor based on frequency rank. It is worthwhile to note that
gene 2642 (U05259_rna1) and gene 4050 (X03934) both appear 16
times. A subset with only these two genes is able to classify 31 out of
34 samples in the leukemia test dataset (three samples remain unclassified).
When a subset of the top four genes (2642, 4050, 2020, and 1) is
used, 32 out of 34 samples can be predicted correctly, with two remaining
unclassified. There are four genes (2642, 2020, 2348, and 3056) in
Table 5.13 that were identified by previous researchers as among the 50
genes most highly correlated with the ALL-AML class distinction
(Golub et al., 1999); with the method used in this study, 29 out of