the lymphography problem with 18 attributes (three numeric and 15 nominal)
and four different classes (normal, metastases, malign lymph, and fibrosis).
And the last one is the postoperative patient problem with eight nominal
attributes and three different classes.
9.3.1 Diagnosis of Breast Cancer
The breast cancer problem is not a typical DT problem in the sense that all
its attributes are numeric and it has only two classes. But given the tendency
of conventional decision trees with numeric attributes to overfit the data, it
is essential that we test the performance of the EDT-RNC algorithm on a
well-studied real-world problem with only numeric attributes, in order to
learn not only how this algorithm compares with others but also how well
these decision trees generalize.
As you will recall, we solved the breast cancer problem with five different
algorithms (the basic GEA, the GEP-NC and the GEP-RNC algorithms, and
the cellular systems with and without RNCs; see Tables 5.9 and 6.7) and
concluded that all of them perform very efficiently, creating very good
models that generalize extremely well. In this section, we will see that the
EDT-RNC algorithm can also be used to create good DT models for the
breast cancer problem and that these models also generalize extremely well.
So, preparing the data for decision tree induction means that the nine
numeric attributes of the breast cancer problem are now the branching nodes
of the decision trees. As usual, they are represented by capital letters, thus
giving: CLUMP THICKNESS, UNIFORMITY OF CELL SIZE, UNIFORMITY
OF CELL SHAPE, MARGINAL ADHESION, SINGLE EPITHELIAL CELL
SIZE, BARE NUCLEI, BLAND CHROMATIN, NORMAL NUCLEOLI, and
MITOSES, each of which obviously branches into two. For simplicity and
compactness, they will be respectively represented by A = {A, B, C, D, E, F,
G, H, I}, and the two outcomes, benign or malignant, will be respectively
represented by T = {a, b}. In this section, we will be using exactly the same
350 instances that were used for training and the 174 instances that were
used for testing in the previous studies. And given that all the attributes are
normalized between 0 and 1, the random numerical constants will be drawn
from the real interval [0, 1]. The fitness function will consist of the number
of hits and will be evaluated by equation (3.8), giving f_max = 350 on the
training set and f_max = 174 on the testing set.
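
The setup described above can be illustrated with a minimal Python sketch. It
is not the EDT-RNC algorithm itself: it only shows how a decision tree whose
nodes are the attributes A to I, each splitting into two at a constant drawn
from [0, 1], classifies an instance as a or b, and how a number-of-hits fitness
might be counted. We assume here that a hit is simply a correctly classified
instance, so that f_max equals the number of instances. The names Leaf, Node,
classify and hits, the convention that values less than or equal to the
constant follow the first branch, the example tree, and the three toy instances
are all hypothetical and are not taken from the text or from the data set.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

# Attribute symbols A..I, in the order listed in the text
# (A = clump thickness, ..., I = mitoses).
ATTRIBUTES = "ABCDEFGHI"


@dataclass
class Leaf:
    label: str  # 'a' (benign) or 'b' (malignant)


@dataclass
class Node:
    attribute: str   # one of "A".."I"
    threshold: float # numerical constant, assumed to lie in [0, 1]
    low: "Tree"      # branch taken when value <= threshold (assumed convention)
    high: "Tree"     # branch taken when value > threshold


Tree = Union[Leaf, Node]


def classify(tree: Tree, instance: Sequence[float]) -> str:
    """Walk the tree from the root until a leaf is reached."""
    while isinstance(tree, Node):
        value = instance[ATTRIBUTES.index(tree.attribute)]
        tree = tree.low if value <= tree.threshold else tree.high
    return tree.label


def hits(tree: Tree, instances: Sequence[Sequence[float]],
         labels: Sequence[str]) -> int:
    """Number-of-hits fitness: count of correctly classified instances."""
    return sum(classify(tree, x) == y for x, y in zip(instances, labels))


# Hypothetical tree: split on B, then on F.
tree = Node("B", 0.3, Leaf("a"), Node("F", 0.5, Leaf("a"), Leaf("b")))

# Three made-up, normalized instances (NOT from the real data set).
X: List[List[float]] = [
    [0.4, 0.1, 0.1, 0.0, 0.1, 0.0, 0.2, 0.0, 0.0],
    [0.9, 0.8, 0.7, 0.6, 0.5, 1.0, 0.7, 0.8, 0.2],
    [0.5, 0.4, 0.4, 0.1, 0.2, 0.2, 0.3, 0.1, 0.0],
]
y = ["a", "b", "b"]

fitness = hits(tree, X, y)  # the third instance is misclassified
f_max = len(X)              # maximum fitness equals the number of instances
print(fitness, f_max)
```

With the real data, the same counting would give f_max = 350 on the training
set and f_max = 174 on the testing set, since a perfect tree would classify
every instance correctly.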