Decision Tree Induction - Gene Expression Programming

the lymphography problem with 18 attributes (three numeric and 15 nominal) and four different classes (normal, metastases, malign lymph, and fibrosis). And the last one is the postoperative patient problem with eight nominal attributes and three different classes.

9.3.1 Diagnosis of Breast Cancer

The breast cancer problem is not a typical DT problem in the sense that all its attributes are numeric and it has only two classes. But given the tendency of conventional decision trees with numeric attributes to overfit the data, it becomes mandatory that we test the performance of the EDT-RNC algorithm on a well-studied real-world problem with just numeric attributes, in order to know not only how this algorithm compares with others but also how well these decision trees generalize.

As you will recall, we solved the breast cancer problem with five different algorithms (the basic GEA, the GEP-NC and the GEP-RNC algorithms, and the cellular systems with and without RNCs; see Tables 5.9 and 6.7) and concluded that all of them perform very efficiently, creating very good models that generalize extremely well. In this section, we will see that the EDT-RNC algorithm can also be used to create good DT models for the breast cancer problem and that these models also generalize extremely well.

So, preparing the data for decision tree induction means that the nine numeric attributes of the breast cancer problem are now the branching nodes of the decision trees. As usual, they are represented by capital letters, thus giving: CLUMP THICKNESS, UNIFORMITY OF CELL SIZE, UNIFORMITY OF CELL SHAPE, MARGINAL ADHESION, SINGLE EPITHELIAL CELL SIZE, BARE NUCLEI, BLAND CHROMATIN, NORMAL NUCLEOLI, and MITOSES, each of which obviously branches into two. For simplicity and compactness, they will be respectively represented by A = {A, B, C, D, E, F, G, H, I}, and the two outcomes, benign or malignant, will be respectively represented by T = {a, b}. In this section, we will be using exactly the same 350 instances that were used for training and the same 174 instances that were used for testing in the previous studies. And given that all the attributes are normalized between 0 and 1, the random numerical constants will be drawn from the real interval [0, 1]. The fitness function will consist of the number of hits and will be evaluated by equation (3.8), giving f_max = 350 on the training set and f_max = 174 on the testing set.
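To make this setup concrete, here is a minimal sketch of how a tree over the attributes A to I could be scored by its number of hits. It rests on several assumptions that are not taken from the text: each branching node compares its attribute with a threshold that stands in for a random numerical constant in [0, 1], values at or below the threshold follow the first branch, and the fitness is a plain count of correct classifications (equation (3.8) is not reproduced in this section and may differ in detail). The names Leaf, Split, classify, and count_hits are hypothetical, and the toy instances are made up for illustration rather than drawn from the breast cancer data.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Symbols from the text: attributes A..I, classes a (benign) and b (malignant).
ATTRIBUTES = "ABCDEFGHI"
CLASSES = ("a", "b")


@dataclass(frozen=True)
class Leaf:
    label: str  # one of CLASSES


@dataclass(frozen=True)
class Split:
    attribute: str    # one of ATTRIBUTES
    threshold: float  # assumed to play the role of a constant drawn from [0, 1]
    low: "Node"       # branch taken when value <= threshold (assumed convention)
    high: "Node"      # branch taken when value > threshold


Node = Union[Leaf, Split]


def classify(tree: Node, instance: Sequence[float]) -> str:
    """Walk from the root to a leaf and return the predicted class."""
    node = tree
    while isinstance(node, Split):
        value = instance[ATTRIBUTES.index(node.attribute)]
        node = node.low if value <= node.threshold else node.high
    return node.label


def count_hits(tree: Node, instances, labels) -> int:
    """Number of instances whose predicted class matches the target class."""
    return sum(classify(tree, x) == y for x, y in zip(instances, labels))


# Hypothetical tree: split on B first, then on F.
tree = Split("B", 0.3, Leaf("a"), Split("F", 0.5, Leaf("a"), Leaf("b")))

# Made-up instances, already normalized to [0, 1]; columns follow A..I.
toy_instances = [
    [0.4, 0.1, 0.1, 0.0, 0.1, 0.0, 0.2, 0.0, 0.0],
    [0.9, 0.8, 0.7, 0.6, 0.5, 1.0, 0.7, 0.8, 0.1],
    [0.5, 0.6, 0.4, 0.2, 0.3, 0.2, 0.3, 0.1, 0.0],
]
toy_labels = ["a", "b", "b"]

hits = count_hits(tree, toy_instances, toy_labels)
max_hits = len(toy_labels)  # the maximum fitness equals the number of instances
print(f"hits = {hits} out of {max_hits}")
```

With a count of this kind, the maximum fitness is simply the number of instances in the set being scored, which is why the training and testing sets have different maximum values.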
