GENE SELECTION STRATEGIES IN MICROARRAY EXPRESSION DATA: APPLICATIONS TO CASE-CONTROL STUDIES - Complex Systems Science in Biomedicine


Figure 4. Schematic of the way support vector machines operate. (a) A set of points belonging to two classes (cases or controls) is used to create a decision boundary (the optimal separating hyperplane, diagonal line) that optimally separates the two classes. (b) A validation set of previously unseen examples is classified on the basis of the decision boundary calculated using the training set. The points that fall below the optimal hyperplane are deemed to be controls, and those that fall above are deemed to be cases.

have a decision rule, which simply states that new points that fall in the region where the cases (respectively, controls) fell in the training set will be deemed to belong to the class of cases (respectively, controls). This is illustrated in Figure 4b, where an independent validation data set is plotted. If we use the optimal hyperplane as the decision boundary, we count two cases that fall on the control side and one control that falls on the case side. In the example of Figure 4b, we therefore have a false positive (FP) count of one (one control deemed to be a case) and a false negative (FN) count of two (two cases deemed to be controls). Similarly, the number of true positives (TP) is 10 (i.e., ten cases deemed to be cases), and the number of true negatives (TN) is 12 (i.e., twelve controls deemed to be controls).
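The counting described above can be sketched in a few lines of Python. This is only an illustration: the function names and the label strings "case" and "control" are assumptions made here, and the label lists are constructed to reproduce the counts quoted for Figure 4b rather than taken from any data set. The sensitivity, specificity, and accuracy it prints are standard summary rates computed from those counts; they are not figures reported in the text.

```python
def confusion_counts(true_labels, predicted_labels, positive="case"):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1  # case deemed to be a case
            else:
                fp += 1  # control deemed to be a case
        else:
            if truth == positive:
                fn += 1  # case deemed to be a control
            else:
                tn += 1  # control deemed to be a control
    return tp, fp, fn, tn


def summary_rates(tp, fp, fn, tn):
    """Standard rates derived from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of cases called cases
        "specificity": tn / (tn + fp),  # fraction of controls called controls
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Illustrative labels built to match the Figure 4b counts:
# 12 true cases (10 called cases, 2 called controls) and
# 13 true controls (1 called a case, 12 called controls).
true_labels = ["case"] * 12 + ["control"] * 13
predicted_labels = (["case"] * 10 + ["control"] * 2
                    + ["case"] * 1 + ["control"] * 12)

counts = confusion_counts(true_labels, predicted_labels)
rates = summary_rates(*counts)
print("TP, FP, FN, TN =", counts)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Keeping the four counts separate from the derived rates makes it easy to check the tally against the figure before summarizing it.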

We created a decision rule using the Mayo data (similar to Figure 4a, but in 81 dimensions), and then applied it to the CU (Columbia) data as a validation set (as schematized in Figure 4b). When trained on the Mayo data, the support vector machine attempts to find an 80-dimensional hyperplane that divides the 81-dimensional space into two sides, leaving all the case points on one side of the plane and all the control points on the other. The genes selected for the Mayo data allowed us to find an optimal hyperplane that perfectly separates the Mayo data; that is, the Mayo data are linearly separable. When we apply the decision boundary learned from the Mayo data to the Columbia data, we find that all the Columbia subjects are perfectly classified: all the CLL patients fall on the same side of the plane as the CLL patients in the Mayo data, and all the control subjects fall on the other side. (This classification task was performed within the environment of the Genes@Work software.) This perfect classification indicates that the group of genes selected by our gene selection algorithms contains enough information to determine the health status of a previously unseen patient.
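The train-then-validate procedure can be illustrated with a toy sketch. It is not the analysis performed in Genes@Work: for self-containment it swaps the support vector machine for a simple perceptron, which finds some separating hyperplane when the training data are linearly separable but not the maximum-margin (optimal) one. The two-dimensional points below are invented for illustration and stand in for the 81-dimensional expression profiles; all function and variable names are hypothetical.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Find a hyperplane w.x + b = 0 separating labels +1 (case) and -1 (control).

    Stand-in for an SVM: returns *a* separating hyperplane, not the
    maximum-margin one. Raises if none is found within max_epochs.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of (or on) the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is correctly separated
            return w, b
    raise ValueError("no separating hyperplane found; "
                     "data may not be linearly separable")


def classify(w, b, x):
    """Apply the decision rule: +1 (case) above the hyperplane, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Invented training cohort (plays the role of the Mayo data).
train_points = [(2.0, 3.0), (3.0, 4.0), (2.5, 3.5),
                (3.0, 1.0), (4.0, 2.0), (3.5, 1.5)]
train_labels = [1, 1, 1, -1, -1, -1]

# Invented, previously unseen cohort (plays the role of the Columbia data).
valid_points = [(1.0, 2.5), (2.0, 4.0), (4.0, 1.0), (5.0, 2.5)]
valid_labels = [1, 1, -1, -1]

w, b = train_perceptron(train_points, train_labels)
valid_predictions = [classify(w, b, x) for x in valid_points]
n_correct = sum(p == y for p, y in zip(valid_predictions, valid_labels))
print(f"validation subjects correctly classified: {n_correct}/{len(valid_labels)}")
```

The validation cohort plays no part in fitting the hyperplane; it is only scored against the boundary learned from the training cohort, which is what makes the classification result a test of the selected features.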
