Signature Selection for Grouped Features with a Case Study on Exon Microarrays - Feature Selection for Data and Pattern Recognition - page 334


Fig. 14.3 Overall prediction performance of three feature selection methods: LASSO, GL, and SGL. Left: prediction performance in AUC score on test sets over 20 random subsampling trials (train:test = 70:30%). Right: the corresponding number of selected features.

14.3.3.2 Probabilistic Prediction

In logistic regression, the probability that an example $x_i$ will have the label "1" is modeled by the logistic function

$$P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\{-(\beta^T x_i + \beta_0)\}} \in [0, 1],$$

where $\beta$ and $\beta_0$ are coefficients estimated during training. Note that this function always returns a value between zero and one. That is, given $\beta$ and $\beta_0$, logistic regression provides each test point with a probabilistic outcome in addition to a binary prediction. This distinguishes logistic regression from other classification methods such as support vector machines [3, 21].
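As a minimal sketch of the logistic function above, the following Python snippet computes the probability of label "1" for a single example and derives the binary prediction from it. The function name `predict_proba`, the coefficient values, and the feature vector are illustrative assumptions, not values from the case study.

```python
import math

def predict_proba(x, beta, beta0):
    """Return P(y = 1 | x) under a logistic regression model."""
    # Linear score beta^T x + beta_0.
    score = sum(b * v for b, v in zip(beta, x)) + beta0
    # Evaluate the logistic function in a numerically stable way:
    # never call exp() on a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

# Hypothetical coefficients and feature vector, for illustration only.
beta = [0.8, -1.2, 0.5]
beta0 = 0.1
x = [1.0, 0.5, 2.0]

p = predict_proba(x, beta, beta0)   # probabilistic outcome
label = 1 if p >= 0.5 else 0        # binary prediction at the 0.5 threshold
```

The same probability `p` therefore yields both outputs mentioned in the text: the probabilistic outcome itself and, after thresholding, the binary prediction.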

For two logistic regression classifiers with similar binary prediction performance (for example, in terms of AUC scores), the method that assigns higher probability to its correct predictions would arguably be preferred in practice, since it provides higher confidence in its predictions.
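One way to make this comparison concrete is to average the probability that each classifier assigns to the true label over its correctly classified test examples. The sketch below does this in Python; the helper names, the 0.5 threshold default, and all labels and probabilities are made up for illustration and are not taken from the experiments.

```python
def prob_of_true_label(p1, y):
    """Probability the model assigned to the true label y (0 or 1),
    given p1 = P(y = 1)."""
    return p1 if y == 1 else 1.0 - p1

def mean_confidence_on_correct(probs, labels, threshold=0.5):
    """Average probability given to the true label,
    restricted to correctly classified examples."""
    correct = [
        prob_of_true_label(p, y)
        for p, y in zip(probs, labels)
        if (1 if p >= threshold else 0) == y
    ]
    return sum(correct) / len(correct) if correct else float("nan")

# Made-up true labels and P(y = 1) outputs of two hypothetical classifiers.
labels  = [1, 0, 1, 0]
probs_a = [0.9, 0.2, 0.8, 0.6]
probs_b = [0.6, 0.4, 0.7, 0.6]

conf_a = mean_confidence_on_correct(probs_a, labels)
conf_b = mean_confidence_on_correct(probs_b, labels)
```

In this toy data both classifiers make the same binary predictions (three correct, one wrong), yet `conf_a` exceeds `conf_b`, which is the situation in which the first classifier would be preferred.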

Figure 14.4 compares such probability values for the three feature selection methods LASSO, GL, and SGL. The x-axis shows the indices of test examples (in a test set created by random subsampling), while the y-axis shows the probability values we discussed above. The circles show the true labels, 0 or 1. The probability outcomes from each algorithm are connected by lines only for visual distinction, without any other implication. The decision probability ($P(y = 1) = 0.5$) is shown as a horizontal line.

As we can see, GL and SGL provided higher probability outcomes for correct labels ($P(y = 1)$ or $P(y = 0) = 1 - P(y = 1)$), at least for this particular test set. The characteristics of GL and SGL were similar: on the 11th example GL provided a slightly higher probability than SGL, and both misclassified the 16th, 22nd and 23rd examples, which were classified correctly by LASSO.
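The per-example comparison described above can be sketched as follows: threshold each method's probability at 0.5, collect the indices of misclassified test examples, and take set differences between methods. The function name `misclassified` and every number below are hypothetical; they do not reproduce the data behind Fig. 14.4.

```python
def misclassified(probs, labels, threshold=0.5):
    """Return the 1-based indices of examples whose thresholded
    prediction differs from the true label."""
    return {
        i
        for i, (p, y) in enumerate(zip(probs, labels), start=1)
        if (1 if p >= threshold else 0) != y
    }

# Made-up true labels and P(y = 1) outputs of two hypothetical methods.
labels      = [1, 0, 1, 0, 1]
probs_lasso = [0.60, 0.40, 0.55, 0.45, 0.60]
probs_gl    = [0.90, 0.10, 0.30, 0.20, 0.80]

wrong_lasso = misclassified(probs_lasso, labels)
wrong_gl = misclassified(probs_gl, labels)

# Examples that the second method gets wrong but the first gets right.
only_gl_wrong = wrong_gl - wrong_lasso
```

In this toy example the second method is more confident on the examples it classifies correctly but misclassifies one example that the first method handles, mirroring the kind of trade-off discussed in the text.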
