task. The wrapper method [6] usually consists of a search procedure and an
evaluation criterion. However, an exhaustive search over all subsets of a
high-dimensional feature space is computationally prohibitive. Unlike the filter
and wrapper approaches, which separate variable selection from training, embedded
methods such as boosting [7] incorporate feature selection into the construction
of the classifier or regression model.
Recently, some new embedded learning approaches have been proposed to
achieve grouping effect by introducing the penalty term into the cost function.
The most popular algorithm is the Elastic Net [8], which can encourage a group-
ing effect, i.e., the system variables (genes/regressors) can be naturally grouped
together according to regulatory pathways where the within-group correlations
are very high. In fact, the Elastic Net is a forward selection method: all
previously selected regressors remain fixed, and insignificant regressors cannot
be removed from the model later. As a result, the Elastic Net may miss a good
model. Further effort is therefore still needed to improve prediction
performance and to reduce model size.
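The grouping effect of the Elastic Net described above can be illustrated with a minimal sketch using scikit-learn's `ElasticNet` estimator; the data, the regularization parameters `alpha` and `l1_ratio`, and the noise levels below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 100

# Two highly correlated predictors (one "group") plus an irrelevant one.
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

# The combined L1/L2 penalty encourages correlated predictors to enter
# the model together with similar coefficients (the grouping effect).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```

The first two coefficients come out nearly equal, while the irrelevant predictor is shrunk toward zero; a pure Lasso, by contrast, tends to pick only one member of a correlated group.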
In this paper, an efficient two-stage gene selection (TSGS) method is proposed,
which addresses the small-sample and variable-correlation problems. The paper
is organized as follows. Section 2 gives some preliminaries. Section 3 presents the
proposed two-stage gene selection method. Simulation results are presented in
Section 4, followed by concluding remarks in Section 5.
2 Preliminaries
Suppose a set of data samples, denoted as $D_N = \{(x_i, y_i),\ i = 1, \ldots, N\}$,
where $x_i = [x_{i1}, \ldots, x_{iM}]^T$ is the input vector and $y_i$ is the output.
Let $y = [y(1), \ldots, y(N)]^T$ be the response and
$X = [x_1; x_2; \ldots; x_N] = [x^{(1)}, x^{(2)}, \ldots, x^{(M)}]$ be the model
matrix, where $x^{(j)} = [x_{1j}, \ldots, x_{Nj}]^T$, $j = 1, \ldots, M$, represent
the predictors (i.e., genes or candidate regressors). After a location and scale
transformation procedure, the input and response variables are centred (mean = 0)
to remove the intercept and the input vectors are standardized, i.e.,

$$\sum_{i=1}^{N} y_i = 0, \qquad \sum_{i=1}^{N} x_{ij} = 0, \qquad \sum_{i=1}^{N} x_{ij}^2 = 1, \quad j = 1, \ldots, M. \tag{1}$$
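The centring and standardization in Eq. (1) can be sketched as follows; the random data and dimensions are placeholders for an actual expression matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 5
X = rng.normal(loc=3.0, scale=2.0, size=(N, M))  # raw predictors (genes)
y = rng.normal(loc=1.0, size=N)                  # raw response

y_c = y - y.mean()                        # centre the response: sum(y_i) = 0
Xc = X - X.mean(axis=0)                   # centre each predictor: sum(x_ij) = 0
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))  # scale so sum(x_ij^2) = 1 per column
```

After this transformation the intercept can be dropped from the model, since every column of `Xs` and the response `y_c` have zero mean.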
If these data are microarray gene expression data, then $x^{(j)}$ represents the
$j$th gene, $x_i$ represents the expression levels of the $M$ genes of the $i$th
tissue sample, and $y_i$ may represent the tumor type. The gene selection problem
can be approximated by a linear-in-the-parameters model of the form
$$y = X\beta + \Xi, \tag{2}$$

where $\beta = [\beta_1, \ldots, \beta_M]^T \in \mathbb{R}^M$ is the vector of
estimated coefficients and $\Xi = [\varepsilon(1), \ldots, \varepsilon(N)]^T \in \mathbb{R}^N$
is the residual vector.
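For reference, fitting the linear-in-the-parameters model (2) by ordinary least squares, with the residual vector $\Xi$ recovered afterwards, can be sketched as below; the synthetic design matrix and true coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 4
X = rng.normal(size=(N, M))                 # model matrix
beta_true = np.array([1.5, -2.0, 0.0, 0.5]) # hypothetical true coefficients
y = X @ beta_true + 0.01 * rng.normal(size=N)

# Estimate beta by least squares, then form the residual Xi = y - X @ beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
Xi = y - X @ beta
```

Gene selection methods go beyond this plain fit by penalizing or pruning the entries of $\beta$ so that only a small subset of predictors remains in the model.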
Different cost functions, often involving a trade-off between model complex-
ity and training accuracy, lead to alternative architectures. Therefore, the core