task. The wrapper method [6] usually consists of a search procedure and an
evaluation criterion. However, an exhaustive search over all subsets of a
high-dimensional feature space is computationally prohibitive. Unlike the filter
and wrapper approaches, which separate variable selection from training, embedded
methods such as boosting [7] incorporate feature selection into the construction
of the classifier or regression model.
Recently, some new embedded learning approaches have been proposed to
achieve grouping effect by introducing the penalty term into the cost function.
The most popular algorithm is the Elastic Net [8], which can encourage a group-
ing effect, i.e., the system variables (genes/regressors) can be naturally grouped
together according to regulatory pathways where the within-group correlations
are very high. In fact, the Elastic Net is a forward selection method: all
previously selected regressors remain fixed, and insignificant regressors cannot
be removed from the model later. As a result, the Elastic Net may miss a good
model. Further effort is therefore still needed to improve prediction
performance and to reduce model size.
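The grouping effect of the Elastic Net described above can be illustrated with a minimal sketch using scikit-learn's `ElasticNet` estimator; the data, the regularization parameters `alpha` and `l1_ratio`, and the noise levels below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 100

# Two highly correlated predictors (one "group") plus an irrelevant one.
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

# The combined L1/L2 penalty encourages correlated predictors to enter
# the model together with similar coefficients (the grouping effect).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```

The first two coefficients come out nearly equal, while the irrelevant predictor is shrunk toward zero; a pure Lasso, by contrast, tends to pick only one member of a correlated group.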
In this paper, an efficient two-stage gene selection (TSGS) method is proposed,
which addresses the small-sample and variable-correlation problems. The paper
is organized as follows. Section 2 gives some preliminaries. Section 3 presents the
proposed two-stage gene selection method. Simulation results are presented in
Section 4, followed by concluding remarks in Section 5.
2 Preliminaries
Suppose a set of data samples, denoted as $D_N = \{(x_i, y_i),\ i = 1, \ldots, N\}$,
where $x_i = [x_{i1}, \ldots, x_{iM}]^T$ is the input vector and $y_i$ is the output.
Let $y = [y(1), \ldots, y(N)]^T$ be the response and
$X = [x_1; x_2; \ldots; x_N] = [x^{(1)}, x^{(2)}, \ldots, x^{(M)}]$ be the model
matrix, where $x^{(j)} = [x_{1j}, \ldots, x_{Nj}]^T$, $j = 1, \ldots, M$, represent
the predictors (i.e., genes or candidate regressors). After a location and scale
transformation procedure, the input and response variables are centred (mean = 0)
to remove the intercept and the input vectors are standardized, i.e.,

$$\sum_{i=1}^{N} y_i = 0, \qquad \sum_{i=1}^{N} x_{ij} = 0, \qquad \sum_{i=1}^{N} x_{ij}^2 = 1, \quad j = 1, \ldots, M. \tag{1}$$
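The centring and standardization in Eq. (1) can be sketched as follows; the random data and dimensions are placeholders for an actual expression matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 5
X = rng.normal(loc=3.0, scale=2.0, size=(N, M))  # raw predictors (genes)
y = rng.normal(loc=1.0, size=N)                  # raw response

y_c = y - y.mean()                        # centre the response: sum(y_i) = 0
Xc = X - X.mean(axis=0)                   # centre each predictor: sum(x_ij) = 0
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))  # scale so sum(x_ij^2) = 1 per column
```

After this transformation the intercept can be dropped from the model, since every column of `Xs` and the response `y_c` have zero mean.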
If these data are microarray gene expression data, then $x^{(j)}$ represents the
$j$th gene, $x_i$ represents the expression levels of the $M$ genes of the $i$th
tissue sample, and $y_i$ may represent the tumor type. The gene selection problem
can be approximated by a linear-in-the-parameters model of the form
$$y = X\beta + \Xi, \tag{2}$$

where $\beta = [\beta_1, \ldots, \beta_M]^T \in \mathbb{R}^M$ is the vector of
estimated coefficients and $\Xi = [\varepsilon(1), \ldots, \varepsilon(N)]^T \in \mathbb{R}^N$
is the residual vector.
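For reference, fitting the linear-in-the-parameters model (2) by ordinary least squares, with the residual vector $\Xi$ recovered afterwards, can be sketched as below; the synthetic design matrix and true coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 4
X = rng.normal(size=(N, M))                 # model matrix
beta_true = np.array([1.5, -2.0, 0.0, 0.5]) # hypothetical true coefficients
y = X @ beta_true + 0.01 * rng.normal(size=N)

# Estimate beta by least squares, then form the residual Xi = y - X @ beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
Xi = y - X @ beta
```

Gene selection methods go beyond this plain fit by penalizing or pruning the entries of $\beta$ so that only a small subset of predictors remains in the model.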
Different cost functions, often involving a trade-off between model complex-
ity and training accuracy, lead to alternative architectures. Therefore, the core