14.1.1.2 Regularizers
A popular choice of the regularizer Ω(β) in (14.1) to perform feature selection (without considering feature groups) is the ℓ1 norm of β, also known as the lasso penalty [23],

\Omega(\beta) := \lambda \|\beta\|_1 = \lambda \sum_{j=1}^{p} |\beta_j|,    (14.2)

for a given λ > 0. This is a convex function in β. The bias term β_0 is usually not included in the regularization. A property of the lasso penalty is that when a feature is not important for fitting the responses with respect to a given value of λ, the lasso penalty sets the corresponding coefficient in β exactly to zero. So there is no need for thresholding to filter out irrelevant features after finding solutions of (14.1). In fact, the value of λ plays a role similar to a threshold value, as we will see later.
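As an illustration of this exact-zero behavior, the following sketch fits a lasso model on synthetic data in which only the first three features carry signal; the fitted coefficient vector then contains exact zeros for the irrelevant features, with no thresholding step needed. This is a minimal illustration, not code from the chapter: the data sizes, the noise level, and the use of scikit-learn (whose `alpha` argument plays the role of λ) are assumptions made for the example.

```python
# Minimal sketch (illustrative data, not from the chapter): the lasso drives
# coefficients of irrelevant features to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n, p = 100, 20                       # sample size and number of features (assumed)
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]     # only the first three features are relevant
y = X @ beta_true + 0.1 * rng.normal(size=n)

# scikit-learn's `alpha` corresponds (up to scaling) to the parameter lambda above.
model = Lasso(alpha=0.1).fit(X, y)

print("estimated coefficients:", np.round(model.coef_, 3))
print("selected features:", np.flatnonzero(model.coef_))
```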
When features are correlated, the lasso tends to select only a few of the correlated features (in an unstable way, especially when p ≫ n). This is not desirable when all correlated features may matter and therefore have to be selected. A remedy for this behavior is the elastic net regularization [26], which augments Ω above as follows,
\Omega(\beta) := \lambda \big( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \big).    (14.3)
Here ‖β‖₂ is the ℓ2 norm (the Euclidean norm) of β. The parameter α ∈ [0, 1] controls the mixing of the ℓ1 and ℓ2 regularizers: the case α = 0 is often referred to as ridge regression, and for α = 1 it becomes the lasso penalty. The elastic net tends to select all correlated features when they are relevant. So correlated groups of features will be identified, but they may not correspond to known groups of features.
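The difference between the two penalties on correlated inputs can be seen in a small experiment such as the one sketched below. The data are illustrative assumptions (two nearly identical noisy copies of one signal plus an irrelevant feature), and scikit-learn's `l1_ratio` plays the role of the mixing parameter α in (14.3). The lasso typically keeps only one of the two copies, whereas the elastic net tends to spread nonzero weight over both.

```python
# Minimal sketch (illustrative data): lasso vs. elastic net on two strongly
# correlated copies of the same underlying signal.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)

n = 200
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),   # feature 0: noisy copy of z
    z + 0.01 * rng.normal(size=n),   # feature 1: another noisy copy of z
    rng.normal(size=n),              # feature 2: irrelevant
])
y = 3.0 * z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio ~ alpha in (14.3)

print("lasso coefficients:      ", np.round(lasso.coef_, 3))
print("elastic net coefficients:", np.round(enet.coef_, 3))
```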
The rest of this chapter is organized as follows. In Sect. 14.2, the extensions of the lasso, namely the group lasso (Sect. 14.2.1), the overlapping group lasso (Sect. 14.2.2), and the sparse group lasso (Sect. 14.2.3), are introduced, and their properties and differences are discussed. A case study on exon microarray data follows in Sect. 14.3, demonstrating a possible use of grouped feature selection in bioinformatics. Some technical issues of the methods are discussed in Sect. 14.4, followed by conclusions in Sect. 14.5.
14.2 Regularized Regression Methods for Grouped Features
When group information on features is available, we can impose it as an extra constraint for feature selection. Suppose that the p features are grouped into K groups, where we represent each of the groups G_1, G_2, ..., G_K as a subset of feature indices, that is, G_k ⊆ {1, 2, ..., p}. For simplicity we assume that all features have their groups assigned, in other words ⋃_{k=1}^{K} G_k = {1, 2, ..., p}.
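In code, this group structure can be represented directly as index sets. The sketch below (group names and sizes are hypothetical, and 0-based column indices are used) stores each G_k as a set of feature indices and checks that the groups together cover all p features, matching the assumption above.

```python
# Minimal sketch (hypothetical groups): G_1, ..., G_K as index sets over the
# p feature columns, with a check that every feature belongs to some group.
p = 10
groups = {
    "G1": {0, 1, 2},
    "G2": {3, 4, 5, 6},
    "G3": {7, 8, 9},
}

covered = set().union(*groups.values())
assert covered == set(range(p)), "every feature must be assigned to a group"

for name, idx in sorted(groups.items()):
    print(name, "->", sorted(idx))
```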