AN OVERVIEW ON VARIABLE SELECTION FOR LONGITUDINAL DATA - Quantitative Medical Data Analysis Using Mathematical Tools and Statistical Techniques


mixed models can be found in [20], and some comparisons of covariance

selection are found in [28].

Example 1. In this example, we generated 200 data sets using Octave (a free alternative to Matlab), each consisting of n = 50 subjects with J = 5 observations per subject (i.e., all $n_i$ equal J = 5), from the following linear model:

$$y_{ij} = x_{ij}^T \beta + 3\varepsilon_{ij},$$

where $\beta = (3, 1, 0, 0, 2, 0, 0, 0)^T$ (i.e., there were 5 inactive and 3 active predictors), and $x_{ij} \sim N_8(0, \Sigma)$, where the diagonal elements of $\Sigma$ all equal 1 and all off-diagonal elements equal 0.6. Furthermore, $(\varepsilon_{i1}, \ldots, \varepsilon_{iJ})^T$ are multivariate normal with a true AR(1) correlation structure with $\rho = 0.7$.
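The data-generating mechanism of Example 1 can be sketched in Python with NumPy (the original simulation used Octave; this is an illustrative translation, not the authors' code). The function name `simulate_dataset` is hypothetical. Two points are assumptions not stated in the text: the covariate vectors are drawn independently across the J observations of a subject, and the errors have unit marginal variance before being scaled by 3.

```python
import numpy as np

def simulate_dataset(n=50, J=5, rho_x=0.6, rho_e=0.7, seed=0):
    """Draw one data set from y_ij = x_ij^T beta + 3 * eps_ij."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta.size

    # Covariate covariance Sigma: unit diagonal, 0.6 off the diagonal.
    Sigma = np.full((p, p), rho_x)
    np.fill_diagonal(Sigma, 1.0)

    # AR(1) correlation matrix for one subject's errors: rho^|j - k|.
    idx = np.arange(J)
    R = rho_e ** np.abs(idx[:, None] - idx[None, :])

    # Assumption: covariates are independent across observations.
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=(n, J))  # (n, J, p)
    # Assumption: errors have unit marginal variance before scaling.
    eps = rng.multivariate_normal(np.zeros(J), R, size=n)         # (n, J)

    y = X @ beta + 3.0 * eps                                      # (n, J)
    return X, y, beta, Sigma, R
```

Calling the function with 200 different seeds would give the 200 replicates; `X.reshape(-1, 8)` and `y.ravel()` stack the observations into the N = nJ = 250 rows used by criteria that ignore the within-subject correlation.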

In our simulation, we compare the following GEE model selection criteria:

(1) naive AIC ignoring correlation, defined as $N \log(\mathrm{RSS}_S/N) + 2\,\mathrm{df}_S$;

(2) naive $C_p$, defined as $\mathrm{RSS}_S + 2\,\mathrm{df}_S\,\hat{\sigma}^2$, where $\hat{\sigma}^2$ is the MSE under the full model;

(3) Cantoni's $C_p$, defined in (4.3);

(4) Pan's AIC, defined in (4.4);

(5) Fu's penalized GEE with the $L_1$ penalty. The tuning parameters $\lambda_j$ were proportional to the unpenalized standard errors; their magnitude was chosen using the modified GCV-like statistic defined in [18].

(6) Penalized GEE with the SCAD penalty. The tuning parameters are selected using the $\mathrm{BIC}_1$ and $\mathrm{BIC}_2$ tuning parameter selectors described in Section 4.2. The resulting procedures are referred to as $\mathrm{SCAD}_1$ and $\mathrm{SCAD}_2$, respectively, in Table 1.
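To make the contrast between the penalties in (5) and (6) concrete, the following minimal sketch evaluates the $L_1$ penalty and the SCAD penalty for a single coefficient. Section 4.2 is not reproduced here, so the SCAD form below is the commonly used three-piece definition with a = 3.7; treat both the form and the default value of `a` as assumptions. The function names are hypothetical.

```python
import numpy as np

def l1_penalty(theta, lam):
    """L1 penalty lam * |theta|; grows linearly without bound."""
    return lam * np.abs(np.asarray(theta, dtype=float))

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty (assumed standard form, a > 2).

    Equals the L1 penalty for |theta| <= lam, bends quadratically for
    lam < |theta| <= a * lam, and is constant for |theta| > a * lam.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    small = lam * t
    middle = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))
```

Because the SCAD penalty levels off for large coefficients while the $L_1$ penalty keeps growing, large nonzero effects are shrunk less under SCAD.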

To find the subsets that minimize the AIC and $C_p$ criteria in (1)-(4), we exhaustively search all $2^8 = 256$ possible subsets. Thus, the corresponding results represent best subset variable selection under each criterion.
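The exhaustive search can be sketched for the two naive criteria, which need only ordinary least squares fits on the stacked observations. This is an illustration under stated assumptions, not the authors' Octave code: $\mathrm{df}_S$ is taken to be the number of predictors in the subset, no intercept is fitted (the model has none), and the full-model MSE is computed as RSS/(N - p). The helper names `rss` and `best_subsets` are hypothetical. Cantoni's $C_p$ and Pan's AIC depend on (4.3) and (4.4) and are not covered.

```python
import itertools
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the OLS fit on the given columns."""
    cols = list(subset)
    if not cols:
        return float(y @ y)  # empty model: fitted values are all zero
    Xs = X[:, cols]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid)

def best_subsets(X, y):
    """Search all 2^p subsets; return the minimizers of naive AIC and C_p."""
    N, p = X.shape
    sigma2_full = rss(X, y, range(p)) / (N - p)  # assumed MSE definition
    best = {"aic": (np.inf, ()), "cp": (np.inf, ())}
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            r = rss(X, y, subset)
            aic = N * np.log(r / N) + 2 * k   # N log(RSS_S / N) + 2 df_S
            cp = r + 2 * k * sigma2_full      # RSS_S + 2 df_S sigma^2
            if aic < best["aic"][0]:
                best["aic"] = (aic, subset)
            if cp < best["cp"][0]:
                best["cp"] = (cp, subset)
    return best
```

With p = 8 the loop visits 256 subsets, so brute force is cheap here; the cost doubles with each additional predictor.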

We compare the variable selection procedures in terms of model complexity and model error, defined by $\mathrm{ME}(\hat{\beta}) = (\hat{\beta} - \beta)^T E(xx^T)(\hat{\beta} - \beta)$ (see [15]). Table 1 reports the mean model error for each procedure and summarizes model complexity by three quantities: correct deletions, the average number per simulation of truly zero coefficients correctly estimated as zero; erroneous deletions, the average number of truly nonzero coefficients erroneously set to zero; and proportion of correct models, the proportion of trials in which exactly the true subset of nonzero predictors was chosen.
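The evaluation measures for a single simulated data set can be computed as in the sketch below. It assumes that $E(xx^T)$ equals the covariate covariance $\Sigma$ (true here because the covariates have mean zero) and that an estimate counts as "set to zero" only when it is exactly 0. The function names are hypothetical; averaging the returned values over the 200 replicates would give summaries of the kind reported in Table 1.

```python
import numpy as np

def model_error(beta_hat, beta, Sigma):
    """ME = (beta_hat - beta)^T Sigma (beta_hat - beta), with Sigma = E(x x^T)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(d @ Sigma @ d)

def selection_summary(beta_hat, beta):
    """Return (correct deletions, erroneous deletions, exact-model flag)."""
    est_zero = np.asarray(beta_hat) == 0
    true_zero = np.asarray(beta) == 0
    correct = int(np.sum(est_zero & true_zero))     # true zeros estimated as 0
    erroneous = int(np.sum(est_zero & ~true_zero))  # true nonzeros set to 0
    exact = bool(np.array_equal(est_zero, true_zero))
    return correct, erroneous, exact
```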
