by using only a small number of carefully selected other sequences. We
can thus do some preprocessing of a training set, to find a promising
subset of sequences, and to apply MUSCLES only to those (hence the
name Selective MUSCLES).
Assume that sequence $i$ is the one notoriously delayed and we need
to estimate its “delayed” values $x_{t,i}$. For a given tracking window span
$W$, among the $v = W \cdot n + n - 1$ independent variables, we have to
choose the ones that are most useful in estimating the delayed value of
$x_{t,i}$. More generally, we want to solve the following:
Problem 5.1 (Subset selection) Given $v$ independent variables
$x_1, x_2, \ldots, x_v$ and a dependent variable $y$ with $N$ samples each, find the
best $b$ ($< v$) independent variables to minimize the mean-square error
for $y$ for the given samples.
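To make the problem statement concrete, the following is a minimal sketch of the naive approach: try every size-$b$ subset and keep the one whose least-squares fit leaves the smallest sum of squared residuals. It assumes NumPy is available; the function names `squared_error` and `best_subset` and the synthetic data are hypothetical, chosen only for illustration. Exhaustive search is feasible only for small $v$ and is not presented here as the selection method of the text.

```python
from itertools import combinations

import numpy as np


def squared_error(X, y, subset):
    """Sum of squared residuals of the least-squares fit of y on X[:, subset]."""
    Xs = X[:, list(subset)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid)


def best_subset(X, y, b):
    """Exhaustively test every size-b subset of columns (small v only)."""
    v = X.shape[1]
    return min(combinations(range(v), b),
               key=lambda s: squared_error(X, y, s))


# Synthetic data (an assumption for this sketch): y depends on columns 1 and 4.
rng = np.random.default_rng(0)
N, v = 200, 6
X = rng.standard_normal((N, v))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.1 * rng.standard_normal(N)

chosen = best_subset(X, y, b=2)
print(chosen)
```

The number of candidate subsets grows as $\binom{v}{b}$, which is why a cheaper selection criterion is worth looking for.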
We need a measure of goodness to decide which subset of b variables
is the best we can choose. Ideally, we should choose the best subset
that yields the smallest estimation error in the future. Since, however,
we don't have future samples, we can only infer the expected estimation
error (EEE for short) from the available samples as follows:
$$\mathrm{EEE}(S) = \sum_{t=1}^{N} \left( y[t] - \hat{y}_S[t] \right)^2$$
where $S$ is the selected subset of variables and $\hat{y}_S[t]$ is the estimation
based on $S$ for the $t$-th sample. Note that, thanks to Eq. 5.3, EEE($S$)
can be computed in $O(N \cdot |S|^2)$ time. Let's say that we are allowed
to keep only $b = 1$ independent variable. Which one should we choose?
Intuitively, we could try the one that has the highest (in absolute value)
correlation coefficient with $y$. It turns out that this is indeed optimal,
as Lemma 5.2 states. (To satisfy the unit variance assumption, we will
normalize samples by the sample variance within the window.)
Lemma 5.2 Given a dependent variable $y$ and $v$ independent variables
with unit variance, the best single variable to keep to minimize EEE($S$)
is the one with the highest absolute correlation coefficient with $y$.
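Lemma 5.2 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original text: it uses NumPy, synthetic data, and a hypothetical helper `single_variable_eee`. Each column is centered and scaled to unit variance, EEE is computed for every single-variable fit, and the minimizer is compared with the variable of highest absolute correlation.

```python
import numpy as np


def single_variable_eee(x, y):
    """EEE({x}) for the one-variable least-squares fit y ~ a * x."""
    a = (x @ y) / (x @ x)  # least-squares coefficient, a = p / d
    resid = y - a * x
    return float(resid @ resid)


# Synthetic data (an assumption for this sketch).
rng = np.random.default_rng(1)
N, v = 500, 5
X = rng.standard_normal((N, v))
y = 0.5 * X[:, 0] + 1.5 * X[:, 3] + rng.standard_normal(N)

# Normalize: zero mean and unit variance per column, as the lemma assumes.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

eee = np.array([single_variable_eee(X[:, i], y) for i in range(v)])
corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(v)])

best_by_eee = int(np.argmin(eee))
best_by_corr = int(np.argmax(np.abs(corr)))
print(best_by_eee, best_by_corr)
```

Because every normalized column has the same squared norm, ranking variables by EEE and ranking them by absolute correlation give the same winner, so a single pass over the correlations is enough when $b = 1$.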
Proof. For a single variable, if $a$ is the least squares solution, we can
express the error in matrix form as
$$\mathrm{EEE}(\{x_i\}) = \|y\|^2 - 2a\,(y^T x_i) + a^2 \|x_i\|^2.$$
Let $d$ and $p$ denote $\|x_i\|^2$ and $(x_i^T y)$, respectively. Since $a = d^{-1} p$,
$$\mathrm{EEE}(\{x_i\}) = \|y\|^2 - p^2 d^{-1}.$$
To minimize the error, we must choose $x_i$