Database Reference
In-Depth Information
which maximize
p
2
and minimize
d
. Assuming unit-variance (
d
=1),
such
x
i
is the one with the biggest correlation coecient to
y
.Th s
concludes the proof.
The question is how we should handle the case when
b>
1. Normally,
we should consider all the possible groups of
b
independent variables,
and try to pick the best. This approach explodes combinatorially; thus
we propose to use a greedy algorithm. At each step
s
, we select the inde-
pendent variable
x
s
that minimizes the EEE for the dependent variable
y
, in light of the
s
1 independent variables that we have already chosen
in the previous steps.
The bottleneck of the algorithm is clearly the computation of EEE.
Since it computes EEE approximately
O
(
v · b
) times and each computa-
tion of EEE requires
O
(
N ·b
2
) in average, the overall complexity mounts
to
O
(
N · v · b
3
). To reduce the overhead, we observe that intermediate
results produced for EEE(
S
) can be re-used for EEE(
S∪{x}
).
−
Lemma 5.3
The complexity of the greedy selection algorithm is
O
(
N
·
b
2
)
.
v
·
+
be
+
)istheinverse
S
S∪{
x
}
S
Proof.
Let
. The core in computing EEE(
of
D
S
+
=(
X
T
S
+
X
S
+
). Thanks to block matrix inversion formula [34] (p.
656) and the availability of
D
−
1
from the previous iteration step, it can
S
2
). Hence, summing it up over
v
be computed in
O
(
N
·|S|
+
|
S
|
−|S|
b
2
+
v
b
3
)
remaining variables for each
b
iteration, we have
O
(
N
·
v
·
·
b
2
).
complexity. Since
N
b
, it reduces to
O
(
N
·
v
·
Subset-selection can be done infrequently and off-line, say every
N
=
W
time-ticks. The optimal choice of the reorganization window is beyond
the scope of this chapter. Potential solutions include (a) doing reorga-
nization during off-peak hours, (b) triggering a reorganization whenever
the estimation error for by increases above an application-dependent
threshold etc. Also, by normalizing the training set, the unit-variance
assumption in Theorem 1 can be easily satisfied.
6. Tracking Correlations and Hidden Variables:
SPIRIT
In this section we present a framework for discovering patterns in
multiple streams. In the next section, we show how these can be used
to perform effective, low-cost forecasting. We use auto-regression for its
simplicity, but our framework allows any forecasting algorithm to take
advantage of the compact representation of the stream collection.