along the $d$-th dimension. The effect of these hyperparameters can be shown more
clearly if we rewrite Eq. 6 as

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \prod_{d=1}^{D} \exp\left(-\frac{(x_d - x_d')^2}{2\ell_d^2}\right). \qquad (7)$$
If the characteristic length-scale $\ell_d$ for the $d$-th dimension is large, the $d$-th exponent
will be close to zero, so the corresponding factor approaches one and the covariance
becomes independent of that input dimension. This is a form of automatic relevance
determination (ARD) [11], or "soft" feature selection: when estimating the hyperparameters,
irrelevant input dimensions are effectively ignored because their length-scales are
fitted to relatively large values.
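This ARD effect can be checked numerically. The sketch below (an illustration, not the authors' code; all names are ours) evaluates the kernel of Eq. 7 at two points that differ only in their second dimension, first with a moderate length-scale and then with a very large one:

```python
import numpy as np

def ard_se_kernel(x, y, lengthscales, sigma_f=1.0):
    """ARD squared-exponential kernel: one length-scale per input dimension."""
    d = (x - y) / lengthscales          # per-dimension scaled differences
    return sigma_f**2 * np.exp(-0.5 * np.sum(d**2))

x = np.array([1.0, 5.0])
y = np.array([1.0, 9.0])

# Moderate length-scale on dimension 1: the difference there drives the covariance down.
k_relevant = ard_se_kernel(x, y, lengthscales=np.array([1.0, 1.0]))

# Very large length-scale on dimension 1: its exponent is ~0, its factor ~1,
# and the covariance depends on dimension 0 alone (where the points agree).
k_irrelevant = ard_se_kernel(x, y, lengthscales=np.array([1.0, 1e6]))

print(k_relevant)    # exp(-8) ~ 3.35e-4
print(k_irrelevant)  # ~ 1.0 -- dimension 1 effectively ignored
```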
However, this type of kernel introduces one hyperparameter per input dimension, and
the common problem of overfitting becomes severe for high-dimensional inputs [3].
This is especially the case for time series prediction. If we have $m$ co-varying
feature series, and for each of them we consider the $h$ time steps before the
current prediction, the total number of features is $mh$, which grows rapidly as we
consider longer historical sequences. This limits the power of the time series model:
it must either include fewer features or use shorter historical sequences. Furthermore,
the time needed to train an ARD kernel is significantly longer than for its isotropic
counterpart (i.e., setting $\ell_1 = \ell_2 = \dots = \ell_D = \ell$). The isotropic SE
kernel, although it usually performs well, cannot distinguish the importance of
different input dimensions. We therefore propose a weight-sharing kernel that strikes
a balance between the two.
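The hyperparameter counts behind this trade-off are easy to tabulate. With hypothetical sizes (ours, not the paper's) of $m$ co-varying series and $h$ lagged time steps each:

```python
# Hypothetical sizes: m co-varying series, each contributing h lagged time steps.
m, h = 5, 20
n_features = m * h                  # 100 input dimensions in total

ard_lengthscales = n_features       # ARD: one length-scale per dimension
isotropic_lengthscales = 1          # isotropic SE: a single shared length-scale
group_shared_lengthscales = m       # weight-sharing: one length-scale per series

print(n_features, ard_lengthscales, group_shared_lengthscales)  # 100 100 5
```

Doubling the history length $h$ doubles the ARD hyperparameter count but leaves the group-shared count at $m$.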
In the field of time series prediction, the input features usually come from co-
varying time sequences and therefore are naturally grouped. For example, the features
extracted from Google Trends can be viewed as a feature group. The main idea is to
reduce the number of hyperparameters by sharing the same length-scale among fea-
tures belonging to the same group, while at the same time possessing the ability to
determine the importance of different groups of features.
The kernel consists of a weighted sum of SE kernels:

$$k(\mathbf{x}, \mathbf{x}') = \sum_{g \in \mathcal{G}} \sigma_g^2 \exp\left(-\frac{1}{2\ell_g^2} \sum_{d \in g} (x_d - x_d')^2\right) + \sigma_r^2 \exp\left(-\frac{1}{2} \sum_{d \in \mathcal{R}} \frac{(x_d - x_d')^2}{\ell_d^2}\right). \qquad (8)$$
The first term is a weighted sum of isotropic SE kernels designed for the time
co-varying features. We denote the set of time co-varying feature groups (i.e.,
"Opinion", "Google Trends", "Facebook") as $\mathcal{G}$. For each feature group $g$, the
number of features in the group is denoted $D_g$, and the overall importance of
the group is $\sigma_g^2$. The same length-scale $\ell_g$ is shared among all features belonging to
the group. This can significantly reduce the number of hyperparameters.
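A minimal NumPy sketch of the first term of Eq. 8, assuming each group is given as a set of feature indices with a shared length-scale and a weight (the group names, index ranges, and hyperparameter values below are illustrative, not taken from the paper):

```python
import numpy as np

def weight_sharing_kernel(x, y, groups):
    """Weighted sum of isotropic SE kernels, one per co-varying feature group.

    `groups` maps a group name to (indices, lengthscale, sigma): all features
    in `indices` share one length-scale, and sigma**2 weights the group.
    """
    k = 0.0
    for indices, lengthscale, sigma in groups.values():
        sq_dist = np.sum((x[indices] - y[indices]) ** 2)
        k += sigma**2 * np.exp(-sq_dist / (2.0 * lengthscale**2))
    return k

x = np.random.default_rng(0).normal(size=6)
y = np.random.default_rng(1).normal(size=6)

# Hypothetical grouping: features 0-2 from "Google Trends", 3-5 from "Facebook".
groups = {
    "Google Trends": (np.arange(0, 3), 1.5, 0.8),
    "Facebook":      (np.arange(3, 6), 0.5, 0.3),
}
print(weight_sharing_kernel(x, y, groups))
```

With this grouping only two length-scales are estimated instead of six, while the per-group weights still let the model rank the importance of each source.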