The second term of the kernel is a weighted sum of ARD SE kernels over the set of time-invariant feature groups, i.e. features that require a separate length-scale for each dimension to function properly (e.g. "First Episode rating", "Past 3 Episodes ratings" and "Weekdays"). Usually, the number of time-invariant features is much smaller than the number of time co-varying features, so this term does not add too many hyperparameters to the model.
As a brief example, assume that we consider a number of co-varying time sequences, each of a fixed length. In our case each sequence is of length 4 and there are 11 time co-varying features (4 from "Opinion", 3 from "Google Trends", and 4 from "Facebook"), giving 44 input dimensions in total. An ARD kernel would then have 44 length-scale hyperparameters to be learnt, making the inference slow and the prediction inaccurate. The weight-sharing kernel, on the other hand, introduces 2 hyperparameters (a weight and a shared length-scale) for each feature group, merely 6 in total. With the weight-sharing kernel applied, the inference is much faster and, as will be shown in the Experiment section, a better performance is achieved.
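The hyperparameter accounting above can be made concrete with a minimal numpy sketch of the two parameterizations. The group boundaries, weights and length-scale values below are illustrative assumptions, not taken from the paper; the point is only that the ARD kernel needs one length-scale per input dimension while the weight-sharing kernel needs two hyperparameters per feature group.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, variance):
    """ARD SE kernel: one length-scale per input dimension.
    k(x, x') = variance * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def weight_sharing_kernel(X1, X2, groups, lengthscales, weights):
    """Weighted sum of SE kernels, one per feature group; all dimensions
    inside a group share a single length-scale, so each group contributes
    only 2 hyperparameters (a weight and a length-scale)."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for dims, ell, w in zip(groups, lengthscales, weights):
        diff = (X1[:, dims][:, None, :] - X2[:, dims][None, :, :]) / ell
        K += w * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    return K

# 11 time co-varying features, each a length-4 sequence -> 44 input dims.
# A full ARD kernel needs 44 length-scales; with 3 feature groups the
# weight-sharing kernel needs 3 * 2 = 6 hyperparameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 44))
groups = [list(range(0, 16)), list(range(16, 28)), list(range(28, 44))]  # hypothetical split
K = weight_sharing_kernel(X, X, groups, [1.0, 2.0, 0.5], [1.0, 1.0, 0.5])
```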
4.3 Training
In general, the hyperparameters of a Gaussian process model can be learnt by maximizing the marginal likelihood or by using Markov chain Monte Carlo methods such as slice sampling [10]. We adopt in this work the maximum marginal likelihood framework, also known as Type-II maximum likelihood (ML-II) or empirical Bayes.
In the ML-II framework, the hyperparameters are chosen by maximizing the probability of observing the target values $\mathbf{y} = (y_1, \ldots, y_N)^\top$ given the inputs $\{\mathbf{x}_i\}_{i=1,\ldots,N}$. Let $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))^\top$ be the vector of function values evaluated at the $N$ training input data points. Since $f$ is a random function sampled from a GP, the vector $\mathbf{f}$ is an $N$-dimensional normally distributed random vector, i.e. $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, K)$, where $K$ is the kernel matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The exact form of the marginal likelihood is given by marginalizing over the random vector $\mathbf{f}$:
$$p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X, \boldsymbol{\theta})\, d\mathbf{f}, \qquad (9)$$
where $\boldsymbol{\theta}$ is the vector of hyperparameters of the kernel. From Eq. 1 it is clear that $\mathbf{y} \mid X \sim \mathcal{N}(\mathbf{0}, K_y)$, where $K_y = K + \sigma_n^2 I$. It follows that the log marginal likelihood is
$$\log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} - \frac{1}{2}\log\lvert K_y \rvert - \frac{N}{2}\log 2\pi. \qquad (10)$$
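Eq. (10) can be evaluated directly in a few lines of numpy. The sketch below (an illustration, not the paper's implementation) uses a Cholesky factorization of $K_y$, the standard numerically stable way to compute both the quadratic term and the log determinant.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Eq. (10): -0.5 y^T Ky^{-1} y - 0.5 log|Ky| - (N/2) log(2 pi),
    with Ky = K + sigma_n^2 I, evaluated via a Cholesky factor L of Ky."""
    N = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # 0.5 * log|Ky| = sum_i log L_ii
            - 0.5 * N * np.log(2.0 * np.pi))
```

With $K = 0$ and unit noise variance, $K_y = I$ and the expression reduces to the log density of a standard normal, a convenient sanity check.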
To find the best hyperparameters with ML-II, we must take the derivatives of the log marginal likelihood with respect to the hyperparameters, as shown below:

$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2}\operatorname{tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right). \qquad (11)$$
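As a hedged illustration of Eqs. (10) and (11) together, the sketch below computes the log marginal likelihood and its gradient with respect to a single shared length-scale of an isotropic SE kernel (the kernel choice and variable names are assumptions for the example, not the paper's model). In practice such gradients are handed to an off-the-shelf optimizer, e.g. `scipy.optimize.minimize` with L-BFGS-B.

```python
import numpy as np

def lml_and_grad(X, y, ell, noise_var):
    """Log marginal likelihood (Eq. 10) and its derivative w.r.t. the
    length-scale ell (Eq. 11) for an isotropic SE kernel
    k(x, x') = exp(-0.5 * |x - x'|^2 / ell^2)."""
    N = y.shape[0]
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sqdist / ell ** 2)
    Ky = K + noise_var * np.eye(N)
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y                       # Ky^{-1} y
    lml = (-0.5 * y @ alpha
           - 0.5 * np.linalg.slogdet(Ky)[1]
           - 0.5 * N * np.log(2.0 * np.pi))
    dK = K * sqdist / ell ** 3               # closed-form dK/d ell for the SE kernel
    # Eq. (11): 0.5 * y^T Ky^{-1} dK Ky^{-1} y - 0.5 * tr(Ky^{-1} dK)
    grad = 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Ky_inv @ dK)
    return lml, grad
```

A finite-difference check on a small random problem confirms that the analytic gradient matches Eq. (11).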