The second term of the kernel is a weighted sum of ARD SE kernels over the set of time-invariant feature groups, i.e. features that require a separate length-scale for each dimension to function properly (e.g. “First Episode rating”, “Past 3 Episodes ratings” and “Weekdays”). Usually, the number of time-invariant features is much smaller than the number of time co-varying features, so this term does not add too many hyperparameters to the model.
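For concreteness, here is a minimal sketch of an ARD SE kernel of the kind summed in this term. The function name and the parameterization (a signal standard deviation and a vector of per-dimension length-scales) are our illustrative choices, not notation from the text:

```python
import numpy as np

def ard_se(x1, x2, sigma, ell):
    """ARD squared-exponential kernel: one length-scale per input
    dimension, so len(ell) equals the dimensionality of x1 and x2."""
    d2 = np.sum(((x1 - x2) / ell) ** 2)   # per-dimension scaled distance
    return sigma ** 2 * np.exp(-0.5 * d2)
```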
As a brief example, assume that we consider a set of time co-varying sequences, each of length $T$. In our case $T = 4$ and there are 11 time co-varying features (4 from “Opinion”, 3 from “Google Trends”, and 4 from “Facebook”). A full ARD kernel would have $11 \times 4 = 44$ hyperparameters to be learnt, making the inference slow and the prediction inaccurate. The weight-sharing kernel, on the other hand, introduces 2 hyperparameters (a weight and a shared length-scale) for each feature group, merely 6 in total. With the weight-sharing kernel applied, the inference is much faster and, as will be shown in the Experiment section, better performance is achieved.
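To make the count concrete, the sketch below contrasts the two settings under the running example. The group layout, the flattened 44-dimensional input, and the exact weighted-sum form are our assumptions for illustration rather than a verbatim restatement of the kernel defined above:

```python
import numpy as np

T = 4                                        # length of each time sequence
groups = {"Opinion": 4, "Google Trends": 3, "Facebook": 4}

def weight_sharing_kernel(x1, x2, w, ell):
    """Weighted sum of SE kernels: one (weight, length-scale) pair per
    feature group; x1, x2 are flattened 44-dim time co-varying inputs."""
    k, start = 0.0, 0
    for g, n_feats in enumerate(groups.values()):
        end = start + n_feats * T            # dimensions of this group
        d2 = np.sum((x1[start:end] - x2[start:end]) ** 2)
        k += w[g] ** 2 * np.exp(-0.5 * d2 / ell[g] ** 2)
        start = end
    return k

print(sum(groups.values()) * T)              # full ARD: 44 length-scales
print(2 * len(groups))                       # weight-sharing: 6 in total
```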
4.3 Training
In general, the hyperparameters of a Gaussian process model can be learnt by maximizing the marginal likelihood or by using Markov chain Monte Carlo methods such as slice sampling [10]. In this work we adopt the maximum marginal likelihood framework, also known as Type-II maximum likelihood (ML-II) or empirical Bayes.
In the ML-II framework, the hyperparameters are chosen by maximizing the probability of observing the target values $\mathbf{y} = (y_1, \dots, y_N)^\top$ given the inputs $X = \{\mathbf{x}_i\}_{i=1,\dots,N}$. Let $\mathbf{f} = \big(f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_N)\big)^\top$ be the vector of function values evaluated at the $N$ training input data points. Since the function $f \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big)$ is a random function sampled from a GP, the vector $\mathbf{f}$ is an $N$-dimensional normally distributed random vector, i.e. $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, K)$. The exact form of the marginal likelihood is given by marginalizing over the random vector $\mathbf{f}$:
$$p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X, \boldsymbol{\theta})\, d\mathbf{f} \qquad (9)$$
where $\boldsymbol{\theta}$ is the vector of hyperparameters of the kernel. From Eq. 1 it is clear that $\mathbf{y} \mid X \sim \mathcal{N}(\mathbf{0}, K_y)$, where $K_y = K + \sigma_n^2 I$. It follows that the log marginal likelihood is
$$\log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\, \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi. \qquad (10)$$
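In practice Eq. 10 is evaluated through a Cholesky factorization of $K_y$ rather than an explicit inverse. A minimal sketch (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(K, y, noise_var):
    """Eq. 10 with K_y = K + sigma_n^2 I, via a Cholesky factorization."""
    N = y.shape[0]
    Ky = K + noise_var * np.eye(N)
    L, lower = cho_factor(Ky, lower=True)
    alpha = cho_solve((L, lower), y)         # alpha = K_y^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))     # equals 0.5 * log|K_y|
            - 0.5 * N * np.log(2.0 * np.pi))
```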
To find the best hyperparameters with ML-II, we must take the derivatives of the
log marginal likelihood with respect to the hyperparameters, as shown below:
$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \frac{1}{2}\, \mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\!\left( K_y^{-1} \frac{\partial K_y}{\partial \theta_j} \right). \qquad (11)$$
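These derivatives can be handed to any gradient-based optimizer. Below is a sketch of ML-II with scipy.optimize.minimize, where kernel(theta) and kernel_grad(theta, j) are hypothetical helpers returning $K_y$ and $\partial K_y / \partial \theta_j$ for the kernel at hand:

```python
import numpy as np
from scipy.optimize import minimize

def lml_gradient(Ky, dKy, y):
    """Eq. 11 for one hyperparameter, given dKy = dK_y/dtheta_j."""
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y                       # K_y^{-1} y
    return 0.5 * (alpha @ dKy @ alpha) - 0.5 * np.trace(Ky_inv @ dKy)

def neg_lml_and_grad(theta, y, kernel, kernel_grad):
    """Negative Eq. 10 and its gradient (Eq. 11), for minimization."""
    Ky = kernel(theta)
    N = y.shape[0]
    alpha = np.linalg.solve(Ky, y)
    _, logdet = np.linalg.slogdet(Ky)
    nll = 0.5 * y @ alpha + 0.5 * logdet + 0.5 * N * np.log(2.0 * np.pi)
    grad = -np.array([lml_gradient(Ky, kernel_grad(theta, j), y)
                      for j in range(theta.size)])
    return nll, grad

# res = minimize(neg_lml_and_grad, theta0, args=(y, kernel, kernel_grad),
#                jac=True, method="L-BFGS-B")
```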