The second term of the kernel is a weighted sum of ARD SE kernels over the set of time-invariant feature groups, i.e. features that require a separate length-scale for each dimension to function properly (e.g. "First Episode rating", "Past 3 Episodes ratings" and "Weekdays"). Usually, the number of time-invariant features is much smaller than the number of time co-varying features, so this term does not add too many hyperparameters to the model.
As a brief example, assume that we consider a number of co-varying time sequences, each of a fixed length. In our case each sequence is of length 4 and there are 11 time co-varying features (4 from "Opinion", 3 from "Google Trends", and 4 from "Facebook"), giving 44 input dimensions in total. An ARD kernel would then have 44 length-scale hyperparameters to be learnt, making the inference slow and the prediction inaccurate. The weight-sharing kernel, on the other hand, introduces 2 hyperparameters (a weight and a shared length-scale) for each feature group, merely 6 in total. With the weight-sharing kernel applied, the inference is much faster and, as will be shown in the Experiment section, a better performance is achieved.
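The hyperparameter accounting above can be made concrete with a minimal numpy sketch of the two parameterizations. The group boundaries, weights and length-scale values below are illustrative assumptions, not taken from the paper; the point is only that the ARD kernel needs one length-scale per input dimension while the weight-sharing kernel needs two hyperparameters per feature group.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, variance):
    """ARD SE kernel: one length-scale per input dimension.
    k(x, x') = variance * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def weight_sharing_kernel(X1, X2, groups, lengthscales, weights):
    """Weighted sum of SE kernels, one per feature group; all dimensions
    inside a group share a single length-scale, so each group contributes
    only 2 hyperparameters (a weight and a length-scale)."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for dims, ell, w in zip(groups, lengthscales, weights):
        diff = (X1[:, dims][:, None, :] - X2[:, dims][None, :, :]) / ell
        K += w * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    return K

# 11 time co-varying features, each a length-4 sequence -> 44 input dims.
# A full ARD kernel needs 44 length-scales; with 3 feature groups the
# weight-sharing kernel needs 3 * 2 = 6 hyperparameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 44))
groups = [list(range(0, 16)), list(range(16, 28)), list(range(28, 44))]  # hypothetical split
K = weight_sharing_kernel(X, X, groups, [1.0, 2.0, 0.5], [1.0, 1.0, 0.5])
```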
4.3 Training
In general, the hyperparameters of a Gaussian process model can be learnt by maximizing the marginal likelihood or by using Markov chain Monte Carlo methods such as slice sampling [10]. We adopt in this work the maximum marginal likelihood framework, also known as Type-II maximum likelihood (ML-II) or empirical Bayes.
In the ML-II framework, the hyperparameters are chosen by maximizing the probability of observing the target values $\mathbf{y} = (y_1, \ldots, y_N)^\top$ given the inputs $\{\mathbf{x}_i\}_{i=1,\ldots,N}$. Let $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))^\top$ be the vector of function values evaluated at the $N$ training input data points. Since $f$ is a random function sampled from a GP, the vector $\mathbf{f}$ is an $N$-dimensional normally distributed random vector, i.e. $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{0}, K)$, where $K$ is the kernel matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The exact form of the marginal likelihood is given by marginalizing over the random vector $\mathbf{f}$:
$$p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X, \boldsymbol{\theta})\, d\mathbf{f}, \qquad (9)$$
where $\boldsymbol{\theta}$ is the vector of hyperparameters of the kernel. From Eq. 1 it is clear that $\mathbf{y} \mid X \sim \mathcal{N}(\mathbf{0}, K_y)$, where $K_y = K + \sigma_n^2 I$. It follows that the log marginal likelihood is
$$\log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} - \frac{1}{2}\log\lvert K_y \rvert - \frac{N}{2}\log 2\pi. \qquad (10)$$
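Eq. (10) can be evaluated directly in a few lines of numpy. The sketch below (an illustration, not the paper's implementation) uses a Cholesky factorization of $K_y$, the standard numerically stable way to compute both the quadratic term and the log determinant.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Eq. (10): -0.5 y^T Ky^{-1} y - 0.5 log|Ky| - (N/2) log(2 pi),
    with Ky = K + sigma_n^2 I, evaluated via a Cholesky factor L of Ky."""
    N = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # 0.5 * log|Ky| = sum_i log L_ii
            - 0.5 * N * np.log(2.0 * np.pi))
```

With $K = 0$ and unit noise variance, $K_y = I$ and the expression reduces to the log density of a standard normal, a convenient sanity check.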
To find the best hyperparameters with ML-II, we must take the derivatives of the log marginal likelihood with respect to the hyperparameters, as shown below:

$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2}\operatorname{tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right). \qquad (11)$$
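As a hedged illustration of Eqs. (10) and (11) together, the sketch below computes the log marginal likelihood and its gradient with respect to a single shared length-scale of an isotropic SE kernel (the kernel choice and variable names are assumptions for the example, not the paper's model). In practice such gradients are handed to an off-the-shelf optimizer, e.g. `scipy.optimize.minimize` with L-BFGS-B.

```python
import numpy as np

def lml_and_grad(X, y, ell, noise_var):
    """Log marginal likelihood (Eq. 10) and its derivative w.r.t. the
    length-scale ell (Eq. 11) for an isotropic SE kernel
    k(x, x') = exp(-0.5 * |x - x'|^2 / ell^2)."""
    N = y.shape[0]
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sqdist / ell ** 2)
    Ky = K + noise_var * np.eye(N)
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y                       # Ky^{-1} y
    lml = (-0.5 * y @ alpha
           - 0.5 * np.linalg.slogdet(Ky)[1]
           - 0.5 * N * np.log(2.0 * np.pi))
    dK = K * sqdist / ell ** 3               # closed-form dK/d ell for the SE kernel
    # Eq. (11): 0.5 * y^T Ky^{-1} dK Ky^{-1} y - 0.5 * tr(Ky^{-1} dK)
    grad = 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Ky_inv @ dK)
    return lml, grad
```

A finite-difference check on a small random problem confirms that the analytic gradient matches Eq. (11).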