along the $d$-th dimension. The effect of these hyperparameters can be shown more
clearly if we rewrite Eq. 6 as

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \prod_{d=1}^{D} \exp\left(-\frac{(x_d - x_d')^2}{2\ell_d^2}\right). \qquad (7)$$
If the characteristic length-scale $\ell_d$ for the $d$-th dimension is large, the $d$-th exponent
will be close to zero, so the corresponding factor approaches one and the covariance
becomes independent of that input dimension. This is a form of automatic relevance
determination (ARD) [11], or "soft" feature selection: when estimating the hyperparameters,
irrelevant input dimensions are effectively ignored because their length-scales are
fitted to relatively large values.
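This ARD effect can be checked numerically. The sketch below (an illustration, not the authors' code; all names are ours) evaluates the kernel of Eq. 7 at two points that differ only in their second dimension, first with a moderate length-scale and then with a very large one:

```python
import numpy as np

def ard_se_kernel(x, y, lengthscales, sigma_f=1.0):
    """ARD squared-exponential kernel: one length-scale per input dimension."""
    d = (x - y) / lengthscales          # per-dimension scaled differences
    return sigma_f**2 * np.exp(-0.5 * np.sum(d**2))

x = np.array([1.0, 5.0])
y = np.array([1.0, 9.0])

# Moderate length-scale on dimension 1: the difference there drives the covariance down.
k_relevant = ard_se_kernel(x, y, lengthscales=np.array([1.0, 1.0]))

# Very large length-scale on dimension 1: its exponent is ~0, its factor ~1,
# and the covariance depends on dimension 0 alone (where the points agree).
k_irrelevant = ard_se_kernel(x, y, lengthscales=np.array([1.0, 1e6]))

print(k_relevant)    # exp(-8) ~ 3.35e-4
print(k_irrelevant)  # ~ 1.0 -- dimension 1 effectively ignored
```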
However, this type of kernel introduces one hyperparameter per input dimension, and
the common problem of overfitting becomes severe for high-dimensional inputs [3].
This is especially the case for time series prediction. If we have $m$ co-varying
feature series, and for each of them we consider the $h$ time steps before the
current prediction, the total number of features is $mh$, which grows rapidly as we
consider longer historical sequences. This limits the power of the time series model:
it must either include fewer features or use shorter historical sequences. Furthermore,
the time needed to train an ARD kernel is significantly longer than for its isotropic
counterpart (i.e., setting $\ell_1 = \ell_2 = \dots = \ell_D = \ell$). The isotropic SE
kernel, although it usually performs well, cannot distinguish the importance of
different input dimensions. We therefore propose a weight-sharing kernel that strikes
a balance between the two.
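The hyperparameter counts behind this trade-off are easy to tabulate. With hypothetical sizes (ours, not the paper's) of $m$ co-varying series and $h$ lagged time steps each:

```python
# Hypothetical sizes: m co-varying series, each contributing h lagged time steps.
m, h = 5, 20
n_features = m * h                  # 100 input dimensions in total

ard_lengthscales = n_features       # ARD: one length-scale per dimension
isotropic_lengthscales = 1          # isotropic SE: a single shared length-scale
group_shared_lengthscales = m       # weight-sharing: one length-scale per series

print(n_features, ard_lengthscales, group_shared_lengthscales)  # 100 100 5
```

Doubling the history length $h$ doubles the ARD hyperparameter count but leaves the group-shared count at $m$.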
In the field of time series prediction, the input features usually come from co-
varying time sequences and therefore are naturally grouped. For example, the features
extracted from Google Trends can be viewed as a feature group. The main idea is to
reduce the number of hyperparameters by sharing the same length-scale among fea-
tures belonging to the same group, while at the same time possessing the ability to
determine the importance of different groups of features.
The kernel consists of a weighted sum of SE kernels:

$$k(\mathbf{x}, \mathbf{x}') = \sum_{g \in \mathcal{G}} \sigma_g^2 \exp\left(-\frac{1}{2\ell_g^2} \sum_{d \in g} (x_d - x_d')^2\right) + \sigma_r^2 \exp\left(-\frac{1}{2} \sum_{d \in \mathcal{R}} \frac{(x_d - x_d')^2}{\ell_d^2}\right). \qquad (8)$$
The first term is a weighted sum of isotropic SE kernels designed for the time
co-varying features. We denote the set of time co-varying feature groups (i.e.,
"Opinion", "Google Trends", "Facebook") as $\mathcal{G}$. For each feature group $g$, the
number of features in the group is denoted $D_g$, and the overall importance of
the group is $\sigma_g^2$. The same length-scale $\ell_g$ is shared among all features belonging to
the group. This can significantly reduce the number of hyperparameters.
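A minimal NumPy sketch of the first term of Eq. 8, assuming each group is given as a set of feature indices with a shared length-scale and a weight (the group names, index ranges, and hyperparameter values below are illustrative, not taken from the paper):

```python
import numpy as np

def weight_sharing_kernel(x, y, groups):
    """Weighted sum of isotropic SE kernels, one per co-varying feature group.

    `groups` maps a group name to (indices, lengthscale, sigma): all features
    in `indices` share one length-scale, and sigma**2 weights the group.
    """
    k = 0.0
    for indices, lengthscale, sigma in groups.values():
        sq_dist = np.sum((x[indices] - y[indices]) ** 2)
        k += sigma**2 * np.exp(-sq_dist / (2.0 * lengthscale**2))
    return k

x = np.random.default_rng(0).normal(size=6)
y = np.random.default_rng(1).normal(size=6)

# Hypothetical grouping: features 0-2 from "Google Trends", 3-5 from "Facebook".
groups = {
    "Google Trends": (np.arange(0, 3), 1.5, 0.8),
    "Facebook":      (np.arange(3, 6), 0.5, 0.3),
}
print(weight_sharing_kernel(x, y, groups))
```

With this grouping only two length-scales are estimated instead of six, while the per-group weights still let the model rank the importance of each source.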