bust the training scheme is to noisy labels. Therefore, instead of mixing the human-
labeled and automatically-labeled documents, it is better to treat the two types of
documents separately. That is, when using RankBoost, document pairs are con-
structed only within the same type of documents, and separate distributions are
maintained for the two types of document pairs. In this way, the two types of data
will contribute separately to the overall loss function (a parameter λ is used to trade
off the losses corresponding to the different types of documents). It has been proven
that with such a treatment the training process still converges, and some of the nice
properties of the original RankBoost algorithm are inherited.
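As an illustration, the separate-distribution idea can be sketched in a simplified single-round form as follows. The function name, the pair sets, and the default λ below are illustrative assumptions, not details taken from the original paper:

```python
import numpy as np

def rankboost_round(scores, pairs_h, pairs_a, lam=0.5):
    """Illustrative RankBoost-style update in which human-labeled pairs
    (pairs_h) and automatically-labeled pairs (pairs_a) keep separate
    distributions, combined through the trade-off parameter lam.
    `scores` are the current model scores; a pair (i, j) means
    document i should rank above document j."""
    def pair_losses(pairs):
        # exponential pair loss exp(-(s_i - s_j)), as in RankBoost
        return np.array([np.exp(-(scores[i] - scores[j])) for i, j in pairs])

    loss_h = pair_losses(pairs_h)
    loss_a = pair_losses(pairs_a)

    # separate distributions: each pair set is normalized on its own,
    # so noisy automatic labels cannot distort the human-labeled weights
    dist_h = loss_h / loss_h.sum()
    dist_a = loss_a / loss_a.sum()

    # the overall objective trades off the two average losses via lam
    total_loss = lam * loss_h.mean() + (1 - lam) * loss_a.mean()
    return dist_h, dist_a, total_loss
```

Note how a pair that is already ranked correctly by a large margin receives a small loss and hence a small weight within its own distribution, while badly misranked pairs of either type receive more attention in the next round.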
The proposed method has been tested on a couple of bipartite ranking tasks,
with AUC (the area under the ROC curve) as the evaluation measure. The experimental
results show that the proposed approach improves the accuracy of bipartite ranking;
even when only a small amount of labeled data is available, good performance can be
achieved by making effective use of the unlabeled data.
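For reference, AUC on a bipartite ranking task equals the fraction of (positive, negative) document pairs in which the positive document receives the higher score, with ties counted as half. A minimal sketch of this pairwise computation:

```python
def auc(scores_pos, scores_neg):
    """AUC for bipartite ranking: the fraction of positive/negative
    pairs where the positive document is scored higher than the
    negative one (ties count 0.5). This pairwise statistic is
    equivalent to the area under the ROC curve."""
    correct = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(scores_pos) * len(scores_neg))
```

A perfect ranker that scores every positive document above every negative one attains AUC = 1, while random scoring yields 0.5 in expectation.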
8.2 Transductive Approach
In [2], a transductive approach is taken; the key idea is to automatically derive better
features using the unlabeled test data to improve the effectiveness of model training.
In particular, an unsupervised learning method (specifically, kernel PCA in [2]) is
applied to discover salient patterns in each list of retrieved test documents. In total
four different kernels are used: polynomial kernel, RBF kernel, diffusion kernel,
and linear kernel (in this specific case, kernel PCA becomes exactly PCA). The
training data are then projected onto the directions of these patterns and the resulting
numerical values are added as new features. The main assumption in this approach is
that this new training set (after projection) better characterizes the test data, and thus
should outperform the original training set when learning rank functions. RankBoost
[3] is then taken as an example algorithm to demonstrate the effectiveness of this approach.
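As a rough sketch of the linear-kernel case (where kernel PCA reduces to ordinary PCA), the feature-augmentation step might look as follows; the function name and array shapes are illustrative assumptions rather than details from [2]:

```python
import numpy as np

def augment_with_test_pca(X_train, X_test, k=2):
    """Linear-kernel sketch: discover the top-k principal directions
    among the (unlabeled) retrieved test documents of one query, then
    project the training documents onto those directions and append
    the projections as new features."""
    # center the test documents and take the top-k right singular
    # vectors of the centered matrix (i.e., the principal directions)
    mu = X_test.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_test - mu, full_matrices=False)
    directions = Vt[:k]                          # shape (k, n_features)

    # project the training data onto the test-derived directions and
    # append the projections as additional feature columns
    new_feats = (X_train - mu) @ directions.T    # shape (n_train, k)
    return np.hstack([X_train, new_feats])
```

The original feature columns are left intact; only the k test-derived columns are appended, so a ranker trained on the augmented data can still fall back on the original features when the discovered patterns are uninformative.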
Extensive experiments on the LETOR benchmark datasets (see Chap. 10) have
been conducted in [2] to test the effectiveness of the proposed transductive approach.
The experimental results show that the ranking performance can be improved by us-
ing the unlabeled data in this way. At the same time, detailed analyses have been
performed on some issues related to this semi-supervised learning process, e.g.,
whether the kernel PCA features can be interpreted, whether non-linear kernel PCA
helps, how the performance varies across queries, and what the computational com-
plexity is. The general conclusions are as follows.
• Kernel PCA features are in general difficult to interpret, and most of them have
little correlation with the original features.
• Non-linearity is important in most cases, but one should not expect non-linear
kernels to always outperform linear ones. The best strategy is to employ multiple