Information Technology Reference
Mutual information difference: Given a dataset, one can compute the mutual in-
formation (MI) to represent the dependency of terms in a query. Then the differ-
ence between the MI computed from the aforementioned two datasets is used as
Usage rate: If query terms appear in title and anchor text frequently, the corre-
sponding query does very possibly belong to the homepage finding task.
POS information: Some topic relevance task queries include a verb to explain
what the user wants to know. Therefore, by analyzing the part of speech informa-
tion, one can extract features for query classification.
With the above features, one can use their linear combination to perform query
After query classification, different algorithms and information are used for the
queries belonging to different categories. For the topic relevance task, the content
information (e.g., TF and IDF ) is emphasized; on the other hand, for the home-
page finding task, the link and URL information (e.g., PageRank, URL types (root,
subroot, path, or file)) is emphasized.
Experimental results show that in this way, the ranking performance can be im-
proved as compared to using a single ranking function.
7.2.2 K Nearest Neighbor-Based Approach
In [ 5 ], Geng et al. perform query-dependent ranking using the K Nearest Neighbor
In the method, the training queries are positioned into the query feature space
in which each query is represented by a feature vector. In the test process, given a
query, its k nearest training queries are retrieved and used to learn a ranking model.
Then this ranking model is used to rank the documents associated with the test
query. This method is named KNN Online in [ 5 ]. This idea has been tested with
Ranking SVM [ 7 , 8 ], and the experimental results show significant performance
improvements as compared to using one single ranking function and using query
classification-based query-dependent ranking. The explanations on this experimen-
tal finding are as follows: (i) in the method ranking for a test query is conducted
by leveraging the useful information of the similar training queries and avoiding
the negative effects from the dissimilar ones; (ii) 'soft' classification of queries is
carried out and similar queries are selected dynamically.
Despite the high ranking performance, one may have noticed a serious problem
with the aforementioned method. Its time complexity is too high to use in practice.
For each single test query, a KNN search plus a model training would be needed,
both of which are very costly. In order to tackle this efficiency problem, two alter-
native methods have been developed, namely KNN Offline-1 and KNN Offline-2,
which move the training offline and significantly reduce the test complexity.
KNN Offline-1 works in the following way (see Fig. 7.1 ). First, for each training
query q i , its k nearest neighbors, denoted as N k (q i ) , in the query feature space are