Chapter 12
Other Datasets
Abstract In this chapter, we introduce two new benchmark datasets, released by
Yahoo and Microsoft. These datasets originate from the training data used in
commercial search engines and are much larger than the LETOR datasets in terms of
both number of queries and number of documents per query.
12.1 Yahoo! Learning-to-Rank Challenge Datasets
Yahoo! Labs organized a learning-to-rank challenge from March 1 to May 31,
2010. Because learning to rank had become a very active research area, many
researchers participated in the challenge and tested their own algorithms. There
were 4,736 submissions from 1,055 teams, and the results of the challenge
were summarized at a workshop at the 27th International Conference on Machine
Learning (ICML 2010) in Haifa, Israel. The official website of the challenge is
http://learningtorankchallenge.yahoo.com/.
According to the website of the challenge, the datasets used in this challenge
come from web search ranking and are a subset of what Yahoo! uses to train its
ranking function.
There are two datasets for this challenge, a large one and a small one, each
corresponding to a different country. The two datasets are related but differ to
some extent. Each dataset is divided into three sets: training, validation, and test.
The statistics for these sets are shown in Tables 12.1 and 12.2.
The datasets consist of feature vectors extracted from query-URL pairs along
with relevance judgments. The relevance judgments can take five different values,
from 0 (irrelevant) to 4 (perfectly relevant). There are 700 features in total. Some
of them are defined in only one dataset, while others are defined in both.
When a feature is undefined for a dataset, its value is 0. All the features have been
normalized to the [0, 1] range. Only the feature values are released; the queries,
URLs, and feature descriptions are not disclosed, for the following reason. Feature
engineering is a critical component of any commercial search engine, so search
engine companies rarely disclose the features they use. Releasing the queries
and URLs would create a risk of reverse engineering of the features. This is a
reasonable consideration; however, it prevents information retrieval researchers
from studying which kinds of features are the most effective for learning to rank.
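To make the data layout described above concrete, the following is a minimal sketch of how one judged query-URL pair might be read into a dense 700-dimensional vector. The chapter does not specify the file format; the sketch assumes an SVMlight-style line of the form "label qid:<id> index:value ...", with 1-based feature indices. The names NUM_FEATURES and parse_line are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

NUM_FEATURES = 700  # total number of features stated in the text


def parse_line(line: str) -> Tuple[int, str, List[float]]:
    """Parse one judged query-URL pair.

    Assumed (not confirmed by the text) line format:
        <label> qid:<query id> <index>:<value> <index>:<value> ...
    with 1-based feature indices.
    """
    tokens = line.split()
    label = int(tokens[0])
    # Relevance judgments range from 0 (irrelevant) to 4 (perfectly relevant).
    if not 0 <= label <= 4:
        raise ValueError(f"relevance label out of range: {label}")
    if not tokens[1].startswith("qid:"):
        raise ValueError("expected a 'qid:<id>' token after the label")
    qid = tokens[1].split(":", 1)[1]

    # Features that are undefined (absent from the line) keep the value 0.
    features = [0.0] * NUM_FEATURES
    for token in tokens[2:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        value = float(value_text)
        if not 1 <= index <= NUM_FEATURES:
            raise ValueError(f"feature index out of range: {index}")
        # All features are normalized to the [0, 1] range.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature value outside [0, 1]: {value}")
        features[index - 1] = value
    return label, qid, features
```

For example, under these assumptions, parse_line("3 qid:17 1:0.5 700:1.0") would yield the label 3, the query identifier "17", and a vector whose first and last entries are set while the other 698 entries remain 0.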