There have been recent discussions of document sampling strategies for learning
to rank, such as [1]. Different sampling strategies may lead to different
training effectiveness; however, these strategies have not yet been applied to
the LETOR datasets.
10.4 Feature Extraction
In this section, we introduce the feature representation of documents in LETOR.
The following principles are used in the feature extraction process.
1. To cover as many classical features in information retrieval as possible.
2. To reproduce as many features proposed in recent SIGIR papers as possible,
which use the OHSUMED, “Gov”, or “Gov2” corpus for their experiments.
3. To conform to the settings in the original papers.
For the “Gov” corpus, 64 features are extracted for each query-document pair, as
shown in Table 10.2. Some of these features depend on both the query and the
document, some depend only on the document, and others depend only on the
query. In the table, q represents a query, which contains terms t_1, ..., t_M;
TF(t_i, d) denotes the number of occurrences of query term t_i in document d.
Note that if the feature is extracted from a stream (e.g., title or URL),
TF(t_i, d) means the number of occurrences of t_i in that stream.
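
As a small illustration of the TF(t_i, d) definition, the following sketch counts query-term occurrences separately in each stream of a document. The document, its two streams, the whitespace tokenization, and the function name term_frequencies are assumptions made for this example; they are not part of the LETOR tooling.

```python
from collections import Counter


def term_frequencies(query_terms, stream_tokens):
    """Return [TF(t_1, d), ..., TF(t_M, d)] for one stream of document d."""
    counts = Counter(stream_tokens)
    # Counter returns 0 for terms that do not occur in the stream.
    return [counts[t] for t in query_terms]


# Hypothetical document with two streams, tokenized on whitespace.
doc = {
    "title": "learning to rank for information retrieval".split(),
    "body": "rank documents then rank again after learning".split(),
}
query = ["learning", "rank"]  # t_1, t_2

# One TF vector per stream, since stream features count occurrences in that stream only.
tf_by_stream = {name: term_frequencies(query, tokens) for name, tokens in doc.items()}
```

Here tf_by_stream["body"] holds the counts of "learning" and "rank" in the body only, which matches the note that a stream-level feature uses occurrences within that stream.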
Table 10.2 contains many classical information retrieval features, such as term
frequency and BM25 [16]. It also contains many features extracted according to
recent SIGIR papers. For example, Topical PageRank and Topical HITS are computed
according to [10]; sitemap and hyperlink based score/feature propagations are
computed according to [17] and [13]; HostRank is computed according to [20]; and
the extracted title is generated according to [6]. For more details about the
features, please refer to the LETOR website.
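
To make one of the classical features concrete, here is a minimal sketch of a BM25 score for a single document (or stream). It uses one common textbook form of BM25 with a non-negative IDF; the exact variant and parameter values used to build LETOR may differ, and the names bm25, doc_freq, num_docs, avg_len, k1, and b are assumptions introduced for this example.

```python
import math


def bm25(query_terms, doc_tokens, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Score one document against a query with a common form of BM25.

    doc_freq maps a term to the number of documents containing it;
    num_docs is the corpus size; avg_len is the average document length.
    k1 and b are the usual free parameters (values here are assumed defaults).
    """
    doc_len = len(doc_tokens)
    score = 0.0
    for t in query_terms:
        tf = doc_tokens.count(t)  # TF(t, d)
        df = doc_freq.get(t, 0)
        if tf == 0 or df == 0:
            continue  # term contributes nothing
        # Non-negative IDF variant; other IDF forms are also in use.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length-normalized, saturating term-frequency component.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


# Hypothetical corpus statistics for a toy example.
doc_freq = {"rank": 2, "learning": 1}
once = bm25(["rank"], ["rank", "a", "b", "c", "d"], doc_freq, num_docs=10, avg_len=5)
twice = bm25(["rank"], ["rank", "rank", "b", "c", "d"], doc_freq, num_docs=10, avg_len=5)
absent = bm25(["rank"], ["a", "b"], doc_freq, num_docs=10, avg_len=5)
```

With equal document lengths, the document that contains the query term twice scores higher than the one that contains it once, and a document without the term scores zero.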
For the OHSUMED corpus, 45 features are extracted in total, as shown in
Table 10.3. In the table, |C| means the total number of documents in the corpus.
For more details of these features, please refer to the LETOR website.
For the “Gov2” corpus, 46 features are extracted as shown in Table 10.4. Again,
more details about these features can be found at the LETOR website.
10.5 Meta Information
In addition to the features, the following meta information has been provided in
LETOR:
Statistical information about the corpus, such as the total number of documents,
the number of streams, and the number of (unique) terms in each stream.