The LETOR Datasets - Learning to Rank for Information Retrieval

There have been some recent discussions on document sampling strategies for learning to rank, such as [1]. Different sampling strategies may lead to different effectiveness in training; however, these strategies have not yet been applied in the LETOR datasets.

10.4 Feature Extraction

In this section, we introduce the feature representation of documents in LETOR.

The following principles are used in the feature extraction process.

1. To cover as many classical features in information retrieval as possible.

2. To reproduce as many features proposed in recent SIGIR papers as possible,

which use the OHSUMED, “Gov”, or “Gov2” corpus for their experiments.

3. To conform to the settings in the original papers.

For the “Gov” corpus, 64 features are extracted for each query-document pair, as shown in Table 10.2. Some of these features depend on both the query and the document, some depend only on the document, and others depend only on the query. In the table, $q$ represents a query, which contains terms $t_1, \ldots, t_M$; $\mathrm{TF}(t_i, d)$ denotes the number of occurrences of query term $t_i$ in document $d$. Note that if the feature is extracted from a stream (e.g., title or URL), $\mathrm{TF}(t_i, d)$ means the number of occurrences of $t_i$ in that stream.
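The per-stream definition of TF can be illustrated with a minimal sketch. The stream names, the whitespace tokenization, and the helper `term_frequencies` are assumptions made for illustration only; they are not taken from the LETOR extraction code.

```python
from collections import Counter


def term_frequencies(query_terms, stream_tokens):
    """Return TF(t_i, d) for each query term, counted within one stream."""
    counts = Counter(stream_tokens)
    # Counter returns 0 for terms that do not occur in the stream.
    return [counts[t] for t in query_terms]


# Hypothetical document with two streams (names and text are illustrative).
doc = {
    "title": "learning to rank for information retrieval".split(),
    "body": "learning to rank uses features such as term frequency to rank documents".split(),
}
query = ["rank", "retrieval"]

# The same query yields a different TF vector for each stream.
tf_title = term_frequencies(query, doc["title"])
tf_body = term_frequencies(query, doc["body"])
```

Each stream therefore contributes its own feature values for the same query-document pair, which is one reason the number of features grows with the number of streams.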

From Table 10.2, we can find many classical information retrieval features, such as term frequency and BM25 [16]. At the same time, there are also many features extracted according to recent SIGIR papers. For example, Topical PageRank and Topical HITS are computed according to [10]; sitemap and hyperlink based score/feature propagations are computed according to [17] and [13]; HostRank is computed according to [20]; and extracted title is generated according to [6]. For more details about the features, please refer to the LETOR website http://research.microsoft.com/~LETOR/.
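As an illustration of one classical feature named above, the following is a minimal sketch of a BM25-style score. It uses a common textbook variant with smoothed IDF; the parameter values `k1=1.2` and `b=0.75`, the function name, and the toy statistics are assumptions. The exact BM25 variant and parameters used in LETOR are not specified in this section.

```python
import math


def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avg_len,
               k1=1.2, b=0.75):
    """Score one document (or stream) against a query with a BM25 variant.

    doc_freq maps a term to the number of documents containing it.
    k1 and b are assumed default values, not LETOR's settings.
    """
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        # Smoothed IDF; the "+ 1.0" keeps the value positive.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length-normalized term-frequency saturation.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


# Toy statistics (illustrative only).
doc_freq = {"rank": 2}
doc_a = ["rank", "rank", "x", "y"]
doc_b = ["rank", "x", "y", "z"]
score_a = bm25_score(["rank"], doc_a, doc_freq, num_docs=10, avg_len=4)
score_b = bm25_score(["rank"], doc_b, doc_freq, num_docs=10, avg_len=4)
```

With equal document lengths, the document containing the query term more often receives the higher score, though the gain from each extra occurrence shrinks as TF grows.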

For the OHSUMED corpus, 45 features are extracted in total, as shown in Table 10.3. In the table, $|C|$ means the total number of documents in the corpus. For more details of these features, please refer to the LETOR website.

For the “Gov2” corpus, 46 features are extracted, as shown in Table 10.4. Again, more details about these features can be found at the LETOR website.

10.5 Meta Information

In addition to the features, the following meta information has been provided in

LETOR.

• Statistical information about the corpus, such as the total number of documents, the number of streams, and the number of (unique) terms in each stream.
