Table 10.1 Number of queries in TREC Web track

Task                  TREC2003   TREC2004
Topic distillation    50         75
Homepage finding      150        75
Named page finding    150        75

10.2.1 The “Gov” Corpus and Six Query Sets
TREC 2003 and 2004 included a special track for web information retrieval, named
the Web track (see footnote 4). The track used the “Gov” corpus, which is based on a
January 2002 crawl of the .gov domain. The corpus contains 1,053,110 HTML documents
in total.
There are three search tasks in the Web track: topic distillation (TD), homepage
finding (HP), and named page finding (NP). Topic distillation aims to find a list of
entry points for good websites principally devoted to the topic. Homepage finding
aims to return the homepage that the query refers to. Named page finding aims to
return the page whose name is exactly identical to the query. Generally speaking,
there is only one answer per query for homepage finding and named page finding.
The numbers of queries in these three tasks are shown in Table 10.1. For ease of
reference, we denote the query sets in these two years as TD2003, TD2004, HP2003,
HP2004, NP2003, and NP2004, respectively.
Due to the large scale of the corpus, it is not feasible to check every document
and judge its relevance to a given query. The practice in TREC is as follows. Given
a query, only the “possibly” relevant documents, namely those ranked high in the
runs submitted by the participants, are selected for labeling. Human assessors then
judge whether each of these documents is really relevant. All other documents,
including those checked but not labeled as relevant by the assessors and those not
ranked high in any submitted run, are regarded as irrelevant in the evaluation
process [2].
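
The labeling practice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy run data, and the cutoff `depth` are hypothetical, since the text says only that documents "ranked high" in the submitted runs are selected and gives no specific cutoff.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Collect the top-`depth` documents of every submitted run for one query.

    `depth` is an assumed cutoff standing in for "ranked high".
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def relevance(doc_id: str, judged_relevant: Set[str]) -> int:
    """Return 1 only if an assessor labeled the document relevant.

    Documents judged non-relevant and documents never pooled both get 0.
    """
    return 1 if doc_id in judged_relevant else 0


# Toy runs for a single query (hypothetical document IDs).
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d3", "d5", "d1", "d6"],
}

pool = build_pool(runs, depth=2)

# Pretend the assessors found two of the pooled documents relevant.
judged_relevant = {"d1", "d5"}

# Labels used at evaluation time for a few documents.
labels = {d: relevance(d, judged_relevant) for d in ["d1", "d2", "d4", "d6"]}
```

Here `d2` was pooled and judged non-relevant, while `d4` and `d6` were never judged; all three receive the same label of 0, which is exactly the convention the paragraph describes.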
Many research papers [11, 13, 19, 20] have used the three tasks on the “Gov”
corpus as their experimental platform.
10.2.2 The OHSUMED Corpus
The OHSUMED corpus [5] is a subset of MEDLINE, a database on medical
publications. It consists of 348,566 records (out of over 7,000,000) from 270 medical
journals during the period 1987–1991. The fields of a record include title,
abstract, MeSH indexing terms, author, source, and publication type.
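
As a rough illustration of the record layout just listed, one way to hold a record in memory is sketched below. The class name, attribute names, types, and the helper `searchable_text` are assumptions made for this sketch; they do not reflect the corpus's actual file format, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OhsumedRecord:
    """One record with the six fields named in the text (names are assumed)."""
    title: str
    abstract: str
    mesh_terms: List[str] = field(default_factory=list)
    authors: List[str] = field(default_factory=list)
    source: str = ""
    publication_type: str = ""


def searchable_text(record: OhsumedRecord) -> str:
    """Join the textual fields a retrieval system might index (hypothetical helper)."""
    parts = [record.title, record.abstract, *record.mesh_terms]
    return " ".join(p for p in parts if p)


# Placeholder values only; not taken from the corpus.
example = OhsumedRecord(
    title="Example title",
    abstract="Example abstract.",
    mesh_terms=["Term A", "Term B"],
)
text = searchable_text(example)
```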
4 http://trec.nist.gov/tracks.html