Information Technology Reference
In-Depth Information
With query q as input, documents are ranked based on the query likelihood, or
the probability that the document's language model will generate the terms in the
query (i.e., P(q
|
d) ). By further assuming the independence between terms, one has
= i = 1 P(t i |
P(q
d) , if query q contains terms t 1 ,...,t M .
To learn the document's language model, a maximum likelihood method is used.
As in many maximum likelihood methods, the issue of smoothing the estimate is
critical. Usually a background language model estimated using the entire collection
is used for this purpose. Then, the document's language model can be constructed
as follows:
|
d)
λ) TF(t i ,d)
p(t i |
d)
=
( 1
LEN(d) +
λp (t i |
C),
(1.3)
where p(t i |
∈[
]
C) is the background language model for term t i , and λ
0 , 1
is a
smoothing factor.
There are many variants of LMIR, some of them even go beyond the query like-
lihood retrieval model (e.g., the models based on K-L divergence [ 92 ]). We would
not like to introduce more about it, and readers are encouraged to read the tutorial
authored by Zhai [ 90 ].
In addition to the above examples, many other models have also been proposed
to compute the relevance between a query and a document. Some of them [ 74 ]
take the proximity of the query terms into consideration, and some others consider
the relationship between documents in terms of content similarity [ 73 ], hyperlink
structure [ 67 ], website structure [ 60 ], and topic diversity [ 91 ].
1.2.1.2 Importance Ranking Models
In the literature of information retrieval, there are also many models that rank doc-
uments based on their own importance. We will take PageRank as an example for
illustration. This model is particularly applicable to Web search because it makes
use of the hyperlink structure of the Web for ranking.
PageRank uses the probability that a surfer randomly clicking on links will arrive
at a particular webpage to rank the webpages. In the general case, the PageRank
value for any page d u can be expressed as
PR(d v )
U(d v ) .
PR(d u )
=
(1.4)
d v B u
That is, the PageRank value for a page d u is dependent on the PageRank values
for each page d v out of the set B u (containing all pages linking to page d u ), divided
by U(d v ) , the number of outlinks from page d v .
To get a meaningful solution to ( 1.4 ), a smoothing term is introduced. When a
random surfer walks on the link graph, she/he does not necessarily always follow the
existing hyperlinks. There is a small probability that she/he will jump to any other
Search WWH ::




Custom Search