Information Technology Reference
In-Depth Information
from relevant documents to expand the query that have limited ambiguity, and so it
does extra processing compared to the okapi method, which simply averages the most
frequent words in the relevant documents. In comparison, Local Content Analysis
performs an operation similar in effect to tf-idf on the possibly relevant terms,
and so attempts, by virtue of weighting, to select only words $w$ that both appear
frequently with terms in query $q$ and have a low overall frequency ($\mathrm{idf}_w$) in all the
results.
The final method we will use is the heuristic method developed by Ponte (1998),
which we call ponte. Like lca, ponte ranks each word $w \in V$, but it does so
differently. Instead of taking a heuristic approach like Okapi or LCA, it takes a
probabilistic approach. Given a set of relevant documents $R \subseteq \mathcal{D}$, Ponte's approach
estimates the probability of each word $w \in V$ occurring in a relevant document,
$P(w \mid D)$, divided by the overall probability of the word occurring in the results, $P(w)$.
The Ponte approach then gives each $w \in V$ a score as given in (6.5) and expands
the query by using the $m$ most relevant words as ranked by their scores.
$$\mathrm{Ponte}(w; R) = \sum_{D \in R} \log \frac{P(w \mid D)}{P(w)} \qquad (6.5)$$
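The scoring and expansion step in (6.5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ponte's implementation: the function names `ponte_scores` and `expand_query` and the mixing weight `lam` are hypothetical, documents are assumed to be non-empty lists of tokens, $P(w)$ is taken as the relative frequency of $w$ over all result documents, and $P(w \mid D)$ is smoothed with $P(w)$ so that a word absent from $D$ does not produce $\log 0$. Skipping words that are already in the query is also an assumption.

```python
import math
from collections import Counter


def ponte_scores(relevant_docs, all_docs, lam=0.5):
    """Score each word w by the sum over D in R of log(P(w|D) / P(w))."""
    # P(w): relative frequency of w over all result documents.
    totals = Counter(w for doc in all_docs for w in doc)
    n_total = sum(totals.values())
    p_w = {w: c / n_total for w, c in totals.items()}

    scores = {w: 0.0 for w in p_w}
    for doc in relevant_docs:
        counts = Counter(doc)
        for w in p_w:
            # Assumption: mix the document estimate with P(w) to avoid log(0).
            p_wd = lam * counts[w] / len(doc) + (1 - lam) * p_w[w]
            scores[w] += math.log(p_wd / p_w[w])
    return scores


def expand_query(query, relevant_docs, all_docs, m=2):
    """Append the m highest-scoring words that are not already in the query."""
    scores = ponte_scores(relevant_docs, all_docs)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    new_terms = [w for w in ranked if w not in query][:m]
    return list(query) + new_terms


if __name__ == "__main__":
    docs = [
        ["jaguar", "car", "speed"],
        ["jaguar", "cat", "jungle"],
        ["car", "engine", "speed"],
    ]
    relevant = [docs[0], docs[2]]
    print(expand_query(["jaguar"], relevant, docs, m=2))
```

In this toy collection the words that are over-represented in the relevant documents relative to the full result set receive positive scores, while words concentrated in the non-relevant document receive negative ones.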
6.3.2 Language Models
6.3.2.1 Representation
Language modeling frameworks in information retrieval represent each document
as a language model given by an underlying multinomial probability distribution
of word occurrences. Thus, for each word $w \in V$ there is a value that gives how
likely an observation of word $w$ is given $D$, i.e. $P(w \mid u_D(v))$. The document model
distribution $u_D(v)$ is then estimated using the parameter $\varepsilon_D$, which allows a linear
interpolation that takes into account the background probability of observing $w$ in
the entire collection $C$. This is given in (6.6).
$$u_D(w) = \varepsilon_D \frac{n(w, D)}{|D|} + (1 - \varepsilon_D) \frac{n(w, C)}{\sum_{v \in V} n(v, C)} \qquad (6.6)$$
The parameter $\varepsilon_D$ balances the relative likelihood of the word as
observed in the given document $D$ against that of the word in the entire collection
of documents $C$. $|D|$ is the total number of words in document $D$, while $n(w, D)$ is the
frequency of word $w$ in document $D$. Further, $n(w, C)$ is the frequency of occurrence
of the word $w$ in the entire collection $C$, which is divided by the number of occurrences
of all words $v$ in collection $C$.
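As a concrete illustration of (6.6), the following minimal Python sketch computes the interpolated document model from raw token counts. The function name `document_model` and the default value of `eps` (standing in for $\varepsilon_D$) are hypothetical, and the sketch assumes that documents are non-empty token lists, that the document belongs to the collection, and that the vocabulary $V$ is the set of words seen in the collection.

```python
from collections import Counter


def document_model(doc, collection, eps=0.7):
    """Return u_D(w) for every word w in the vocabulary, following (6.6)."""
    doc_counts = Counter(doc)                                 # n(w, D)
    coll_counts = Counter(w for d in collection for w in d)   # n(w, C)
    coll_total = sum(coll_counts.values())                    # sum over v of n(v, C)
    doc_len = len(doc)                                        # |D|
    return {
        w: eps * doc_counts[w] / doc_len
        + (1 - eps) * coll_counts[w] / coll_total
        for w in coll_counts
    }


if __name__ == "__main__":
    collection = [["a", "b", "a"], ["b", "c"]]
    model = document_model(collection[0], collection, eps=0.5)
    for word, prob in sorted(model.items()):
        print(word, round(prob, 4))
```

Because both terms of the interpolation are probability distributions over the vocabulary, the resulting values sum to 1, and a word that never occurs in the document (here `c`) still receives a non-zero probability from the collection term.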
 