The Semantics of Search - Social Semantics: The Search for Meaning on the Web


from relevant documents to expand the query that have limited ambiguity, and so it does extra processing compared to the okapi method, which simply averages the most frequent words in the relevant documents. In comparison, Local Content Analysis performs an operation similar in effect to tf·idf on the possibly relevant terms, attempting by virtue of weighting to select only words $w$ that both appear frequently with terms in the query $q$ and have a low overall frequency ($\mathit{idf}_w$) in all the results.
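As a rough illustration of this kind of weighting, the sketch below scores each candidate word by how many retrieved documents it shares with a query term, multiplied by an inverse document frequency computed over the results. It is a simplified stand-in and not the published LCA formula; the function name `idf_weighted_cooccurrence`, the toy documents, and the choice of document-level co-occurrence are all assumptions made for the example.

```python
import math
from collections import Counter

def idf_weighted_cooccurrence(query_terms, docs):
    """Score candidate expansion words: co-occurrence with the query, damped by idf.

    docs is a list of token lists (the retrieved results).
    Simplified illustration only, not the LCA formula itself.
    """
    query = set(query_terms)
    n_docs = len(docs)
    doc_freq = Counter()  # number of documents containing each word
    cooccur = Counter()   # number of documents where the word appears with a query term
    for tokens in docs:
        words = set(tokens)
        doc_freq.update(words)
        if words & query:
            cooccur.update(words - query)
    # Words found in every result get idf = log(1) = 0 and drop out.
    return {w: cooccur[w] * math.log(n_docs / doc_freq[w]) for w in cooccur}

# Hypothetical toy results for the query "jaguar".
docs = [
    ["jaguar", "car", "engine", "the"],
    ["jaguar", "car", "dealer", "the"],
    ["jaguar", "cat", "jungle", "the"],
    ["python", "snake", "jungle", "the"],
]
scores = idf_weighted_cooccurrence(["jaguar"], docs)
```

In this toy data a word such as "the" co-occurs with the query often but appears in every result, so its weight is zero, while "car" co-occurs often and is comparatively rare.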

The final method we will use is the heuristic method developed by Ponte (1998), which we call ponte. Like lca, ponte ranks each word $w \in V$, but it does so differently. Instead of taking a heuristic approach like Okapi or LCA, it takes a probabilistic approach. Given a set of relevant documents $R \subseteq D$, Ponte's approach estimates the probability of each word $w \in V$ being in a relevant document, $P(w \mid D)$, divided by the overall probability of the word occurring in the results, $P(w)$. The Ponte approach then gives each $w \in V$ a score as given in (6.5) and expands the query by using the $m$ most relevant words as ranked by their scores.

$$\mathrm{Ponte}(w; R) = \sum_{D \in R} \log \frac{P(w \mid D)}{P(w)} \tag{6.5}$$
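The scoring and expansion steps can be sketched as follows. This is a minimal sketch under stated assumptions, not Ponte's implementation: the text does not say how the per-document probability is estimated, so the code interpolates the document frequency with the overall frequency using a hypothetical weight `lam` to keep the logarithm defined for words absent from a document. The names `ponte_scores` and `expand_query` and the toy documents are likewise invented for illustration.

```python
import math
from collections import Counter

def ponte_scores(relevant_docs, all_docs, lam=0.5):
    """Sum over relevant documents of log(P(w|D) / P(w)) for every word w.

    Documents are non-empty token lists. P(w) is the relative frequency of w
    across all results; P(w|D) is smoothed with P(w) (an assumption).
    """
    collection = Counter(t for doc in all_docs for t in doc)
    total = sum(collection.values())
    p_w = {w: c / total for w, c in collection.items()}
    scores = {}
    for w in p_w:
        score = 0.0
        for doc in relevant_docs:
            # Assumed smoothing: mix document and overall frequency.
            p_w_d = lam * doc.count(w) / len(doc) + (1 - lam) * p_w[w]
            score += math.log(p_w_d / p_w[w])
        scores[w] = score
    return scores

def expand_query(query_terms, relevant_docs, all_docs, m=2):
    """Append the m highest-scoring words not already in the query."""
    scores = ponte_scores(relevant_docs, all_docs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    new_terms = [w for w in ranked if w not in query_terms][:m]
    return list(query_terms) + new_terms

# Hypothetical toy data: four results, the first two judged relevant.
all_docs = [
    ["jaguar", "car", "engine", "the"],
    ["jaguar", "car", "dealer", "the"],
    ["jaguar", "cat", "jungle", "the"],
    ["python", "snake", "jungle", "the"],
]
relevant = all_docs[:2]
scores = ponte_scores(relevant, all_docs)
expanded = expand_query(["jaguar"], relevant, all_docs, m=1)
```

Words that are more likely in the relevant documents than in the results overall receive positive scores, words that are equally likely score zero, and words missing from the relevant documents score negatively.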

6.3.2 Language Models

6.3.2.1 Representation

Language modeling frameworks in information retrieval represent each document as a language model given by an underlying multinomial probability distribution of word occurrences. Thus, for each word $w \in V$ there is a value that gives how likely an observation of word $w$ is given $D$, i.e. $P(u_D(w))$. The document model distribution $u_D(w)$ is then estimated using the parameter $\varepsilon_D$, which allows a linear interpolation that takes into account the background probability of observing $w$ in the entire collection $C$. This is given in (6.6).

$$u_D(w) = \varepsilon_D \frac{n(w, D)}{|D|} + (1 - \varepsilon_D) \frac{n(w, C)}{\sum_{v \in V} n(v, C)} \tag{6.6}$$

The parameter $\varepsilon_D$ just takes into account the relative likelihood of the word as observed in the given document $D$ compared to the word given the entire collection of documents $C$. Here $|D|$ is the total number of words in document $D$, while $n(w, D)$ is the frequency of word $w$ in document $D$. Further, $n(w, C)$ is the frequency of occurrence of the word $w$ in the entire collection $C$, which is divided by the occurrence of all words $v$ in collection $C$.
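The interpolation in (6.6) can be sketched in a few lines. This is an illustrative sketch, assuming documents are non-empty lists of tokens and that the interpolation parameter is supplied by the caller; the function name `document_model`, the value 0.8, and the toy documents are assumptions, not values from the text.

```python
from collections import Counter

def document_model(doc, collection_counts, eps):
    """Return a function giving the interpolated probability of a word.

    doc: list of tokens for one document (assumed non-empty).
    collection_counts: Counter of word frequencies over the whole collection.
    eps: interpolation weight between document and collection estimates.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)                             # total words in the document
    coll_total = sum(collection_counts.values())   # occurrences of all words in the collection

    def u_d(w):
        doc_part = doc_counts[w] / doc_len
        background = collection_counts[w] / coll_total
        return eps * doc_part + (1 - eps) * background

    return u_d

# Hypothetical two-document collection.
docs = [["jaguar", "car", "car"], ["jaguar", "cat", "jungle"]]
collection_counts = Counter(t for d in docs for t in d)
u = document_model(docs[0], collection_counts, eps=0.8)
vocab_total = sum(u(w) for w in collection_counts)
```

Because both components are proper distributions over the vocabulary, the interpolated values sum to one, and a word such as "cat" that never occurs in the first document still receives a small non-zero probability from the background term.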
