performed better than rm relevance models (p < 0.05). The baseline for language
modeling was also fairly poor, with an average performance of 0.4284 (p < 0.05).
This was the 'best' baseline, again using an m of 10,000 for document models and
a cross entropy smoothing factor of 0.99. The general trends from the previous
experiment then held, except the smoothing factor was more moderate and the
difference between tf and rm was even more pronounced. However, the primary
difference worth noting was that the best performing tf language model
outperformed, if barely, the okapi (BM25 and inquery) vector model by a
relatively small but still significant margin of 0.0126. Statistically, the
difference was significant (p < 0.05).
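The exact estimation details of this baseline are not reproduced here, but a minimal sketch of such a language-model baseline, assuming Jelinek-Mercer-style linear interpolation with the collection model (using the 0.99 smoothing weight reported above) and ranking by cross entropy, might look like:

```python
import math
from collections import Counter

def smoothed_doc_model(doc_tokens, collection_model, lam=0.99):
    """Jelinek-Mercer-style smoothing: interpolate the document's
    maximum-likelihood model with the collection model using weight lam
    (0.99 as in the baseline reported above)."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    def p(word):
        p_doc = tf[word] / n if n else 0.0
        # small floor keeps log() finite for terms unseen anywhere
        return lam * p_doc + (1.0 - lam) * collection_model.get(word, 1e-9)
    return p

def cross_entropy(query_model, doc_model):
    """Cross entropy between the query model and a smoothed document
    model; documents are ranked by ascending cross entropy."""
    return -sum(q * math.log(doc_model(w)) for w, q in query_model.items())
```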
6.5.2.2 Discussion
Why is tf relevance modeling better than BM25 and inquery vector-space models
when using relevance feedback from the Semantic Web for hypertext search? The
high performance of BM25 and inquery has already been explained, and that
explanation of why document-based normalization leads to worse performance
still holds. Yet the rise in performance of tf language models seems odd.
However, it makes sense if one considers the nature of the data involved.
Recalling previous work (Halpin 2009a), two distinct conditions separate this
data-set from the more typical natural language samples encountered in TREC
(Hawking et al. 2000). In the case of using relevant hypertext results as
feedback for the Semantic Web, the relevant document model was constructed
from a very limited amount of messy hypertext data, containing many text
fragments, a large percentage of which came from irrelevant textual data such
as web-page navigation. In using the Semantic Web for relevance feedback,
however, these issues are reversed: the relevant document model is constructed
out of relatively pristine Semantic Web documents and compared against noisy
hypertext documents.
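To make the contrast concrete, here is a minimal sketch of the document-based length normalization at issue, using the standard Okapi BM25 term weight (the k1 and b defaults here are the usual textbook values, not necessarily those used in these experiments):

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, num_docs,
                     k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document. The b parameter
    controls document-length normalization: b=0 disables it entirely,
    b=1 normalizes fully by doc_len / avg_doc_len."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

With b near 1, term weights in long documents are pushed down sharply, which is the document-based normalization behavior identified above as harmful on this data.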
Rather shockingly, as the Semantic Web consists mostly of high-quality,
manually curated data from sources like DBpedia, the actual natural language
fragments found on the Semantic Web, such as Wikipedia abstracts, are much
better samples of natural language than those found in hypertext. Furthermore,
the distribution of 'natural' language terms extracted from RDF terms (such as
'sub class of' from rdfs:subClassOf), while often irregular, will either be
repeated very heavily or fall into the sparse long tail. These two conditions
can then be dealt with by the generative tf relevance models, since the long
tail of automatically generated words from RDF will blend into the long tail
of natural language terms, and the probabilistic model can properly 'dampen'
them without resorting to heuristic-driven non-linearities. Therefore, it is
on some level not surprising that even hypertext Web search results can be
improved by Semantic Web search results: when used with the right relevance
feedback parameters, the hypertext search engine is in essence being 'seeded'
with high-quality, accurate structured descriptions of the query's information
need for use in query expansion.
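A minimal sketch of this 'seeding' step, assuming an RM1-style relevance model estimated from top-ranked Semantic Web feedback documents (the function and its parameters are illustrative, not the exact procedure used in these experiments; doc_scores are assumed to be positive retrieval scores such as query likelihoods):

```python
from collections import Counter

def estimate_relevance_model(feedback_docs, doc_scores, top_terms=50):
    """Estimate p(w | R) as a score-weighted mixture of the feedback
    documents' term distributions (RM1-style), keeping only the
    highest-probability terms for query expansion."""
    p_w_r = Counter()
    total = sum(doc_scores)
    for tokens, score in zip(feedback_docs, doc_scores):
        n = len(tokens)
        weight = score / total
        for w, tf in Counter(tokens).items():
            p_w_r[w] += weight * tf / n
    # renormalize over the retained expansion terms
    top = dict(p_w_r.most_common(top_terms))
    z = sum(top.values())
    return {w: p / z for w, p in top.items()}
```

The highest-probability terms from the returned distribution would then be appended to, or interpolated with, the original hypertext query as expansion terms.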