509,659 queries were identified as either (fundamentally analog) people or places
by the named-entity recognizer, and we call these queries entity queries. Employing
WordNet to represent abstract concepts, we chose queries recognized by WordNet
that have both a hyponym and a hypernym in WordNet. This resulted in a more
restricted set of 16,698 queries that are supposed to be about abstract concepts realized
by multiple entities, which we call concept queries.
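The concept-query filter described above can be sketched as follows. This is only an illustration under stated assumptions: the small HYPERNYMS dictionary is a hypothetical stand-in for WordNet, and the function names are invented here rather than taken from the study. A query is kept only if its term has at least one broader term (hypernym) and at least one narrower term (hyponym).

```python
# Toy stand-in for WordNet: maps a term to its direct hypernyms (broader terms).
# These entries are illustrative assumptions, not data from the study.
HYPERNYMS = {
    "violin": ["bowed stringed instrument"],
    "stradivarius": ["violin"],
    "bowed stringed instrument": ["stringed instrument"],
    "entity": [],
}


def build_hyponyms(hypernyms):
    """Invert the hypernym map to obtain each term's direct hyponyms."""
    hyponyms = {term: [] for term in hypernyms}
    for term, parents in hypernyms.items():
        for parent in parents:
            hyponyms.setdefault(parent, []).append(term)
    return hyponyms


def is_concept_query(query, hypernyms, hyponyms):
    """True if the query term has both a hypernym and a hyponym."""
    term = query.strip().lower()
    return bool(hypernyms.get(term)) and bool(hyponyms.get(term))


def select_concept_queries(queries, hypernyms):
    """Keep only the queries that pass the hypernym-and-hyponym test."""
    hyponyms = build_hyponyms(hypernyms)
    return [q for q in queries if is_concept_query(q, hypernyms, hyponyms)]


queries = ["violin", "stradivarius", "charles darwin", "entity"]
concept_queries = select_concept_queries(queries, HYPERNYMS)
print(concept_queries)
```

In this toy taxonomy only 'violin' survives: 'stradivarius' has no narrower term, 'entity' has no broader term, and 'charles darwin' is not in the taxonomy at all.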
A sample entity query from our list would be 'charles darwin,' while a sample
concept query would be 'violin.' In our data set, hypertext search returns almost
entirely relevant results for both queries. The query 'charles darwin' gives results that
are entirely encyclopedia pages (Wikipedia, eHow, darwin-online.org.uk) and other
factual sources of information, while 'violin' returns eight out of ten factual pages,
with two results just being advertisements for violin makers. In contrast, on
the Semantic Web, the query 'charles darwin' had six relevant results, with the
rest being for places such as the city of Darwin and topics or products mentioning
Darwin. For 'violin,' only three results contain relevant factual data, with the rest being
the names of albums called 'Violin' and movies such as 'The Violin Maker.' From
inspection of entities with relevant results, it appears that the usual case for semantic
search is that DBpedia and WordNet overlap substantially in
the concepts to which they give URIs. For example, each has its own URI for
such concepts as 'violin' (http://dbpedia.org/resource/Violin vs. W3C WordNet's
synset-violin-noun-1). Likewise, most repetition of entity URIs comes
from WordNet and DBpedia, both of which have distinct URIs for famous people
like Charles Darwin. In many cases, these URIs do not appear at the top, but
in the second or third position, often with an irrelevant URI at the top. Lastly, much of
the RDF that is retrieved seems to contain little information, with DBpedia and
WordNet being the richest sources of information.
The results of running the selected queries against a Semantic Web search engine,
FALCON-S's Object Search (Cheng et al. 2008), were surprisingly fruitful. For
entity queries, there was an average of 1,339 URIs (S.D. 8,000) returned for each
query. On the other hand, for concept queries, there was an average of 26,294 URIs
(S.D. 141,580) returned per query, with no queries returning zero documents. Such
a high standard deviation in comparison to the average is a sure sign of a non-normal
distribution such as a power-law distribution, and normal statistics such as average
and standard deviation are not good characteristic measures of such distributions. As
shown in Fig. 6.1, when plotted in logarithmic space, both entity queries and concept
queries show a distribution that is heavily skewed towards a very large number of
high-frequency results, with a steep drop-off to almost zero results instead of the
characteristic long tail of a power law. For the vast majority of queries, far from
having no information, the Semantic Web of Linked Data appears to have too much
data, but for a minority of queries there is simply no data. This is likely the result
of Linked Data being released in large 'chunks' from data silos about specific
topics rather than through the more organic development of the hypertext Web that typically
results in power-law distributions. Also, note that hypertext web pages are updated
to reflect trends and current events much more quickly than the relatively
slow-moving world of Linked Data.
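To make the point about averages and standard deviations concrete, the following minimal sketch summarizes a list of per-query result counts and buckets them by order of magnitude, which is roughly what plotting in logarithmic space reveals. The uri_counts values are invented for illustration and are not the study's data; the function names are likewise assumptions. The ratio of standard deviation to mean (the coefficient of variation) is about 6 for the reported entity-query figures (8,000 / 1,339), and values well above 1 indicate that the mean is a poor summary of a typical query.

```python
import math
import statistics
from collections import Counter


def summarize(counts):
    """Return mean, population standard deviation, median, and their ratio."""
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)
    return {
        "mean": mean,
        "sd": sd,
        "median": statistics.median(counts),
        # Coefficient of variation: S.D. relative to the mean.
        "cv": sd / mean if mean else float("nan"),
    }


def log10_histogram(counts):
    """Bucket counts by order of magnitude; bucket -1 holds zero-result queries."""
    hist = Counter()
    for c in counts:
        bucket = -1 if c == 0 else int(math.floor(math.log10(c)))
        hist[bucket] += 1
    return dict(sorted(hist.items()))


# Hypothetical URIs-returned-per-query values, invented for illustration only.
uri_counts = [0, 0, 3, 12, 40, 150, 900, 2500, 60000, 400000]
summary = summarize(uri_counts)
histogram = log10_histogram(uri_counts)
print(summary)
print(histogram)
```

On skewed data like this, the median sits far below the mean, which is one reason a median or a log-scale histogram is usually a more informative summary than the mean and standard deviation alone.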