is evaluated against 'real-world' deployed systems in Sect. 6.8. Finally, in Sect. 6.9,
future work on this particular system is detailed, and conclusions on the veracity of
our method of sense-making are given in Sect. 6.9.1.
6.2 Is There Anything Worth Finding on the Semantic Web?
In this section we demonstrate that the Semantic Web does indeed contain
information relevant to ordinary users by sampling the Semantic Web using real-world
queries, referring to entities and concepts, drawn from the query log of a major search
engine. The main problem confronting any study of the Semantic Web is one
of sampling. As almost any large database can easily be exported to RDF,
statistics demonstrating the actual deployment of the Semantic Web can be biased
by the automated release of large, if useless, data-sets, the equivalent of 'Semantic
Web' spam. Also, large specialized databases like Bio2RDF can easily dwarf the rest
of the Semantic Web in size. A more appropriate strategy would be to try to answer
the question: What information is available on the Semantic Web that users are
actually interested in? The first large-scale analysis of the Semantic Web was done
via an inspection of the index of Swoogle by Ding and Finin (2006). The primary
limitation of that study was that the large majority of the Semantic Web resources
sampled did not contain rich information that many people would find interesting.
For example, the vast majority of data on the Semantic Web in 2006 consisted of
LiveJournal's export of every user's profile as FOAF, and of RSS 1.0 data that used
Semantic Web techniques to structure the syntax of news feeds. Yet with information-rich and
interlinked databases like Wikipedia being exported to the Semantic Web, today
the Semantic Web may contain information needed by actual users. As there is no
agreed-upon way to sample the Semantic Web (or indeed the entire Web) in a fair
manner, we will create a sample for our evaluation driven by queries from real
users, using easily accessible search engines that claim to have a Web-scale index,
although independent verification of this claim is difficult if not impossible.
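
The contrast between counting everything that has been published and sampling only what users ask for can be illustrated with a small sketch. It is a toy under stated assumptions, not the procedure used in this study: the index contents, the data-set labels, the `toy_search` keyword matcher and the cut-off `k` are all hypothetical stand-ins for a real Semantic Web search engine and its results.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical toy index: URI -> (source data-set label, descriptive text).
# A bulk export dominates the raw counts, as a large specialized database would.
TOY_INDEX = {
    "http://example.org/bio/gene1": ("bulk-export", "gene 1 protein record"),
    "http://example.org/bio/gene2": ("bulk-export", "gene 2 protein record"),
    "http://example.org/bio/gene3": ("bulk-export", "gene 3 protein record"),
    "http://example.org/wiki/Paris": ("encyclopedia", "paris capital of france"),
    "http://example.org/wiki/Eiffel_Tower": ("encyclopedia", "eiffel tower in paris"),
}


def toy_search(query: str, k: int) -> List[str]:
    """Stand-in for a search engine: return up to k URIs sharing a term with the query."""
    terms = set(query.lower().split())
    hits = [uri for uri, (_, text) in TOY_INDEX.items() if terms & set(text.split())]
    return sorted(hits)[:k]


def query_driven_sample(queries: Iterable[str],
                        search: Callable[[str, int], List[str]],
                        k: int = 10) -> Set[str]:
    """Build a sample from the top-k results of each user query."""
    sample: Set[str] = set()
    for query in queries:
        sample.update(search(query, k))
    return sample


def share_by_source(uris: Iterable[str]) -> Dict[str, float]:
    """Fraction of the given URIs that comes from each source data-set."""
    counts = Counter(TOY_INDEX[uri][0] for uri in uris)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


user_queries = ["Paris", "eiffel tower"]  # hypothetical query-log entries
sample = query_driven_sample(user_queries, toy_search, k=10)
raw_shares = share_by_source(TOY_INDEX)   # what a size-based statistic would report
sample_shares = share_by_source(sample)   # what the query-driven sample contains
```

In this toy setting the bulk export accounts for most of the raw index, yet it is absent from the query-driven sample because no user query matches it. The same reasoning is why a query-driven sample is less sensitive to automatically released data-sets than statistics based on size alone.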
6.2.1 Inspecting the Semantic Web
To select real user queries for our experiment, we used the query log
of a popular hypertext search engine: the Web search query log of approximately
15 million distinct queries from Microsoft Live Search. This query log contained
6,623,635 unique queries after correcting for capitalization. The main issue in using a
query log is filtering out navigational and transactional queries. A straightforward
gazetteer-based and rule-based named entity recognizer was employed to discover
the names of people and places (Mikheev et al. 1998), based on a list of names
maintained by the Social Security Administration and a place name database
provided by the Alexandria Digital Library Project. From the query log a total of