Doc: number of documents using the language feature.
Dom: number of pay-level domains (i.e., sites) using the language feature.
However, raw counts do not reflect the reality that the use of an OWL feature
in one important ontology or vocabulary may often have greater practical
impact than use in a thousand obscure documents. Thus, we also look at the
prominence of use of different features. We use PageRank to quantify our notion
of prominence: PageRank calculates a variant of the eigenvector centrality of
nodes (e.g., documents) in a graph. Treating directed links as “positive votes”,
the resulting scores help characterise the relative prominence
(i.e., centrality) of particular documents on the Web [55,31].
In particular, we first rank documents in the corpus. To construct the graph,
we follow Linked Data principles and consider sources as nodes, where a directed
edge $(s_1, s_2) \in S \times S$ is extended from source $s_1$ to $s_2$ iff $\text{get}(s_1)$ contains (in
any triple position) a URI that dereferences to document $s_2$ (i.e., there exists
a $u \in \text{terms}(\text{get}(s_1))$ such that $\text{redirs}(u) = s_2$). We also prune edges to only
consider $(s_1, s_2)$ when $s_1$ and $s_2$ are non-empty sources in our corpus. We then
apply a standard PageRank analysis over the resulting directed graph, using
the power iteration method with ten iterations. For reasons of space, we refer
the interested reader to [55] for more detail on PageRank, and to the following
thesis [38] for more details on the particular algorithms used for this paper.
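
As a rough illustration of the two steps just described (edge construction with pruning, then ten rounds of power iteration), consider the following minimal Python sketch. It is not the implementation from [38]. The names `build_source_graph`, `pagerank`, `terms_by_source` and `redirs` are hypothetical, and both the damping factor of 0.85 and the uniform redistribution of rank from sources without out-links are assumptions that the text above does not specify.

```python
def build_source_graph(terms_by_source, redirs):
    """Build {s1: set of s2} following the edge definition in the text.

    terms_by_source: maps each non-empty source s1 to the URIs that
        appear in get(s1) (hypothetical input format).
    redirs: maps a URI u to the document it dereferences to
        (hypothetical lookup table).
    """
    sources = set(terms_by_source)
    graph = {s: set() for s in sources}
    for s1, uris in terms_by_source.items():
        for u in uris:
            s2 = redirs.get(u)
            # Prune edges whose target is not a source in the corpus.
            if s2 in sources:
                graph[s1].add(s2)
    return graph


def pagerank(graph, iterations=10, damping=0.85):
    """Power iteration over a {node: set of successors} graph.

    The damping value and the dangling-node handling are assumptions.
    """
    n = len(graph)
    ranks = {s: 1.0 / n for s in graph}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(ranks[s] for s, out in graph.items() if not out)
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {s: base for s in graph}
        for s1, out in graph.items():
            if out:
                share = damping * ranks[s1] / len(out)
                for s2 in out:
                    nxt[s2] += share
        ranks = nxt
    return ranks
```

With this formulation the scores sum to one after every iteration, which is what allows the summed scores for a set of documents to be read as a probability.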
With PageRank scores computed for all documents in the corpus, for each
RDFS and OWL language feature, we then present:
Rank: the sum of PageRank scores for documents in which the language
feature is used.
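
The three measures (Doc, Dom and Rank) could be aggregated per feature along the following lines. This is only a sketch under stated assumptions: `feature_measures`, `features_by_doc` and `naive_pld` are hypothetical names, and `naive_pld` simply takes the last two labels of the host name, whereas a faithful pay-level-domain extraction would need a public-suffix list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def naive_pld(doc_url):
    """Assumed stand-in for pay-level-domain extraction.

    Keeps the last two host labels, so it is wrong for suffixes
    such as co.uk; it is here only to make the sketch runnable.
    """
    host = urlparse(doc_url).hostname or ""
    return ".".join(host.split(".")[-2:])


def feature_measures(features_by_doc, ranks, pld_of=naive_pld):
    """Return {feature: {"Doc": int, "Dom": int, "Rank": float}}.

    features_by_doc: maps a document URL to the set of language
        features it uses (hypothetical input format).
    ranks: maps a document URL to its PageRank score.
    """
    docs_by_feature = defaultdict(set)
    for doc, features in features_by_doc.items():
        for feature in features:
            docs_by_feature[feature].add(doc)

    measures = {}
    for feature, docs in docs_by_feature.items():
        measures[feature] = {
            "Doc": len(docs),
            "Dom": len({pld_of(d) for d in docs}),
            # Documents without a score contribute nothing to Rank.
            "Rank": sum(ranks.get(d, 0.0) for d in docs),
        }
    return measures
```

Passing `pld_of` as a parameter keeps the domain-grouping rule separate from the counting logic, so a more accurate extractor can be swapped in without touching the aggregation.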
With respect to Rank , under the random surfer model of PageRank [55],
given an agent starting from a random location and traversing documents on
(our sample of) the Web of Data through randomly selected dereferenceable
URIs, the Rank value for a feature approximates the probability with which
that agent will be at a document using that feature after traversing ten links.
In other words, the score indicates the likelihood that an agent, operating over
the Web of Data by dereferencing URIs, encounters a given feature
during a random walk.
The graph extracted from the corpus consists of 7.411 million nodes and 198.6
million edges. Table 4 presents the top-10 ranked documents in our corpus, which
are dominated by core meta-vocabularies, documents linked therefrom, and other
popular vocabularies. 21
4.2 Survey of RDF(S)/OWL Features
Table 5 presents the results of the survey of RDF(S) and OWL usage in our
corpus, where for features with non-trivial semantics, we present the measures
21 We ran a similar analysis with links to and from core RDF(S) and OWL
vocabularies disabled. The results for the feature analysis remained similar; the
main change was that owl:sameAs dropped several positions in terms of the sum of PageRank.
 