LAW: Link-AWare Source Selection for Virtually Integrating Linked Data - Technologies and Applications of Artificial Intelligence


Web of Linked Data. Yet, so far, little attention has been paid to the effect
of links between datasets on federated querying.

In this paper, we present LAW, a link-aware approach to source selection
for federated querying over the Web of Data. We redefine the RDF graph as
the RDF triple link graph to reveal links between triples in a single dataset
or across multiple datasets. We also define basic graph patterns in SPARQL as
triple pattern link graphs to reveal links between triple patterns. To bridge
the gap between triple link graphs and triple pattern link graphs, we design a
special statistical model called the property link graph to approximate links
between real linked data. Moreover, LAW provides a distributed join execution
mechanism that minimises network traffic while executing selection plans.
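To make the idea of links between triple patterns concrete, the following is a minimal sketch, not the formal definition used by LAW. It assumes, purely for illustration, that two triple patterns in a basic graph pattern are linked when they share a variable. The function names and the example query terms (ex:Drug, ex:treats) are hypothetical.

```python
from itertools import combinations

# A triple pattern is a (subject, predicate, object) tuple of strings.
# Assumption for this sketch: variables are terms that start with "?".
def is_variable(term):
    return term.startswith("?")

def shared_variables(p1, p2):
    """Return the set of variables that occur in both triple patterns."""
    return {t for t in p1 if is_variable(t)} & {t for t in p2 if is_variable(t)}

def triple_pattern_link_graph(patterns):
    """Build an undirected graph over pattern indices.

    Each edge (i, j) with i < j is labelled with the variables
    shared by patterns i and j. Unlinked pairs get no edge.
    """
    edges = {}
    for (i, p1), (j, p2) in combinations(enumerate(patterns), 2):
        common = shared_variables(p1, p2)
        if common:
            edges[(i, j)] = common
    return edges

# Hypothetical basic graph pattern with three triple patterns.
bgp = [
    ("?drug", "rdf:type", "ex:Drug"),
    ("?drug", "ex:treats", "?disease"),
    ("?disease", "rdfs:label", "?name"),
]
links = triple_pattern_link_graph(bgp)
```

Here the first and second patterns are linked through ?drug and the second and third through ?disease, while the first and third share no variable and are therefore not directly linked.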

Our main contribution is threefold. (1) We formalize the RDF triple link graph
and the triple pattern link graph. (2) We propose an efficient approach to
source selection. (3) We perform a comprehensive simulation study based on a
real dataset to evaluate our approach.

The remainder of this paper is structured as follows. In Section 2 we review
related works. In Section 3 we present the background knowledge. Section 4
describes the statistical model. Source selection and the execution of selection
plans are presented in Section 5. An evaluation of our approach is given in
Section 6. Finally, we conclude and discuss future directions in Section 7.

2 Related Works

DARQ [8] extends the popular query processor Jena ARQ to an engine for
federated SPARQL queries. It requires users to explicitly supply a configuration
file, which enables the query engine to decompose a query into sub-queries and
optimize joins based on predicate selectivity. SemWIQ [6] requires that all
subjects be variables and that the type of each subject variable be explicitly
or implicitly defined. Additional information (another triple pattern or DL
constraints) is needed to determine the type of the subject of a triple pattern.
It uses this additional information and extensive RDF statistics to decompose
the original user query. DARQ [8] and SemWIQ [6] implicitly assume that RDF
triples are independent of each other: if the property or subject class of a
triple pattern is defined by a dataset, then that dataset is considered
relevant. FedX [10] also implicitly adopts the triple independence assumption.
It asks all known data sources, using the SPARQL ASK query form, whether they
contain matching data for each triple pattern in a user query. FedSearch [7] is
based on FedX and extends it with sophisticated static optimization strategies.
If the number of known data sources is very large (which is common in an open
setting), query performance may leave much to be desired. SPLENDID [5] relies on
the VoID descriptions available at remote data sources. However, a VoID
description is not an integral part of the Linked Data principles [1].
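The triple independence assumption discussed above can be illustrated with a small sketch of ASK-style source selection. This is not the FedX implementation: the sources are hypothetical in-memory sets of triples rather than remote SPARQL endpoints, and the helper names (matches, ask, select_sources) are assumptions made for this example.

```python
def matches(pattern, triple):
    """A triple matches a pattern if every non-variable term is equal.

    Simplification for this sketch: a variable repeated inside one
    pattern is not required to bind to the same value.
    """
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

def ask(source_triples, pattern):
    """Local stand-in for a SPARQL ASK query: is there at least one match?"""
    return any(matches(pattern, t) for t in source_triples)

def select_sources(sources, patterns):
    """For each pattern, independently list the sources that answer true."""
    return {
        pattern: sorted(name for name, triples in sources.items()
                        if ask(triples, pattern))
        for pattern in patterns
    }

# Hypothetical data sources, each holding a single triple.
sources = {
    "A": {("ex:aspirin", "ex:treats", "ex:headache")},
    "B": {("ex:headache", "rdfs:label", "Headache")},
    "C": {("ex:paris", "rdfs:label", "Paris")},
}
query = [
    ("?drug", "ex:treats", "?disease"),
    ("?disease", "rdfs:label", "?name"),
]
selection = select_sources(sources, query)
```

Because each triple pattern is checked in isolation, source C is selected for the second pattern even though its only rdfs:label triple cannot join with any ex:treats triple. Taking links between triples into account is intended to avoid contacting sources of this kind.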

In other cases, users are required to provide additional information to
determine the relevant data sources. For instance, [13] theoretically describes
a solution called Distributed SPARQL for distributed SPARQL query on the top
