application of the MapReduce paradigm for SPARQL. Our evaluations demonstrate
how dramatically PigSPARQL's optimizations reduce the amount of data to be
handled and the corresponding query processing time. Among these optimizations,
vertical partitioning [20] in particular has a strong influence on the overall
performance, but comes at the cost of a preprocessing step that has to be done once
in advance. However, we would like to stress that we also observed linear scalability
for query Q6, which can be highly problematic when not executed in a distributed
environment. This claim is justified by the observation that Q6 first has
to compute all publications of all authors before it can determine those
authors who have not published in the preceding years; hence, the query produces a large
amount of intermediate results. One further important advantage of PigSPARQL is
its simplicity in terms of usability. Other research approaches in this area often
do not provide their implementations at all, or these cannot be used out of the box
because they are either no longer maintained or merely unstable proof-of-concept
implementations. In contrast, PigSPARQL is ready to download and can be executed on
every Hadoop cluster with Apache Pig installed. Neither installation nor
configuration is required, as even the data loading and partitioning is done using an
included Pig Latin script. Moreover, as PigSPARQL translates SPARQL into Pig
Latin, it benefits from further developments of Apache Pig and stays compatible
with newer versions of Hadoop. Updating Apache Pig from version 0.5.0 to 0.10.0
improved our execution times in a range of 20% to 40% without changing a single
line of code.
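The vertical partitioning scheme [20] mentioned above can be sketched as follows: the triple table is split into one two-column (subject, object) table per predicate, so a triple pattern with a bound predicate only has to scan that predicate's partition. The function name and the sample data below are purely illustrative, not taken from the evaluation:

```python
# Minimal sketch of vertical partitioning for RDF: group the
# (subject, predicate, object) triples into one (subject, object)
# partition per predicate. All names here are hypothetical.
from collections import defaultdict

def vertically_partition(triples):
    """Split a list of (s, p, o) triples into per-predicate tables."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[p].append((s, o))
    return partitions

triples = [
    ("art1", "rdf:type", "bench:Article"),
    ("art1", "dc:creator", "pers1"),
    ("art2", "dc:creator", "pers1"),
]
parts = vertically_partition(triples)

# A pattern like "?pub dc:creator ?author" now reads only one partition
# instead of the whole triple table.
print(parts["dc:creator"])  # [('art1', 'pers1'), ('art2', 'pers1')]
```

This is why the one-time preprocessing step pays off: each partition is typically a small fraction of the full data set, which directly reduces the input size of most triple pattern scans.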
However, while the performance and scaling properties of PigSPARQL for complex
analytical queries are competitive, its performance for selective queries is not
satisfying. Reduce-side query execution requires transferring all data
that is going to be joined over the network, as the join computation is done
in the reduce phase. Selective queries in particular suffer from this, since
a large amount of unneeded data is processed that could be avoided by using more
sophisticated join techniques based on index structures. In the following sections
we describe an alternative join approach optimized for selective patterns where join
computation is done in the map phase by utilizing the NoSQL data store HBase as a
distributed index structure. While this approach retains the flexibility of commonly
used reduce-side joins, it leverages the effectiveness of map-side joins without any
changes to the underlying MapReduce framework. As we show in a further evalua-
tion in Section 5.7, MAPSIN can improve query performance for selective queries
by an order of magnitude compared with a classical reduce-side join execution, as
used for PigSPARQL.
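The idea behind such a map-side join can be sketched as follows (purely illustrative: `map_side_join` and the dict standing in for an HBase table lookup are assumptions, not the actual MAPSIN implementation). Only the bindings of the selective pattern are streamed; matching bindings for the other pattern are fetched from the index, so the large input is never shuffled:

```python
# Hypothetical sketch of a map-side join against a distributed index.
# A plain dict stands in for the HBase table; in the real setting each
# probe would be an index lookup executed in the map phase.

def map_side_join(selective_bindings, index, join_var):
    """Merge each binding of the selective input with all index
    entries that share the same value for join_var."""
    results = []
    for binding in selective_bindings:
        for match in index.get(binding[join_var], []):
            merged = dict(binding)
            merged.update(match)  # combine compatible variable bindings
            results.append(merged)
    return results

# The selective pattern yielded a single binding; the index covers
# the large input, keyed by the join variable's value.
selective = [{"?author": "pers1"}]
index = {
    "pers1": [{"?author": "pers1", "?pub": "art1"},
              {"?author": "pers1", "?pub": "art2"}],
    "pers2": [{"?author": "pers2", "?pub": "art3"}],
}
joined = map_side_join(selective, index, "?author")
print(len(joined))  # 2 -- only pers1's entries were ever touched
```

In contrast, a reduce-side join would have to shuffle every binding of both inputs (including all of pers2's publications) through the network before discarding the non-matching ones, which is exactly the overhead that hurts selective queries.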
5.5 RDF STORAGE SCHEMA FOR HBASE
Before introducing MAPSIN, we first have to discuss our RDF storage schema, which
enables the storage of arbitrary RDF graphs in HBase, as there is no straightforward
mapping from the RDF data model to the HBase data model. Therefore, we will
start the second part of this chapter with a short outline of HBase, followed by a
presentation of our RDF storage schema for HBase, which provides the required
preconditions for processing MAPSIN joins without the usage of a reduce phase. HBase