6.3.3 Query Processing Systems on Extended Hadoop Platforms
A variety of Hadoop-based platforms support extensions to the data flow specification layer, the data-processing layer, and the underlying storage models, as shown in Figure 6.3. Systems such as Apache Hive and Pig allow users to express data-processing tasks using high-level query primitives that are automatically compiled into low-level map and reduce functions. Works such as PigSPARQL translate a SPARQL query into Pig's high-level data flow language, Pig Latin, incorporating basic optimizations such as the early application of filters and projections and the rearrangement of triple patterns based on variable counting. Similar to SHARD, this translation produces one JOIN command in Pig for each join in the query. The query compilation process in Pig is described in detail in the next section. Extended platforms with indexed access methods have also been proposed to support efficient random access, which is expensive over HDFS. Hybrid database-Hadoop architectures such as HadoopDB [5] take advantage of available indexes as well as traditional database optimization techniques. HadoopDB employs data partitioning schemes that allow part of the query evaluation to be pushed into the database, thus reducing the required number of MapReduce cycles.
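The effect of such partitioning can be illustrated with a minimal sketch, assuming subject-based hash partitioning of RDF triples (the function names and data layout are illustrative, not HadoopDB's actual API). Because all triples sharing a subject land in the same partition, a subject-subject join can be evaluated entirely inside each partition's local store, and the coordinator only needs to union the partial results:

```python
# Illustrative sketch of partitioned pushdown: triples are hash-partitioned
# by subject, so a join on a shared subject variable runs locally in each
# partition, with no cross-partition shuffle.

def partition_by_subject(triples, n_partitions):
    """Assign each (s, p, o) triple to a partition by hashing its subject."""
    parts = [[] for _ in range(n_partitions)]
    for s, p, o in triples:
        parts[hash(s) % n_partitions].append((s, p, o))
    return parts

def local_subject_join(partition, pred1, pred2):
    """Join the patterns ?s pred1 ?x . ?s pred2 ?y inside one partition."""
    left = {}
    for s, p, o in partition:
        if p == pred1:
            left.setdefault(s, []).append(o)
    results = []
    for s, p, o in partition:
        if p == pred2:
            for x in left.get(s, []):
                results.append((s, x, o))
    return results

def evaluate(triples, pred1, pred2, n_partitions=4):
    # Each partition is joined locally ("pushed into the database");
    # the coordinator only unions the partial results.
    out = []
    for part in partition_by_subject(triples, n_partitions):
        out.extend(local_subject_join(part, pred1, pred2))
    return out
```

The result is independent of the number of partitions precisely because the partitioning key matches the join key; joins on other keys would still require a MapReduce shuffle.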
HadoopDB's storage and processing layer has also been extended [21] to include
RDF-3X [30], an RDF storage and retrieval system for SPARQL query support.
RDF data is partitioned across multiple nodes with some overlap, and most query processing is pushed into the single-node RDF-3X store on each node, which takes the place of the relational database used in the original HadoopDB. Other works [15,33,40] have proposed RDF storage models and
SPARQL query translation algorithms that exploit distributed databases such as
HBase [1]. The MAPSIN [40] join algorithm uses HBase to selectively retrieve mapping values for the variables in a graph pattern, avoiding the need for an expensive reduce phase to execute joins. H2RDF [33] indexes data in HBase together with a statistical profile of the data sets, and its query engine adaptively selects the most suitable join algorithm based on query selectivity and the inherent characteristics of MapReduce and HBase. EAGRE [48] introduces an RDF data representation and layout scheme to efficiently locate RDF triples that match a graph pattern.
Additionally, adaptive scheduling strategies and a consulting protocol are used to
evaluate queries in a way that minimizes disk and network I/O costs, as well as the
total execution time.
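Among the translation optimizations mentioned above, the variable-counting heuristic used in PigSPARQL-style translation is easy to sketch. The code below is an illustrative reconstruction, not PigSPARQL's actual implementation: triple patterns with fewer unbound variables are assumed to be more selective and are therefore evaluated first.

```python
# Illustrative sketch of triple-pattern reordering by variable counting.
# A term starting with '?' is an unbound variable; patterns with fewer
# variables are assumed more selective and are moved to the front.

def count_vars(pattern):
    """Count unbound variables in a (subject, predicate, object) pattern."""
    return sum(1 for term in pattern if term.startswith("?"))

def reorder_by_variable_counting(patterns):
    """Stable sort: most selective (fewest variables) patterns first."""
    return sorted(patterns, key=count_vars)

bgp = [
    ("?s", "?p", "?o"),             # 3 variables: least selective
    ("?s", "rdf:type", "?t"),       # 2 variables
    ("?s", "foaf:name", '"Alice"')  # 1 variable: most selective
]
# reorder_by_variable_counting(bgp) puts the 1-variable pattern first.
```

Evaluating the most selective pattern first shrinks the intermediate results that each subsequent JOIN in the generated Pig Latin script must process; real translators also take bound positions and available statistics into account.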
[Figure 6.3 depicts a three-layer stack: a dataflow specification and compilation layer (Pig, Hive); a data-processing layer (MapReduce, DBMS, HBase); and a data storage layer (Database, HDFS).]
FIGURE 6.3
Hadoop-based data processing platforms.
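The map-side join idea behind MAPSIN, discussed above, can also be sketched. In this illustrative reconstruction a plain dictionary stands in for an HBase table keyed by subject; the point is that compatible bindings are fetched by direct index lookups during the map phase, so no reduce-side shuffle is required.

```python
# Sketch of a MAPSIN-style map-side join (illustrative; a dict stands in
# for an HBase table). Each "row" is keyed by subject and maps a predicate
# to its list of objects, mimicking a column-family layout.

index = {
    "alice": {"knows": ["bob"], "age": ["30"]},
    "bob":   {"age": ["25"]},
}

def mapsin_join(subjects, pred1, pred2):
    """Join ?s pred1 ?x . ?s pred2 ?y via point lookups, map-side only."""
    results = []
    for s in subjects:
        row = index.get(s, {})       # one indexed lookup per input binding
        for x in row.get(pred1, []):
            for y in row.get(pred2, []):
                results.append((s, x, y))
    return results
```

In the real system the lookups are HBase reads issued from map tasks, and their selectivity is what makes skipping the reduce phase profitable.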
 
Search WWH ::




Custom Search