The MRQL system compiles MRQL queries to an algebra that is then translated into physical plans using cost-based optimizations. In particular, the query plans are represented as trees that are evaluated by a plan interpreter, where each physical operator is implemented as a single MapReduce job parameterized by the operator's functional parameters. The data fragmentation technique of MRQL is built on top of the general Hadoop XML input format, which is based on a single XML tag name. Hence, given a data split of an XML document, Hadoop's input format allows reading the document as a stream of string fragments, where each string contains a single complete element with the requested XML tag name.
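To make the fragmentation idea concrete, the following is a minimal Python sketch of tag-based fragmentation, not MRQL's or Hadoop's actual input format: it scans an in-memory split and emits each complete element with the requested tag name as one string fragment. The tag name and the sample document are illustrative.

def xml_fragments(split, tag):
    # Yield each complete <tag>...</tag> element in a text split as one
    # string fragment (tag-based XML fragmentation). For brevity, the
    # sketch matches the tag by prefix and assumes non-nested elements.
    open_tag, close_tag = "<" + tag, "</" + tag + ">"
    pos = 0
    while True:
        start = split.find(open_tag, pos)
        if start == -1:
            return
        end = split.find(close_tag, start)
        if end == -1:
            # Element continues past this split; a real record reader
            # would keep consuming bytes from the next split.
            return
        end += len(close_tag)
        yield split[start:end]
        pos = end

doc = "<dblp><article><t>A</t></article><article><t>B</t></article></dblp>"
for fragment in xml_fragments(doc, "article"):
    print(fragment)

A real input format operates over byte-range splits of large files and must handle elements that straddle split boundaries; the sketch reads the whole split as one string to sidestep this.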
ChuQL [80] is another language that has been proposed to support distributed XML processing on the MapReduce framework. It extends the syntax, grammar, and semantics of XQuery [21], the standard W3C language for querying XML documents, with MapReduce constructs. In particular, the ChuQL implementation takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions. Figure 2.18 illustrates the word count example program in ChuQL using its extended expressions, where the MapReduce expression describes a MapReduce job. The input and output clauses are used to read from and write to HDFS, respectively. The rr and rw clauses describe the record reader and record writer, respectively. The map and reduce clauses, specified using XQuery expressions, represent the standard map and reduce phases of the framework; they process XML values, or key/value pairs of XML values, to match the MapReduce model.
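For reference, the computation expressed by the ChuQL program in Figure 2.18 is the classic word count. The following is a minimal Python sketch of the same map/reduce structure, with plain functions standing in for the XQuery-defined map and reduce clauses; it is not ChuQL syntax.

from collections import defaultdict

def map_phase(records):
    # Map clause: tokenize each input record and emit (word, 1) pairs.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # Grouping plus reduce clause: sum the counts for each word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

records = ["to be or not to be", "to do"]
print(reduce_phase(map_phase(records)))
# {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}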
Several research efforts have been proposed for achieving scalable RDF processing using the MapReduce framework. PigSPARQL [118] is a system that processes SPARQL queries on the MapReduce framework by translating them into Pig Latin programs, each of which is executed as a series of MapReduce jobs on a Hadoop cluster. Myung et al. [105] have presented a preliminary algorithm for SPARQL graph pattern matching that adopts the traditional multiway join over the RDF triples and selects a good join key to avoid unnecessary iterations.
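At its core, such pattern matching joins the variable bindings of individual triple patterns on a shared variable. The following is a minimal Python sketch of one join step under simplifying assumptions (in-memory triples, a single shared variable as the join key); the data and the pattern are illustrative, and this is not the algorithm of [105].

def match(pattern, triples):
    # Yield the variable bindings of one triple pattern; terms starting
    # with '?' are variables, all other terms must match exactly.
    for triple in triples:
        binding, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

def join(left, right, key):
    # One multiway-join step: merge bindings that agree on the join key.
    by_key = {}
    for b in left:
        by_key.setdefault(b[key], []).append(b)
    for r in right:
        for l in by_key.get(r[key], []):
            yield {**l, **r}

triples = [
    ("alice", "rdf:type", "Person"),
    ("alice", "name", "Alice"),
    ("bob", "rdf:type", "Person"),
]
# Basic graph pattern: ?x rdf:type Person . ?x name ?n  (join key ?x)
people = match(("?x", "rdf:type", "Person"), triples)
names = match(("?x", "name", "?n"), triples)
print(list(join(people, names, "?x")))  # [{'?x': 'alice', '?n': 'Alice'}]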
Husain et al. [72] have described a storage scheme for RDF data using HDFS in which the input data are partitioned into multiple files in two main steps: (1) the Predicate Split (PS), which partitions the RDF triples according to their predicates; and (2) the Predicate Object Split (POS), which uses the explicit type information in the RDF triples to denote that a resource is an instance of a specific class, while the remaining predicate files are partitioned according to the type of their objects. Using summary statistics to estimate the selectivities of the join operations, the authors generate query plans that reduce the number of MapReduce jobs required to answer a SPARQL query.
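The two partitioning steps can be sketched directly. The following is a minimal Python sketch in which an in-memory list of (subject, predicate, object) triples stands in for HDFS files; the partition names are illustrative, not the file layout of [72].

from collections import defaultdict

def predicate_split(triples):
    # Step 1 (Predicate Split): partition the triples by predicate.
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[p].append((s, o))
    return parts

def predicate_object_split(parts):
    # Step 2 (Predicate Object Split): split the rdf:type partition by
    # class, and every remaining predicate partition by the type of its
    # objects (objects with no known type go into a literal bucket).
    type_of = dict(parts.get("rdf:type", []))  # resource -> class
    files = defaultdict(list)
    for s, cls in parts.get("rdf:type", []):
        files["type_" + cls].append(s)
    for p, pairs in parts.items():
        if p == "rdf:type":
            continue
        for s, o in pairs:
            files[p + "_" + type_of.get(o, "literal")].append((s, o))
    return dict(files)

triples = [
    ("alice", "rdf:type", "Person"),
    ("paper1", "rdf:type", "Article"),
    ("alice", "author_of", "paper1"),
    ("alice", "name", "Alice"),
]
print(predicate_object_split(predicate_split(triples)))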
FIGURE 2.18 The word count example program in ChuQL. (From S. Khatchadourian et al., Having a ChuQL at XML on the cloud, in AMW, 2011.)