Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


of optimization rules to merge multiple jobs, which otherwise would have been run independently without YSmart, into a common job. To that end, it provides a Common MapReduce Framework (CMF) that allows multiple types of jobs (e.g., a join job and an aggregation job) to be executed in a common job. In a query plan tree, YSmart detects three types of intraquery correlations, which are defined based on the key/value pair model of the MapReduce framework:

1. Input correlation: Multiple nodes have input correlation if their input relation sets are not disjoint.

2. Transit correlation: Multiple nodes have transit correlation if they have input correlation and the same partition key.

3. Job flow correlation: A node has job flow correlation with one of its child nodes if it has the same partition key as that child node.
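The three definitions above can be read as simple predicates over plan nodes. The following is a minimal sketch, not YSmart's implementation: the `PlanNode` class, its fields, and the example relation and key names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Hypothetical query-plan node; the fields are illustrative assumptions."""
    name: str
    input_relations: frozenset   # relations read by this node
    partition_key: str           # key used to partition the node's map output
    children: list = field(default_factory=list)


def has_input_correlation(a, b):
    # Input correlation: the input relation sets are not disjoint.
    return not a.input_relations.isdisjoint(b.input_relations)


def has_transit_correlation(a, b):
    # Transit correlation: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def job_flow_children(node):
    # Job flow correlation: children that share the parent's partition key.
    return [c for c in node.children if c.partition_key == node.partition_key]


# Made-up plan fragment: an aggregation and a join that both read "lineitem".
agg = PlanNode("agg_lineitem", frozenset({"lineitem"}), "l_partkey")
join = PlanNode("join_part", frozenset({"lineitem", "part"}), "l_partkey")
scan = PlanNode("scan_orders", frozenset({"orders"}), "o_orderkey")
top = PlanNode("final_agg", frozenset({"lineitem", "part"}), "l_partkey",
               children=[join, scan])

print(has_input_correlation(agg, join))
print(has_transit_correlation(agg, join))
print([c.name for c in job_flow_children(top)])
```

In this sketch, nodes that pass the transit check are the kind of candidates that could share a single scan and partitioning step, which is the intuition behind merging them into a common job.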

On the other hand, the HadoopToSQL system [74] has been presented as an SQL translator for MapReduce jobs. It relies on a static analysis component that uses symbolic execution to analyze the Java code of a MapReduce query and transforms queries to make use of SQL's indexing, aggregation, and grouping features. In particular, HadoopToSQL applies two algorithms that generate SQL code from MapReduce queries. The first algorithm extracts input set restrictions from MapReduce queries, and the second translates entire MapReduce queries into equivalent SQL queries. Both algorithms work by finding all control flow paths through the map and reduce functions, using symbolic execution to determine the behavior of each path, and then mapping this behavior onto possible SQL queries. This information is then used either to generate input restrictions, which avoid scanning the entire data set, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features. However, HadoopToSQL is reported to have difficulty analyzing MapReduce programs that contain loops or unknown method calls. It is also unable to analyze across multiple MapReduce instances.
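To make the two outputs concrete, the sketch below shows a small map/reduce pair next to the SQL that corresponds to it. It does not reproduce HadoopToSQL's analysis, which operates on Java code through symbolic execution; the SQL here is written by hand, and the `sales` table, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (name, country, amount).
rows = [("alice", "US", 30), ("bob", "DE", 12),
        ("carol", "US", 25), ("dave", "FR", 7)]


def map_fn(record):
    # Two control-flow paths: one emits a pair, the other emits nothing.
    name, country, amount = record
    if amount > 10:
        yield country, amount


def reduce_fn(key, values):
    return key, sum(values)


def run_mapreduce(records):
    # Toy in-memory MapReduce that visits every record it is given.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return sorted(reduce_fn(k, v) for k, v in groups.items())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, country TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_amount ON sales (amount)")

# Input restriction: fetch only the rows for which the map function emits.
restricted = conn.execute(
    "SELECT name, country, amount FROM sales WHERE amount > 10"
).fetchall()

# Full translation: the whole job expressed with WHERE, GROUP BY, and SUM.
sql_result = conn.execute(
    "SELECT country, SUM(amount) FROM sales "
    "WHERE amount > 10 GROUP BY country ORDER BY country"
).fetchall()

print(run_mapreduce(rows))
print(run_mapreduce(restricted))
print(sql_result)
```

The `WHERE amount > 10` predicate mirrors the only path through `map_fn` that emits output, so running the job on the restricted input gives the same result as running it on the full data set, while the database is free to use the index instead of a full scan.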

2.4.7 SQL/MapReduce

In general, a user-defined function (UDF) is a powerful database feature that allows users to customize database functionality. Friedman et al. [55] introduced the SQL/MapReduce (SQL/MR) UDF framework, which is designed to facilitate parallel computation of procedural functions across hundreds of servers working together as a single relational database. The framework is implemented as part of the Aster Data Systems* nCluster shared-nothing relational database. It leverages ideas from the MapReduce programming paradigm to provide users with a straightforward API through which they can implement a UDF in the language of their choice. Moreover, it allows maximum flexibility, as the output schema of the UDF is specified by the function itself at query-plan time. This means that a SQL/MR function is polymorphic: it can process arbitrary input because its behavior as well as its output schema are dynamically determined by information available at query

* http://www.asterdata.com/.
