of optimization rules to merge multiple jobs, which otherwise would have been run
independently without YSmart, into a common job. Therefore, it provides a Common
MapReduce Framework (CMF) that allows multiple types of jobs (e.g., a join job and
an aggregation job) to be executed in a common job. In a query plan tree, YSmart
detects three types of intraquery correlations, which are defined based on the key/value
pair model of the MapReduce framework:
1. Input correlation : Multiple nodes have input correlation if their input rela-
tion sets are not disjoint.
2. Transit correlation : Multiple nodes have transit correlation if they have the
input correlation and the same partition key.
3. Job flow correlation : A node has job flow correlation with one of its child
nodes if it has the same partition key as that child node.
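The three definitions above can be expressed as simple predicates over plan nodes. The following is a minimal sketch, not YSmart's implementation: the PlanNode class, its fields, and the example relation and key names are all hypothetical, chosen only to illustrate the definitions.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class PlanNode:
    """Hypothetical query plan node; field names are illustrative only."""
    name: str
    inputs: FrozenSet[str]      # input relation set of the node
    partition_key: str          # key used to partition the map output
    children: List["PlanNode"] = field(default_factory=list)


def has_input_correlation(a: PlanNode, b: PlanNode) -> bool:
    # Input correlation: the input relation sets are not disjoint.
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: PlanNode, b: PlanNode) -> bool:
    # Transit correlation: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def job_flow_correlated_children(node: PlanNode) -> List[PlanNode]:
    # Job flow correlation: a child shares the parent's partition key.
    return [c for c in node.children if c.partition_key == node.partition_key]


# Assumed example plan: an aggregation feeding a join, plus an unrelated node.
agg = PlanNode("agg", frozenset({"lineitem"}), "orderkey")
join = PlanNode("join", frozenset({"lineitem", "orders"}), "orderkey", [agg])
other = PlanNode("other", frozenset({"orders"}), "custkey")
```

In this assumed plan, agg and join share an input relation and a partition key, so they have both input and transit correlation, and join has job flow correlation with its child agg. The nodes join and other share an input relation but not a partition key, so they have input correlation only.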
On the other hand, the HadoopToSQL system [74] has been presented as an SQL
translator for MapReduce jobs. It relies on a static analysis component that uses sym-
bolic execution to analyze the Java code of a MapReduce query and transforms que-
ries to make use of SQL's indexing, aggregation, and grouping features. In particular,
HadoopToSQL applies two algorithms that generate SQL code from MapReduce
queries. The first algorithm can extract input set restrictions from MapReduce que-
ries and the other can translate entire MapReduce queries into equivalent SQL que-
ries. Both algorithms function by finding all control flow paths through map and
reduce functions, using symbolic execution to determine the behavior of each path,
and then mapping this behavior onto possible SQL queries. This information is then
used either to generate input restrictions, which avoid scanning the entire data set,
or to generate equivalent SQL queries, which take advantage of SQL grouping and
aggregation features. However, HadoopToSQL reportedly has difficulty analyzing
MapReduce programs that contain loops or unknown method calls. It is
also unable to analyze across multiple MapReduce instances.
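To make the idea of an "equivalent SQL query" concrete, the sketch below runs a toy map/reduce pair and a hand-written SQL query over the same records and compares the results. It does not perform symbolic execution and is not HadoopToSQL's algorithm; the table name t, the columns k and v, the threshold, and the sample rows are assumptions made for illustration. The condition inside the map function plays the role of an input set restriction (the WHERE clause), and the reduce function corresponds to SQL grouping and aggregation.

```python
import sqlite3
from collections import defaultdict

# Assumed sample input: (key, value) records.
rows = [("a", 5), ("b", 12), ("a", 20), ("b", 3), ("c", 15)]


def map_fn(key, value):
    # Only records with value > 10 produce output; an analyzer could
    # turn this branch condition into the restriction "v > 10".
    if value > 10:
        yield key, value


def reduce_fn(key, values):
    # Sums the values of each key, like SUM(v) ... GROUP BY k.
    return key, sum(values)


def run_mapreduce(records):
    # Toy in-memory MapReduce: map, group by key, then reduce.
    groups = defaultdict(list)
    for k, v in records:
        for out_k, out_v in map_fn(k, v):
            groups[out_k].append(out_v)
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())


def run_sql(records):
    # The hand-written SQL counterpart of map_fn and reduce_fn.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", records)
    result = conn.execute(
        "SELECT k, SUM(v) FROM t WHERE v > 10 GROUP BY k ORDER BY k"
    ).fetchall()
    conn.close()
    return result


mr_result = run_mapreduce(rows)
sql_result = run_sql(rows)
```

Both paths yield the same grouped sums, but the SQL version lets the database apply the restriction (and any index on v) before the data is read, rather than scanning every record in the map phase.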
2.4.7 SQL/MapReduce
In general, a user-defined function (UDF) is a powerful database feature that allows
users to customize database functionality. Friedman et al. [55] introduced the SQL/
MapReduce (SQL/MR) UDF framework which is designed to facilitate parallel
computation of procedural functions across hundreds of servers working together
as a single relational database. The framework is implemented as part of the Aster
Data Systems* nCluster shared-nothing relational database. The framework lever-
ages ideas from the MapReduce programming paradigm to provide users with a
straightforward API through which they can implement a UDF in the language of
their choice. Moreover, it allows maximum flexibility as the output schema of the
UDF is specified by the function itself at query plan-time. This means that a SQL/
MR function is polymorphic as it can process arbitrary input because its behavior as
well as output schema are dynamically determined by information available at query
* http://www.asterdata.com/.