Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

Fig. 9.14 Basic syntax of

SQL/MR query function

SELECT ...

FROM functionname(

ON table-or-query

[PARTITION BY expr, ...]

[ORDER BY expr, ...]

[clausename(arg, ...) ...]

)

together as a single relational database. The framework is implemented as part

of the Aster Data Systems [ 13 ] nCluster shared-nothing relational database. The

framework leverages ideas from the MapReduce programming paradigm to provide

users with a straightforward API through which they can implement a UDF in the

language of their choice. Moreover, it allows maximum flexibility as the output

schema of the UDF is specified by the function itself at query plan-time. This means

that a SQL/MR function is polymorphic as it can process arbitrary input because

its behavior as well as output schema are dynamically determined by information

available at query plan-time. This also increases reusability as the same SQL/MR

function can be used on inputs with many different schemas or with different user-

specified parameters. In particular, SQL/MR allows the user to write custom-defined

functions in any programming language and insert them into queries that leverage

traditional SQL functionality. A SQL/MR function is defined in a manner that is

similar to MapReduce's map and reduce functions.

The syntax for using a SQL/MR function is depicted in Fig. 9.14 where the

SQL/MR function invocation appears in the SQL FROM clause and consists of the

function name followed by a set of clauses that are enclosed in parentheses. The ON

clause specifies the input to the invocation of the SQL/MR function. It is important

to note that the input schema to the SQL/MR function is specified implicitly at query

plan-time in the form of the output schema for the query used in the ON clause.

In practice, a SQL/MR function can be either a mapper ( Row function) or a

reducer ( Partition function). The definitions of row and partition functions ensure

that they can be executed in parallel in a scalable manner. In the Row Function , each

row from the input table or query will be operated on by exactly one instance of

the SQL/MR function. Semantically, each row is processed independently, allowing

the execution engine to control parallelism. For each input row, the row function

may emit zero or more rows. In the Partition Function , each group of rows as

defined by the PARTITION BY clause will be operated on by exactly one instance

of the SQL/MR function. If the ORDER BY clause is provided, the rows within

each partition are provided to the function instance in the specified sort order.

Semantically, each partition is processed independently, allowing parallelization by

the execution engine at the level of a partition. For each input partition, the SQL/MR

partition function may output zero or more rows.

Search WWH ::

Custom Search

Home