Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

into SCOPE. Moreover, SCOPE expressions can use C# libraries where custom

C# classes can compute functions of scalar values, or manipulate whole rowsets.

A SCOPE script consists of a sequence of commands that are data transformation

operators that take one or more rowsets as input, perform some operation on the data

and output a rowset. Every rowset has a well-defined schema to which all its rows

must adhere. The SCOPE compiler parses the script, checks the syntax and resolves

names. The result of the compilation is an internal parse tree, which is then translated

to a physical execution plan. A physical execution plan is a specification of Cosmos

job, which describes a data flow DAG where each vertex is a program and each edge

represents a data channel. The translation into an execution plan is performed by

traversing the parse tree in a bottom-up manner. For each operator, SCOPE has an

associated default implementation rule. Many of the traditional optimization rules

from database systems are clearly also applicable in this new context, for exam-

ple, removing unnecessary columns, pushing down selection predicates and pre-

aggregating when possible. However, the highly distributed execution environment

offers new opportunities and challenges, making it necessary to explicitly consider

the effects of large-scale parallelism during optimization. For example, choosing

the right partition scheme and deciding when to partition are crucial for finding an

optimal plan. It is also important to correctly reason about partitioning, grouping,

and sorting properties and their interaction to avoid unnecessary computations [139].

2.6.2 D ryaD /D ryaD linQ

Dryad is a general-purpose distributed execution engine introduced by Microsoft

for coarse-grain data-parallel applications [73]. A Dryad application combines com-

putational vertices with communication channels to form a dataflow graph. Dryad

runs the application by executing the vertices of this graph on a set of available

computers, communicating as appropriate through files, TCP pipes and shared-

memory FIFOs. The Dryad system allows the developer fine-grained control over

the communication graph as well as the subroutines that live at its vertices. A Dryad

application developer can specify an arbitrary directed acyclic graph to describe

the applications communication patterns and express the data transport mechanisms

(files, TCP pipes, and shared-memory FIFOs) between the computation vertices.

This direct specification of the graph gives the developer greater flexibility to eas-

ily compose basic common operations, leading to a distributed analogue of piping

together traditional Unix utilities such as grep, sort, and head.

Dryad is notable for allowing graph vertices (and computations in general) to use

an arbitrary number of inputs and outputs, while MapReduce restricts all computa-

tions to take a single input set and generate a single output set. The overall structure

of a Dryad job is determined by its communication flow. A job is a directed acyclic

graph where each vertex is a program and edges represent data channels. It is a

logical computation graph that is automatically mapped onto physical resources by

the runtime. At runtime each channel is used to transport a finite sequence of struc-

tured items. A Dryad job is coordinated by a process called the job manager that

runs either within the cluster or on a user's workstation with network access to the

cluster. The job manager contains the application-specific code to construct the job's

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home