into SCOPE. Moreover, SCOPE expressions can use C# libraries where custom
C# classes can compute functions of scalar values, or manipulate whole rowsets.
A SCOPE script consists of a sequence of commands, each a data transformation
operator that takes one or more rowsets as input, performs some operation on the data,
and outputs a rowset. Every rowset has a well-defined schema to which all its rows
must adhere. The SCOPE compiler parses the script, checks the syntax and resolves
names. The result of the compilation is an internal parse tree, which is then translated
to a physical execution plan. A physical execution plan is a specification of a Cosmos
job, which describes a data flow DAG where each vertex is a program and each edge
represents a data channel. The translation into an execution plan is performed by
traversing the parse tree in a bottom-up manner. For each operator, SCOPE has an
associated default implementation rule. Many of the traditional optimization rules
from database systems are clearly also applicable in this new context, for example,
removing unnecessary columns, pushing down selection predicates, and
pre-aggregating when possible. However, the highly distributed execution environment
offers new opportunities and challenges, making it necessary to explicitly consider
the effects of large-scale parallelism during optimization. For example, choosing
the right partition scheme and deciding when to partition are crucial for finding an
optimal plan. It is also important to correctly reason about partitioning, grouping,
and sorting properties and their interaction to avoid unnecessary computations [139].
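
The bottom-up translation described above can be sketched in a few lines of
Python. This is only an illustration, not SCOPE's actual compiler: the Node
class, the operator names, and the DEFAULT_RULES table are all hypothetical,
and no optimization is applied. Each logical operator is simply mapped to one
assumed default physical operator after its children have been translated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator in a (hypothetical) parse tree."""
    op: str
    children: list = field(default_factory=list)


# Assumed default implementation rules: logical operator -> physical operator.
# These names are made up for illustration and are not SCOPE's real rules.
DEFAULT_RULES = {
    "EXTRACT": "FileScan",
    "SELECT": "Filter",
    "GROUPBY": "HashAggregate",
    "OUTPUT": "FileWrite",
}


def to_physical(node, plan=None):
    """Post-order walk: translate every child before its parent."""
    if plan is None:
        plan = []
    for child in node.children:
        to_physical(child, plan)
    plan.append(DEFAULT_RULES[node.op])
    return plan


# OUTPUT(GROUPBY(SELECT(EXTRACT))) written as a nested tree.
tree = Node("OUTPUT", [Node("GROUPBY", [Node("SELECT", [Node("EXTRACT")])])])
physical_plan = to_physical(tree)
print(physical_plan)
```

Because the walk is post-order, the scan at the leaf is emitted first and the
output operator last, which matches the direction in which data would flow
through the resulting plan.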
2.6.2 Dryad/DryadLINQ
Dryad is a general-purpose distributed execution engine introduced by Microsoft
for coarse-grain data-parallel applications [73]. A Dryad application combines
computational vertices with communication channels to form a dataflow graph. Dryad
runs the application by executing the vertices of this graph on a set of available
computers, communicating as appropriate through files, TCP pipes, and
shared-memory FIFOs. The Dryad system allows the developer fine-grained control over
the communication graph as well as the subroutines that live at its vertices. A Dryad
application developer can specify an arbitrary directed acyclic graph to describe
the application's communication patterns and express the data transport mechanisms
(files, TCP pipes, and shared-memory FIFOs) between the computation vertices.
This direct specification of the graph gives the developer greater flexibility to
easily compose basic common operations, leading to a distributed analogue of piping
together traditional Unix utilities such as grep, sort, and head.
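
The idea of composing simple vertices into a dataflow graph, much like piping
grep, sort, and head, can be sketched as follows. This is a single-process
stand-in rather than Dryad's API: the vertex functions, the vertices and
channels dictionaries, and run_job are hypothetical names, and in-memory lists
take the place of files, TCP pipes, and shared-memory FIFOs. Note that the
sort vertex reads from two input channels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def grep(inputs):
    """Keep only lines containing 'error' (one input channel)."""
    (lines,) = inputs
    return [line for line in lines if "error" in line]


def merge_sort(inputs):
    """Merge any number of input channels and sort the result."""
    return sorted(line for lines in inputs for line in lines)


def head(inputs):
    """Keep the first two lines (one input channel)."""
    (lines,) = inputs
    return lines[:2]


# Vertices: name -> program. The two sources are made-up sample data.
vertices = {
    "source_a": lambda inputs: ["error: disk", "ok"],
    "source_b": lambda inputs: ["error: cpu", "info", "error: net"],
    "grep_a": grep,
    "grep_b": grep,
    "sort": merge_sort,
    "head": head,
}

# Channels: name -> list of upstream vertices feeding it.
channels = {
    "source_a": [],
    "source_b": [],
    "grep_a": ["source_a"],
    "grep_b": ["source_b"],
    "sort": ["grep_a", "grep_b"],
    "head": ["sort"],
}


def run_job(vertices, channels):
    """Run each vertex once all of its upstream vertices have finished."""
    results = {}
    for name in TopologicalSorter(channels).static_order():
        inputs = [results[upstream] for upstream in channels[name]]
        results[name] = vertices[name](inputs)
    return results


results = run_job(vertices, channels)
print(results["head"])  # ['error: cpu', 'error: disk']
```

A real engine would schedule independent vertices such as grep_a and grep_b on
different machines at the same time; the topological order used here only
guarantees that every vertex runs after its inputs are available.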
Dryad is notable for allowing graph vertices (and computations in general) to use
an arbitrary number of inputs and outputs, while MapReduce restricts all
computations to take a single input set and generate a single output set. The overall structure
of a Dryad job is determined by its communication flow. A job is a directed acyclic
graph where each vertex is a program and edges represent data channels. It is a
logical computation graph that is automatically mapped onto physical resources by
the runtime. At runtime each channel is used to transport a finite sequence of
structured items. A Dryad job is coordinated by a process called the job manager that
runs either within the cluster or on a user's workstation with network access to the
cluster. The job manager contains the application-specific code to construct the job's