Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

communication graph along with library code to schedule the work across the avail-

able resources. All data is sent directly between vertices, and thus, the job manager

is only responsible for control decisions and is not a bottleneck for any data transfers.

Therefore, much of the simplicity of the Dryad scheduler and fault-tolerance model

come from the assumption that vertices are deterministic.

Dryad has its own high-level language called DryadLINQ [133]. It generalizes

execution environments such as SQL and MapReduce in two ways: (1) adopting an

expressive data model of strongly typed .NET objects and (2) supporting general-

purpose imperative and declarative operations on data sets within a traditional high-

level programming language. DryadLINQ* exploits LINQ (Language INtegrated

Quer y, † a set of .NET constructs for programming with data sets) to provide a pow-

erful hybrid of declarative and imperative programming. The system is designed to

provide flexible and efficient distributed computation in any LINQ-enabled program-

ming language including C#, VB, and F#. ‡ Objects in DryadLINQ data sets can be of

any .NET type, making it easy to compute with data such as image patches, vectors,

and matrices. In practice, a DryadLINQ program is a sequential program composed

of LINQ expressions that perform arbitrary side-effect-free transformations on data

sets and can be written and debugged using standard .NET development tools. The

DryadLINQ system automatically translates the data-parallel portions of the pro-

gram into a distributed execution plan, which is then passed to the Dryad execution

platform. Figure 2.22 illustrates the flow of execution of a DryadLINQ program

according to the following steps:

1. When a .NET user application runs, it creates a DryadLINQ expression object.

2. The application triggers a data-parallel execution where the expression

object is handed to DryadLINQ.

3. DryadLINQ compiles the LINQ expression into a distributed Dryad execu-

tion plan. In particular, it performs the following tasks:

a. Decomposing the expression into subexpressions where each expres-

sion can to be assigned to run in a separate Dryad vertex

b. Generating the code and static data for the remote Dryad vertices

c. Generating the serialization code for the required data types

4. DryadLINQ invokes a custom Dryad job manager.

5. The job manager creates the job graph and schedules the vertices as

resources become available.

6. Each Dryad vertex executes a vertex-specific program as created in Step 3(b).

7. When the Dryad job completes successfully it writes the data to the output

table(s).

8. The job manager process terminates and returns control back to DryadLINQ,

which creates objects encapsulating the outputs of the execution. These objects

may be used as inputs to subsequent expressions in the user program.

* http://research.microsoft.com/en-us/projects/dryadlinq/.

† http://msdn.microsoft.com/en-us/netframework/aa904594.aspx.

‡ http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/.

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home