Spark is implemented in the Scala programming language* [108]. It is built on top of Mesos [70], a cluster operating system that lets multiple parallel frameworks share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. Mesos provides isolation and efficient resource sharing across frameworks running on the same cluster while giving each framework the freedom to implement its own programming model and fully control the execution of its jobs. Mesos uses two main abstractions: tasks and slots. A task represents a unit of work. A slot represents a computing resource in which a framework may run a task, such as a core and some associated memory on a multicore machine. Mesos employs a two-level scheduling mechanism. At the first level, Mesos allocates slots among frameworks using fair sharing. At the second level, each framework is responsible for dividing its work into tasks and for selecting which tasks to run in each slot. This lets frameworks perform application-specific optimizations; for example, Spark's scheduler tries to send each task to one of its preferred locations using a technique called delay scheduling [134].
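To make the second-level choice concrete, the following is a minimal, hypothetical sketch of the delay-scheduling idea in Scala. It is not Mesos's or Spark's actual code; the Task class, the maxSkips parameter, and the assign method are illustrative assumptions. The idea is that when a free slot is not on a task's preferred node, the scheduler lets the task pass up a bounded number of scheduling opportunities before launching it non-locally.

case class Task(id: Int, preferredNode: String)

class DelaySchedulerSketch(maxSkips: Int) {
  // Number of scheduling opportunities each pending task has passed up so far.
  private val skips =
    scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Called when a slot on freeNode becomes available; returns the task
  // to launch in that slot, if any.
  def assign(freeNode: String, pending: Seq[Task]): Option[Task] = {
    // Prefer a task whose preferred location matches the free slot (data locality).
    pending.find(_.preferredNode == freeNode).orElse {
      // Otherwise, launch a task non-locally only after it has skipped
      // maxSkips opportunities; tasks that have waited less keep waiting.
      pending.find { t =>
        skips(t.id) += 1
        skips(t.id) > maxSkips
      }
    }
  }
}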
To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel. Spark provides two main abstractions for parallel programming: resilient distributed data sets (RDDs) and parallel operations on these data sets (invoked by passing a function to apply on a data set). In particular, each RDD is represented by a Scala object, which can be constructed in different ways (a code sketch follows the list below):
From a file in a shared file system (e.g., HDFS).
By parallelizing a Scala collection (e.g., an array) in the driver program, which
means dividing it into a number of slices that will be sent to multiple nodes.
By transforming an existing RDD. A data set with elements of type A can be transformed into a data set with elements of type B using an operation called flatMap.
By changing the persistence of an existing RDD. A user can alter an RDD's persistence through two actions:
The cache action leaves the data set lazy but hints that it should be kept in memory after the first time it is computed, because it will be reused.
The save action evaluates the data set and writes it to a distributed file system such as HDFS. The saved version is then used in future operations on that data set.
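The following Scala program sketches these construction methods using the classic SparkContext API. The file paths, application name, and slice count are illustrative assumptions, and the text's save action is rendered here as saveAsTextFile, one of the save-style output operations in the released Spark API.

import org.apache.spark.{SparkConf, SparkContext}

object RddConstructionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-construction").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. From a file in a shared file system such as HDFS.
    val lines = sc.textFile("hdfs://namenode/data/input.txt")

    // 2. By parallelizing a Scala collection in the driver program;
    //    the second argument is the number of slices sent to the nodes.
    val numbers = sc.parallelize(1 to 1000, 4)

    // 3. By transforming an existing RDD: flatMap maps each element of type A
    //    (here, a line) to zero or more elements of type B (here, words).
    val words = lines.flatMap(line => line.split(" "))

    // 4a. By changing persistence: cache leaves the data set lazy but keeps
    //     it in memory after it is first computed.
    val cachedWords = words.cache()

    // 4b. A save-style action evaluates the data set and writes it out.
    cachedWords.saveAsTextFile("hdfs://namenode/data/words")

    sc.stop()
  }
}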
Different parallel operations can be performed on RDDs (sketched in code after this list):
The reduce operation, which combines data set elements using an associative function to produce a result at the driver program.
The collect operation, which sends all elements of the data set to the driver program.
The foreach operation, which passes each element through a user-provided function.
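Continuing the hypothetical SparkContext example above, the following sketch exercises the three operations; the data set and printed output are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-operations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val numbers = sc.parallelize(1 to 100)

    // reduce: combines elements with an associative function and returns
    // the result to the driver program.
    val sum = numbers.reduce(_ + _)

    // collect: sends all elements of the data set back to the driver.
    val all: Array[Int] = numbers.collect()

    // foreach: passes each element through a user-provided function,
    // executed on the workers that hold the elements.
    numbers.foreach(n => println(n))

    println(s"sum = $sum, collected ${all.length} elements")
    sc.stop()
  }
}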
* http://www.scala-lang.org/.