Spark is implemented in the Scala programming language* [108]. It is built on top of Mesos [70], a cluster operating system that lets multiple parallel frameworks share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. Mesos provides isolation and efficient resource sharing across frameworks running on the same cluster while giving each framework the freedom to implement its own programming model and fully control the execution of its jobs. Mesos uses two main abstractions: tasks and slots. A task represents a unit of work. A slot represents a computing resource in which a framework may run a task, such as a core and some associated memory on a multicore machine. Mesos employs a two-level scheduling mechanism. At the first level, Mesos allocates slots among frameworks using fair sharing. At the second level, each framework is responsible for dividing its work into tasks and for selecting which tasks to run in each slot. This lets frameworks perform application-specific optimizations; for example, Spark's scheduler tries to send each task to one of its preferred locations using a technique called delay scheduling [134].
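To make the second-level choice concrete, the following is a minimal, hypothetical sketch of the delay-scheduling idea in Scala. It is not Mesos's or Spark's actual code; the Task class, the maxSkips parameter, and the assign method are illustrative assumptions. The idea is that when a free slot is not on a task's preferred node, the scheduler lets the task pass up a bounded number of scheduling opportunities before launching it non-locally.

case class Task(id: Int, preferredNode: String)

class DelaySchedulerSketch(maxSkips: Int) {
  // Number of scheduling opportunities each pending task has passed up so far.
  private val skips =
    scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Called when a slot on freeNode becomes available; returns the task
  // to launch in that slot, if any.
  def assign(freeNode: String, pending: Seq[Task]): Option[Task] = {
    // Prefer a task whose preferred location matches the free slot (data locality).
    pending.find(_.preferredNode == freeNode).orElse {
      // Otherwise, launch a task non-locally only after it has skipped
      // maxSkips opportunities; tasks that have waited less keep waiting.
      pending.find { t =>
        skips(t.id) += 1
        skips(t.id) > maxSkips
      }
    }
  }
}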
To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel. Spark provides two main abstractions for parallel programming: resilient distributed data sets (RDDs) and parallel operations on these data sets (invoked by passing a function to apply on a data set). In particular, each RDD is represented by a Scala object, which can be constructed in different ways (a code sketch follows the list below):
From a file in a shared file system (e.g., HDFS).
By parallelizing a Scala collection (e.g., an array) in the driver program, which
means dividing it into a number of slices that will be sent to multiple nodes.
By transforming an existing RDD. A data set with elements of type A can be transformed into a data set with elements of type B using an operation called flatMap.
By changing the persistence of an existing RDD. A user can alter an RDD's persistence through two actions:
The cache action leaves the data set lazy but hints that it should be kept in memory after the first time it is computed, because it will be reused.
The save action evaluates the data set and writes it to a distributed file system such as HDFS. The saved version is then used in future operations on that data set.
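The following Scala program sketches these construction methods using the classic SparkContext API. The file paths, application name, and slice count are illustrative assumptions, and the text's save action is rendered here as saveAsTextFile, one of the save-style output operations in the released Spark API.

import org.apache.spark.{SparkConf, SparkContext}

object RddConstructionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-construction").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. From a file in a shared file system such as HDFS.
    val lines = sc.textFile("hdfs://namenode/data/input.txt")

    // 2. By parallelizing a Scala collection in the driver program;
    //    the second argument is the number of slices sent to the nodes.
    val numbers = sc.parallelize(1 to 1000, 4)

    // 3. By transforming an existing RDD: flatMap maps each element of type A
    //    (here, a line) to zero or more elements of type B (here, words).
    val words = lines.flatMap(line => line.split(" "))

    // 4a. By changing persistence: cache leaves the data set lazy but keeps
    //     it in memory after it is first computed.
    val cachedWords = words.cache()

    // 4b. A save-style action evaluates the data set and writes it out.
    cachedWords.saveAsTextFile("hdfs://namenode/data/words")

    sc.stop()
  }
}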
Different parallel operations can be performed on RDDs (sketched in code after this list):
The reduce operation, which combines data set elements using an associative function to produce a result at the driver program.
The collect operation, which sends all elements of the data set to the driver program.
The foreach operation, which passes each element through a user-provided function.
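Continuing the hypothetical SparkContext example above, the following sketch exercises the three operations; the data set and printed output are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-operations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val numbers = sc.parallelize(1 to 100)

    // reduce: combines elements with an associative function and returns
    // the result to the driver program.
    val sum = numbers.reduce(_ + _)

    // collect: sends all elements of the data set back to the driver.
    val all: Array[Int] = numbers.collect()

    // foreach: passes each element through a user-provided function,
    // executed on the workers that hold the elements.
    numbers.foreach(n => println(n))

    println(s"sum = $sum, collected ${all.length} elements")
    sc.stop()
  }
}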
* http://www.scala-lang.org/.