Getting Up and Running with Spark - Machine Learning with Spark


Chapter 1. Getting Up and Running with Spark

Apache Spark is a framework for distributed computing that aims to make it simpler to
write programs that run in parallel across many nodes in a cluster of computers. It tries
to abstract the tasks of resource scheduling, job submission, execution, tracking, and
communication between nodes, as well as the low-level operations that are inherent in
parallel data processing. It also provides a higher-level API for working with distributed
data. In this way, it is similar to other distributed processing frameworks such as Apache
Hadoop; however, the underlying architecture is somewhat different.

Spark began as a research project at the University of California, Berkeley. The project
focused on the use case of distributed machine learning algorithms. Hence, Spark is
designed from the ground up for high performance in applications of an iterative nature,
where the same data is accessed multiple times. This performance is achieved primarily
through caching datasets in memory, combined with low latency and overhead to launch
parallel computation tasks. Together with other features such as fault tolerance, flexible
distributed-memory data structures, and a powerful functional API, Spark has proved to be
broadly useful for a wide range of large-scale data processing tasks, over and above
machine learning and iterative analytics.

Note

For more background on Spark, including the research papers underlying Spark's
development, see the project's history page at http://spark.apache.org/community.html#history.

Spark runs in four modes:

• The standalone local mode, where all Spark processes are run within the same Java Virtual Machine (JVM) process

• The standalone cluster mode, using Spark's own built-in job-scheduling framework

• Using Mesos, a popular open source cluster-computing framework

• Using YARN (commonly referred to as NextGen MapReduce), a Hadoop-related cluster-computing and resource-scheduling framework
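
In practice, the mode is selected through the master URL given to Spark when an application starts. The following is a minimal sketch in plain Python of how the four modes above map to master URL strings. The helper name master_url and the host names are hypothetical and used only for illustration; the URL formats and default ports (7077 for the standalone master, 5050 for Mesos) are the ones commonly documented for Spark. The plain "yarn" master applies to Spark 2.0 and later, since older releases used "yarn-client" and "yarn-cluster".

```python
from typing import Optional

# URL scheme and default port for the two modes that address a master by host.
# "standalone" here means Spark's own cluster manager (spark://...).
_HOSTED_MODES = {
    "standalone": ("spark", 7077),
    "mesos": ("mesos", 5050),
}


def master_url(mode: str,
               host: Optional[str] = None,
               port: Optional[int] = None,
               threads: Optional[int] = None) -> str:
    """Build a Spark master URL string for one of the four run modes.

    Hypothetical helper for illustration; it is not part of Spark itself.
    """
    if mode == "local":
        # local[N] runs Spark in a single JVM with N worker threads;
        # local[*] uses as many threads as there are logical cores.
        return "local[*]" if threads is None else f"local[{threads}]"
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration,
        # so no host or port appears in the URL.
        return "yarn"
    if mode in _HOSTED_MODES:
        if host is None:
            raise ValueError(f"mode '{mode}' requires a host")
        scheme, default_port = _HOSTED_MODES[mode]
        chosen_port = default_port if port is None else port
        return f"{scheme}://{host}:{chosen_port}"
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(master_url("local", threads=2))
    print(master_url("standalone", host="master-host"))
    print(master_url("mesos", host="mesos-host"))
    print(master_url("yarn"))
```

The resulting string is what would typically be passed to SparkConf().setMaster(...) in application code or to the --master option of spark-submit.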

In this chapter, we will:
