Spark Streaming provides an API for manipulating data streams that closely matches Spark Core's RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
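Spark Streaming models a stream as a sequence of micro-batches and applies the same RDD-style transformations to each one. The snippet below is a minimal pure-Python sketch of that idea on a single machine, not the Spark Streaming API itself; the names `word_counts` and `micro_batches` are illustrative.

```python
from collections import Counter

def word_counts(batch):
    """RDD-style transformation: split lines into words, then count by key."""
    words = [w for line in batch for w in line.split()]
    return Counter(words)

# A stream modeled as successive micro-batches of input lines.
micro_batches = [
    ["spark streaming", "spark core"],
    ["streaming api"],
]

# The same batch transformation is applied to each micro-batch as it arrives.
results = [word_counts(b) for b in micro_batches]
```

Because each micro-batch is processed with ordinary batch logic, the same function could be reused unchanged in a batch job, which is the API symmetry the text describes.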
MLlib
Spark comes with a library containing common machine learning (ML) functionality,
called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
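Gradient descent, the lower-level primitive mentioned above, minimizes a loss function by repeatedly stepping against its gradient. This is a single-machine sketch in plain Python with an assumed quadratic loss, just to show the mechanic; MLlib's version distributes the gradient computation across the cluster.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Generic gradient descent: repeatedly step opposite the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The update converges toward x = 3, the minimizer of the example loss; swapping in a different `grad` function optimizes a different model, which is why a generic primitive like this is useful.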
GraphX
GraphX is a library for manipulating graphs (e.g., a social network's friend graph)
and performing graph-parallel computations. Like Spark Streaming and Spark SQL,
GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
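PageRank, one of the common algorithms GraphX ships, iteratively redistributes each vertex's rank along its out-edges. Below is a minimal single-machine sketch in plain Python on an assumed toy graph, not the GraphX implementation; the damping factor of 0.85 is the conventional default.

```python
def pagerank(links, damping=0.85, iters=20):
    """links maps each vertex to the list of vertices it points to."""
    n = len(links)
    ranks = {v: 1.0 / n for v in links}
    for _ in range(iters):
        # Each vertex splits its rank evenly among its out-edges.
        contribs = {v: 0.0 for v in links}
        for v, outs in links.items():
            for u in outs:
                contribs[u] += ranks[v] / len(outs)
        # New rank: base share plus damped incoming contributions.
        ranks = {v: (1 - damping) / n + damping * c
                 for v, c in contribs.items()}
    return ranks

# Toy graph: A and B link to each other; C links only to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

Vertex A ends up with the highest rank (it has two in-links) and C the lowest (it has none), matching the intuition that rank flows along edges.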
Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many thousands
of compute nodes. To achieve this while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple
cluster manager included in Spark itself called the Standalone Scheduler. If you are
just installing Spark on an empty set of machines, the Standalone Scheduler provides
an easy way to get started; if you already have a Hadoop YARN or Mesos cluster,
however, Spark's support for these cluster managers allows your applications to also
run on them. Chapter 7 explores the different options and how to choose the correct
cluster manager.
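In practice the cluster manager is selected with spark-submit's --master flag. The host names and ports below are placeholders; this is a sketch of the documented URL schemes assuming default ports.

```shell
# Standalone Scheduler (the cluster manager included in Spark itself)
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py
```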
Who Uses Spark, and for What?
Because Spark is a general-purpose framework for cluster computing, it is used for a
diverse range of applications. In the Preface we outlined two groups of readers that
this book targets: data scientists and engineers. Let's take a closer look at each group
and how it uses Spark. Unsurprisingly, the typical use cases differ between the two,