Spark Streaming provides an API for manipulating data streams that closely matches Spark Core's RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
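Spark Streaming models a stream as a sequence of micro-batches and applies the same RDD-style transformations to each one. The snippet below is a minimal pure-Python sketch of that idea on a single machine, not the Spark Streaming API itself; the names `word_counts` and `micro_batches` are illustrative.

```python
from collections import Counter

def word_counts(batch):
    """RDD-style transformation: split lines into words, then count by key."""
    words = [w for line in batch for w in line.split()]
    return Counter(words)

# A stream modeled as successive micro-batches of input lines.
micro_batches = [
    ["spark streaming", "spark core"],
    ["streaming api"],
]

# The same batch transformation is applied to each micro-batch as it arrives.
results = [word_counts(b) for b in micro_batches]
```

Because each micro-batch is processed with ordinary batch logic, the same function could be reused unchanged in a batch job, which is the API symmetry the text describes.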
MLlib
Spark comes with a library containing common machine learning (ML) functionality,
called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
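Gradient descent, the lower-level primitive mentioned above, minimizes a loss function by repeatedly stepping against its gradient. This is a single-machine sketch in plain Python with an assumed quadratic loss, just to show the mechanic; MLlib's version distributes the gradient computation across the cluster.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Generic gradient descent: repeatedly step opposite the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The update converges toward x = 3, the minimizer of the example loss; swapping in a different `grad` function optimizes a different model, which is why a generic primitive like this is useful.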
GraphX
GraphX is a library for manipulating graphs (e.g., a social network's friend graph)
and performing graph-parallel computations. Like Spark Streaming and Spark SQL,
GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
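PageRank, one of the common algorithms GraphX ships, iteratively redistributes each vertex's rank along its out-edges. Below is a minimal single-machine sketch in plain Python on an assumed toy graph, not the GraphX implementation; the damping factor of 0.85 is the conventional default.

```python
def pagerank(links, damping=0.85, iters=20):
    """links maps each vertex to the list of vertices it points to."""
    n = len(links)
    ranks = {v: 1.0 / n for v in links}
    for _ in range(iters):
        # Each vertex splits its rank evenly among its out-edges.
        contribs = {v: 0.0 for v in links}
        for v, outs in links.items():
            for u in outs:
                contribs[u] += ranks[v] / len(outs)
        # New rank: base share plus damped incoming contributions.
        ranks = {v: (1 - damping) / n + damping * c
                 for v, c in contribs.items()}
    return ranks

# Toy graph: A and B link to each other; C links only to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

Vertex A ends up with the highest rank (it has two in-links) and C the lowest (it has none), matching the intuition that rank flows along edges.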
Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many thousands
of compute nodes. To achieve this while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple
cluster manager included in Spark itself called the Standalone Scheduler. If you are
just installing Spark on an empty set of machines, the Standalone Scheduler provides
an easy way to get started; if you already have a Hadoop YARN or Mesos cluster,
however, Spark's support for these cluster managers allows your applications to also
run on them. Chapter 7 explores the different options and how to choose the correct
cluster manager.
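In practice the cluster manager is selected with spark-submit's --master flag. The host names and ports below are placeholders; this is a sketch of the documented URL schemes assuming default ports.

```shell
# Standalone Scheduler (the cluster manager included in Spark itself)
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py
```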
Who Uses Spark, and for What?
Because Spark is a general-purpose framework for cluster computing, it is used for a
diverse range of applications. In the Preface we outlined two groups of readers that
this book targets: data scientists and engineers. Let's take a closer look at each group
and how it uses Spark. Unsurprisingly, the typical use cases differ between the two,