which creates a directory called output containing the partition files:

% cat output/part-*
(1950,22)
(1949,111)
The saveAsTextFile() method also triggers a Spark job. The main difference is that no value is returned; instead, the RDD is computed and its partitions are written to files in the output directory.
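As a hedged sketch of the distinction (assuming an existing SparkContext sc and illustrative paths, not the book's exact pipeline): an action like count() returns a value to the driver, while saveAsTextFile() returns nothing and writes one part file per partition.

// Both lines below trigger a Spark job over the same RDD.
val nums = sc.textFile("input.txt").map(_.trim.toInt)

val n = nums.count()            // action: job result comes back to the driver
nums.saveAsTextFile("output")   // action: no return value; each partition is
                                // written as output/part-NNNNN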
Spark Applications, Jobs, Stages, and Tasks
As we've seen in the example, like MapReduce, Spark has the concept of a job. A Spark job is more general than a MapReduce job, though, since it is made up of an arbitrary directed acyclic graph (DAG) of stages, each of which is roughly equivalent to a map or reduce phase in MapReduce.
Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across the cluster — just like tasks in MapReduce.
A job always runs in the context of an application (represented by a SparkContext instance) that serves to group RDDs and shared variables. An application can run more than one job, in series or in parallel, and provides the mechanism for a job to access an RDD that was cached by a previous job in the same application. (We will see how to cache RDDs in Persistence.) An interactive Spark session, such as a spark-shell session, is just an instance of an application.
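To make the application/job relationship concrete, here is a minimal sketch (assuming a spark-shell session, so sc already exists, and a hypothetical input path): two jobs run in the same application, and the second reuses the RDD cached by the first.

// One application (one SparkContext), two jobs sharing a cached RDD.
val lines = sc.textFile("input.txt").cache() // mark the RDD for caching

val total = lines.count()  // job 1: computes the RDD and caches its partitions
val sample = lines.first() // job 2: reads from the cached partitions

Caching itself is covered properly in Persistence; the point here is only that both jobs belong to the same application because they share one SparkContext.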
A Scala Standalone Application
After working with the Spark shell to refine a program, you may want to package it into a self-contained application that can be run more than once. The Scala program in Example 19-1 shows how to do this.
Example 19-1. Scala application to find the maximum temperature, using Spark
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object MaxTemperature {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Max Temperature")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .map(_.split("\t"))
      .filter(rec => (rec(1) != "9999" && rec(2).matches("[01459]")))