>>> pythonLines.first()
u'## Interactive Python Shell'
Example 2-5. Scala filtering example
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> val pythonLines = lines.filter(line => line.contains("Python"))
pythonLines: spark.RDD[String] = FilteredRDD[...]
scala> pythonLines.first()
res0: String = ## Interactive Python Shell
Passing Functions to Spark
If you are unfamiliar with the lambda or => syntax in Examples 2-4 and 2-5, it is a shorthand way to define functions inline in Python and Scala. When using Spark in these languages, you can also define a function separately and then pass its name to Spark. For example, in Python:
def hasPython(line):
    return "Python" in line

pythonLines = lines.filter(hasPython)
Passing functions to Spark is also possible in Java, but in this case they are defined as classes implementing an interface called Function. For example:
JavaRDD<String> pythonLines = lines.filter(
    new Function<String, Boolean>() {
        Boolean call(String line) { return line.contains("Python"); }
    }
);
Java 8 introduces shorthand syntax called lambdas that looks similar to Python and Scala. Here is how the code would look with this syntax:
JavaRDD<String> pythonLines = lines.filter(line -> line.contains("Python"));
We discuss passing functions further in "Passing Functions to Spark" on page 30.
While we will cover the Spark API in more detail later, a lot of its magic is that function-based operations like filter also parallelize across the cluster. That is, Spark automatically takes your function (e.g., line.contains("Python")) and ships it to executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes. Chapter 3 covers the RDD API in detail.
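To build intuition for how a shipped predicate parallelizes, here is a toy, single-process sketch (plain Python, not Spark's actual implementation): the data is split into partitions, each "executor" applies the driver's predicate to its own partition independently, and the results are combined. The partition contents below are made-up sample lines.

```python
# A predicate defined once in the "driver" program.
def has_python(line):
    return "Python" in line

# An "RDD" split into partitions that would live on different executors.
partitions = [
    ["## Interactive Python Shell", "high-level API"],
    ["Python docs", "cluster computing"],
]

# Each executor filters its own partition with the shipped predicate...
filtered_partitions = [
    [line for line in part if has_python(line)]
    for part in partitions
]

# ...and the driver sees the combined result, as if filter() ran locally.
python_lines = [line for part in filtered_partitions for line in part]
print(python_lines)  # ['## Interactive Python Shell', 'Python docs']
```

Because each partition is processed independently, the same predicate could run on many machines at once; real Spark additionally serializes the function (via cloudpickle in PySpark) before sending it to executors.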