Database Reference
In-Depth Information
whether a given function is a transformation or an action, you can look at its return
type: transformations return RDDs, whereas actions return some other data type.
Transformations
Transformations are operations on RDDs that return a new RDD. As discussed in
“Lazy Evaluation” on page 29
, transformed RDDs are computed lazily, only when you
use them in an action. Many transformations are
element-wise
; that is, they work on
one element at a time; but this is not true for all transformations.
As an example, suppose that we have a logfile,
log.txt
, with a number of messages,
and we want to select only the error messages. We can use the
filter()
transforma‐
tion seen before. This time, though, we'll show a filter in all three of Spark's language
APIs (Examples
3-11
through
3-13
).
Example 3-11. filter() transformation in Python
inputRDD
=
sc
.
textFile
(
"log.txt"
)
errorsRDD
=
inputRDD
.
filter
(
lambda
x
:
"error"
in
x
)
Example 3-12. filter() transformation in Scala
val
inputRDD
=
sc
.
textFile
(
"log.txt"
)
val
errorsRDD
=
inputRDD
.
filter
(
line
=>
line
.
contains
(
"error"
))
Example 3-13. filter() transformation in Java
JavaRDD
<
String
>
inputRDD
=
sc
.
textFile
(
"log.txt"
);
JavaRDD
<
String
>
errorsRDD
=
inputRDD
.
filter
(
new
Function
<
String
,
Boolean
>()
{
public
Boolean
call
(
String
x
)
{
return
x
.
contains
(
"error"
);
}
}
});
Note that the
filter()
operation does not mutate the existing
inputRDD
. Instead, it
returns a pointer to an entirely new RDD.
inputRDD
can still be reused later in the
program—for instance, to search for other words. In fact, let's use
inputRDD
again to
search for lines with the word
warning
in them. Then, we'll use another transforma‐
tion,
union()
, to print out the number of lines that contained either
error
or
warning
.
We show Python in
Example 3-14
, but the
union()
function is identical in all three
languages.