Database Reference
In-Depth Information
type
S
to type
T
, and a
PType<T>
instance to describe the output type
T
. (
PType
s are
explained in more detail in the section
Types
.)
The following code snippet shows how to use
parallelDo()
to apply a string length
function to a
PCollection
of strings:
PCollection
<
String
>
a
=
MemPipeline
.
collectionOf
(
"cherry"
,
"apple"
,
"banana"
);
PCollection
<
Integer
>
b
=
a
.
parallelDo
(
new
DoFn
<
String
,
Integer
>() {
@Override
public
void
process
(
String input
,
Emitter
<
Integer
>
emitter
) {
emitter
.
emit
(
input
.
length
());
}
},
ints
());
assertEquals
(
"{6,5,6}"
,
dump
(
b
));
In this case, the output
PCollection
of integers has the same number of elements as
the input, so we could have used the
MapFn
subclass of
DoFn
for 1:1 mappings:
PCollection
<
Integer
>
b
=
a
.
parallelDo
(
new
MapFn
<
String
,
Integer
>() {
@Override
public
Integer
map
(
String input
) {
return
input
.
length
();
}
},
ints
());
assertEquals
(
"{6,5,6}"
,
dump
(
b
));
One common use of
parallelDo()
is for filtering out data that is not needed in later
processing steps. Crunch provides a
filter()
method for this purpose that takes a spe-
cial
DoFn
called
FilterFn
. Implementors need only implement the
accept()
meth-
od to indicate whether an element should be in the output. For example, this code retains
only those strings with an even number of characters:
PCollection
<
String
>
b
=
a
.
filter
(
new
FilterFn
<
String
>() {
@Override
public
boolean
accept
(
String input
) {
return
input
.
length
() %
2
==
0
;
// even
}
});
assertEquals
(
"{cherry,banana}"
,
dump
(
b
));
Notice that there is no
PType
in the method signature for
filter()
, since the output
PCollection
has the same type as the input.