— for example, to test for convergence in an iterative algorithm (see
Iterative Al-
There are a few ways of materializing a PCollection; the most direct way to accomplish this is to call materialize(), which returns an Iterable collection of its values. If the PCollection has not already been materialized, then Crunch will have to run the pipeline to ensure that the objects in the PCollection have been computed and
Consider the following Crunch program for lowercasing lines in a text file:
    Pipeline pipeline = new MRPipeline(getClass());
    PCollection<String> lines = pipeline.readTextFile(inputPath);
    PCollection<String> lower = lines.parallelDo(new ToLowerFn(), strings());
    Iterable<String> materialized = lower.materialize();
    for (String s : materialized) { // pipeline is run
      System.out.println(s);
    }
    pipeline.done();
The lines from the text file are transformed using the ToLowerFn function, which is defined separately so we can use it again later:
    public class ToLowerFn extends DoFn<String, String> {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.toLowerCase());
      }
    }
The call to materialize() on the variable lower returns an Iterable<String>, but it is not this method call that causes the pipeline to be run. It is only once an Iterator is created from the Iterable (implicitly by the for each loop) that Crunch runs the pipeline. When the pipeline has completed, the iteration can proceed over the materialized PCollection, and in this example the lowercase lines are printed to the console.
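The deferred behavior described above can be illustrated without Crunch. The following is a minimal sketch in plain Java, not Crunch's implementation: LazyIterable, lowercaseLazily, and runCount are hypothetical names invented for this illustration, and the Supplier stands in for "run the pipeline". Caching the result after the first run is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

public class LazyMaterializeSketch {

  // Hypothetical stand-in for the Iterable returned by materialize():
  // the expensive job is deferred until iterator() is first called.
  static final class LazyIterable<T> implements Iterable<T> {
    private final Supplier<List<T>> job; // stands in for "run the pipeline"
    private List<T> results;             // null until the job has run
    private int runs = 0;                // number of times the job has executed

    LazyIterable(Supplier<List<T>> job) {
      this.job = job;
    }

    @Override
    public Iterator<T> iterator() {
      if (results == null) { // first request for an Iterator triggers the job
        results = job.get();
        runs++;
      }
      return results.iterator();
    }

    int runCount() {
      return runs;
    }
  }

  // Returns a lazy view; the lines are lowercased only when iteration starts.
  static LazyIterable<String> lowercaseLazily(List<String> lines) {
    return new LazyIterable<>(() -> {
      List<String> out = new ArrayList<>();
      for (String line : lines) {
        out.add(line.toLowerCase(Locale.ROOT));
      }
      return out;
    });
  }

  public static void main(String[] args) {
    LazyIterable<String> materialized =
        lowercaseLazily(Arrays.asList("Hello", "WORLD"));
    System.out.println("runs before loop: " + materialized.runCount());
    for (String s : materialized) { // iterator() is called here, so the job runs
      System.out.println(s);
    }
    System.out.println("runs after loop: " + materialized.runCount());
  }
}
```

Constructing the object does no work; the run counter only moves from 0 to 1 when the loop asks for an Iterator, which mirrors the point that the materialize() call itself does not run the pipeline.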
PTable has a materializeToMap() method, which might be expected to behave in a similar way to materialize(). However, there are two important differences. First, since it returns a Map<K, V> rather than an iterator, the whole table is loaded into memory at once, which should be avoided for large collections. Second, although a PTable is a multi-map, the Map interface does not support multiple values for a single key,