so if the table has multiple values for the same key, all but one will be lost in the returned Map. To avoid these limitations, simply call materialize() on the table in order to obtain an Iterable<Pair<K, V>>.
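The loss of duplicate keys can be illustrated without Crunch at all. The following is a minimal plain-Java sketch, not Crunch code: `DuplicateKeyDemo`, `samplePairs`, and `toMap` are hypothetical names, and `Map.Entry` stands in for Crunch's `Pair`. In this sketch the later value happens to win; the text above only says that all but one value is lost.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; Map.Entry is used as a stand-in for Crunch's Pair.
class DuplicateKeyDemo {
    // Sample table contents in which the key "a" appears twice.
    static List<Map.Entry<String, Integer>> samplePairs() {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", 1));
        pairs.add(new SimpleEntry<>("a", 2));
        pairs.add(new SimpleEntry<>("b", 3));
        return pairs;
    }

    // Copying pairs into a Map keeps only one value per key.
    static Map<String, Integer> toMap(Iterable<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            map.put(p.getKey(), p.getValue()); // a later value overwrites an earlier one
        }
        return map;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = samplePairs();
        System.out.println(pairs.size());        // 3: every pair is visible when iterating
        System.out.println(toMap(pairs).size()); // 2: one value for "a" was dropped
    }
}
```

Iterating over the pairs sees all three entries, whereas the map view retains only two.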
PObject
Another way to materialize a PCollection is to use PObjects. A PObject<T> is a future, a computation of a value of type T that may not have been completed at the time when the PObject is created in the running program. The computed value can be retrieved by calling getValue() on the PObject, which will block until the computation is completed (by running the Crunch pipeline) before returning the value.
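The idea of a future whose work is deferred until the value is requested can be sketched in plain Java. This is an analogy only, not how Crunch implements PObject: `LazyValue` and `isComputed` are hypothetical names, and caching the result after the first call is an assumption of this sketch.

```java
import java.util.function.Supplier;

// Hypothetical class for illustration; not part of Crunch.
class LazyValue<T> {
    private final Supplier<T> computation;
    private T value;
    private boolean computed = false;

    LazyValue(Supplier<T> computation) {
        this.computation = computation;
    }

    // Runs the deferred computation on the first call, then returns the cached result.
    synchronized T getValue() {
        if (!computed) {
            value = computation.get();
            computed = true;
        }
        return value;
    }

    // Reports whether the computation has already run.
    synchronized boolean isComputed() {
        return computed;
    }

    public static void main(String[] args) {
        LazyValue<Integer> total = new LazyValue<>(() -> 1 + 2 + 3);
        System.out.println(total.isComputed()); // false: nothing has run yet
        System.out.println(total.getValue());   // 6: the computation runs here
        System.out.println(total.isComputed()); // true
    }
}
```

Creating the object is cheap; the cost is paid only when `getValue()` is first called, which mirrors the way the pipeline run is triggered by the request for the value.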
Calling getValue() on a PObject is analogous to calling materialize() on a PCollection, since both calls will trigger execution of the pipeline to materialize the necessary collections. Indeed, we can rewrite the program to lowercase lines in a text file to use a PObject as follows:
Pipeline pipeline = new MRPipeline(getClass());
PCollection<String> lines = pipeline.readTextFile(inputPath);
PCollection<String> lower = lines.parallelDo(new ToLowerFn(), strings());
PObject<Collection<String>> po = lower.asCollection();
for (String s : po.getValue()) { // pipeline is run
  System.out.println(s);
}
pipeline.done();
The asCollection() method converts a PCollection<T> into a regular Java Collection<T>. This is done by way of a PObject, so that the conversion can be deferred to a later point in the program's execution if necessary. In this case, we call PObject's getValue() immediately after getting the PObject so that we can iterate over the resulting Collection.
WARNING
asCollection() will materialize all the objects in the PCollection into memory, so you should only call it on small PCollection instances, such as the results of a computation that contain only a few objects. There is no such restriction on the use of materialize(), which iterates over the collection, rather than holding the entire collection in memory at once.
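The difference between iterating over elements and holding them all in memory can be shown with a small plain-Java sketch. It does not use Crunch: `StreamingVsInMemory`, `range`, `sumStreaming`, and `collectAll` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical demo class; contrasts on-demand iteration with copying into a collection.
class StreamingVsInMemory {
    // An Iterable that produces the numbers 0..n-1 on demand, one at a time.
    static Iterable<Integer> range(int n) {
        return () -> new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < n;
            }

            @Override
            public Integer next() {
                if (next >= n) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    // Streaming: each element is consumed as it is produced and nothing is retained.
    static long sumStreaming(Iterable<Integer> items) {
        long sum = 0;
        for (int i : items) {
            sum += i;
        }
        return sum;
    }

    // In-memory: every element is copied into a list, so memory grows with the input.
    static List<Integer> collectAll(Iterable<Integer> items) {
        List<Integer> all = new ArrayList<>();
        for (int i : items) {
            all.add(i);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(sumStreaming(range(1000)));      // 499500
        System.out.println(collectAll(range(1000)).size()); // 1000
    }
}
```

The streaming version needs only constant extra memory regardless of the input size, while the collecting version needs space proportional to the number of elements, which is the trade-off the warning describes.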