Database Reference
In-Depth Information
TIP
If your
DoFn
significantly changes the size of the
PCollection
it is operating on, you can override
its
scaleFactor()
method to give a hint to the Crunch planner about the estimated relative size of
the output, which may improve its efficiency.
FilterFn
's
scaleFactor()
method returns 0.5; in other words, the assumption is that implementa-
tions will filter out about half of the elements in a
PCollection
. You can override this method if your
filter function is significantly more or less selective than this.
There is an overloaded form of
parallelDo()
for generating a
PTable
from a
PCollection
. Recall from the opening example that a
PTable<K, V>
is a multi-
map of key-value pairs; or, in the language of Java types,
PTable<K, V>
is a
PCol-
lection<Pair<K, V>>
, where
Pair<K, V>
is Crunch's pair class.
The following code creates a
PTable
by using a
DoFn
that turns an input string into a
key-value pair (the key is the length of the string, and the value is the string itself):
PTable
<
Integer
,
String
>
b
=
a
.
parallelDo
(
new
DoFn
<
String
,
Pair
<
Integer
,
String
>>() {
@Override
public
void
process
(
String input
,
Emitter
<
Pair
<
Integer
,
String
>>
emitter
) {
emitter
.
emit
(
Pair
.
of
(
input
.
length
(),
input
));
}
},
tableOf
(
ints
(),
strings
()));
assertEquals
(
"{(6,cherry),(5,apple),(6,banana)}"
,
dump
(
b
));
Extracting keys from a
PCollection
of values to form a
PTable
is a common enough
task that Crunch provides a method for it, called
by()
. This method takes a
MapFn<S,
K>
to map the input value
S
to its key
K
:
PTable
<
Integer
,
String
>
b
=
a
.
by
(
new
MapFn
<
String
,
Integer
>() {
@Override
public
Integer
map
(
String input
) {
return
input
.
length
();
}
},
ints
());
assertEquals
(
"{(6,cherry),(5,apple),(6,banana)}"
,
dump
(
b
));
groupByKey()
The third primitive operation is
groupByKey()
, for bringing together all the values in a
PTable<K, V>
that have the same key. This operation can be thought of as the MapRe-