Database Reference
In-Depth Information
TIP
If your DoFn significantly changes the size of the PCollection it is operating on, you can override
its scaleFactor() method to give a hint to the Crunch planner about the estimated relative size of
the output, which may improve its efficiency.
FilterFn 's scaleFactor() method returns 0.5; in other words, the assumption is that implementa-
tions will filter out about half of the elements in a PCollection . You can override this method if your
filter function is significantly more or less selective than this.
There is an overloaded form of parallelDo() for generating a PTable from a
PCollection . Recall from the opening example that a PTable<K, V> is a multi-
map of key-value pairs; or, in the language of Java types, PTable<K, V> is a PCol-
lection<Pair<K, V>> , where Pair<K, V> is Crunch's pair class.
The following code creates a PTable by using a DoFn that turns an input string into a
key-value pair (the key is the length of the string, and the value is the string itself):
PTable < Integer , String > b = a . parallelDo (
new DoFn < String , Pair < Integer , String >>() {
@Override
public void process ( String input , Emitter < Pair < Integer , String >>
emitter ) {
emitter . emit ( Pair . of ( input . length (), input ));
}
}, tableOf ( ints (), strings ()));
assertEquals ( "{(6,cherry),(5,apple),(6,banana)}" , dump ( b ));
Extracting keys from a PCollection of values to form a PTable is a common enough
task that Crunch provides a method for it, called by() . This method takes a MapFn<S,
K> to map the input value S to its key K :
PTable < Integer , String > b = a . by ( new MapFn < String , Integer >() {
@Override
public Integer map ( String input ) {
return input . length ();
}
}, ints ());
assertEquals ( "{(6,cherry),(5,apple),(6,banana)}" , dump ( b ));
groupByKey()
The third primitive operation is groupByKey() , for bringing together all the values in a
PTable<K, V> that have the same key. This operation can be thought of as the MapRe-
Search WWH ::




Custom Search