Crunch - Hadoop: The Definitive Guide

TIP

If your DoFn significantly changes the size of the PCollection it is operating on, you can override its scaleFactor() method to give a hint to the Crunch planner about the estimated relative size of the output, which may improve its efficiency.

FilterFn's scaleFactor() method returns 0.5; in other words, the assumption is that implementations will filter out about half of the elements in a PCollection. You can override this method if your filter function is significantly more or less selective than this.
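
The override described in this tip might look roughly like the following plain-Java sketch. It does not use Crunch itself: SketchFilterFn is a hypothetical stand-in that only mimics the shape of FilterFn (an accept() method plus a scaleFactor() hint that defaults to 0.5), an in-memory list stands in for a PCollection, and the 0.05 value is an assumed estimate for a filter that keeps very few elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScaleFactorSketch {

  // Hypothetical stand-in for Crunch's FilterFn; NOT the real class.
  public abstract static class SketchFilterFn<T> {
    public abstract boolean accept(T input);

    // Default hint: about half of the elements survive the filter.
    public float scaleFactor() {
      return 0.5f;
    }
  }

  // A highly selective filter: keeps only strings longer than 20 characters.
  public static class LongStringFilter extends SketchFilterFn<String> {
    @Override
    public boolean accept(String input) {
      return input.length() > 20;
    }

    // Assumed estimate: roughly 5% of the input is expected to pass.
    @Override
    public float scaleFactor() {
      return 0.05f;
    }
  }

  // Applies the filter to an in-memory list (standing in for a PCollection).
  public static <T> List<T> filter(List<T> input, SketchFilterFn<T> fn) {
    List<T> out = new ArrayList<>();
    for (T element : input) {
      if (fn.accept(element)) {
        out.add(element);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    SketchFilterFn<String> fn = new LongStringFilter();
    List<String> kept = filter(
        Arrays.asList("cherry", "apple", "a string longer than twenty chars"), fn);
    System.out.println(kept + " with size hint " + fn.scaleFactor());
  }
}
```

In this sketch the hint is just a number that a caller can read; how the real planner uses it is up to Crunch.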

There is an overloaded form of parallelDo() for generating a PTable from a PCollection. Recall from the opening example that a PTable<K, V> is a multi-map of key-value pairs; or, in the language of Java types, PTable<K, V> is a PCollection<Pair<K, V>>, where Pair<K, V> is Crunch's pair class.

The following code creates a PTable by using a DoFn that turns an input string into a key-value pair (the key is the length of the string, and the value is the string itself):

    PTable<Integer, String> b = a.parallelDo(
        new DoFn<String, Pair<Integer, String>>() {
      @Override
      public void process(String input, Emitter<Pair<Integer, String>> emitter) {
        // Key each string by its length.
        emitter.emit(Pair.of(input.length(), input));
      }
    }, tableOf(ints(), strings()));
    assertEquals("{(6,cherry),(5,apple),(6,banana)}", dump(b));
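
To make the "collection of pairs" view concrete, here is a minimal plain-Java sketch of the same idea without Crunch. SketchEmitter, SketchDoFn, and the parallelDo() helper are hypothetical stand-ins written for this illustration, not the Crunch types, and a list of Map.Entry objects plays the role of the PTable so that duplicate keys are allowed.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class PairEmitSketch {

  // Hypothetical stand-ins for Crunch's Emitter and DoFn; NOT the real types.
  public interface SketchEmitter<T> {
    void emit(T value);
  }

  public interface SketchDoFn<S, T> {
    void process(S input, SketchEmitter<T> emitter);
  }

  // Runs the function over every input and collects whatever it emits, in order.
  public static <S, T> List<T> parallelDo(List<S> input, SketchDoFn<S, T> fn) {
    List<T> out = new ArrayList<>();
    for (S element : input) {
      fn.process(element, out::add);
    }
    return out;
  }

  // Emits one (length, string) pair per input string.
  // A list of entries models a multi-map: the same key may appear more than once.
  public static List<Map.Entry<Integer, String>> keyByLength(List<String> input) {
    return parallelDo(input,
        (String s, SketchEmitter<Map.Entry<Integer, String>> emitter) ->
            emitter.emit(new SimpleEntry<>(s.length(), s)));
  }

  public static void main(String[] args) {
    System.out.println(keyByLength(Arrays.asList("cherry", "apple", "banana")));
  }
}
```

Note that the key 6 appears twice in the result, which is why a plain Map would not be an adequate model here.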

Extracting keys from a PCollection of values to form a PTable is a common enough task that Crunch provides a method for it, called by(). This method takes a MapFn<S, K> to map the input value S to its key K:

    PTable<Integer, String> b = a.by(new MapFn<String, Integer>() {
      @Override
      public Integer map(String input) {
        // The key is the length of the string; the value is the string itself.
        return input.length();
      }
    }, ints());
    assertEquals("{(6,cherry),(5,apple),(6,banana)}", dump(b));
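
The convenience that by() offers can be sketched in plain Java as a helper that pairs each value with the key extracted from it. The by() method below is a hypothetical illustration operating on in-memory lists, not the Crunch implementation; java.util.function.Function stands in for MapFn, and Map.Entry stands in for Crunch's Pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class BySketch {

  // Hypothetical helper mirroring the idea behind by():
  // pair each value with the key computed from it, preserving input order.
  public static <S, K> List<Map.Entry<K, S>> by(List<S> values, Function<S, K> keyFn) {
    List<Map.Entry<K, S>> table = new ArrayList<>();
    for (S value : values) {
      table.add(new SimpleEntry<>(keyFn.apply(value), value));
    }
    return table;
  }

  public static void main(String[] args) {
    List<String> a = Arrays.asList("cherry", "apple", "banana");
    // Only the key extraction is supplied; the helper builds the pairs.
    List<Map.Entry<Integer, String>> b = by(a, String::length);
    System.out.println(b);
  }
}
```

Compared with emitting the pairs by hand, the caller writes only the key-extraction function.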

groupByKey()

The third primitive operation is groupByKey(), for bringing together all the values in a PTable<K, V> that have the same key. This operation can be thought of as the MapRe-
