      }
      return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                     b2, s2 + firstL2, l2 - firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }
}

static {
  WritableComparator.define(TextPair.class, new Comparator());
}
We actually subclass WritableComparator rather than implementing RawComparator directly, since it provides some convenience methods and default implementations. The subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field in each byte stream. Each is made up of the length of the variable-length integer (returned by decodeVIntSize() on WritableUtils) and the value it is encoding (returned by readVInt()).
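Concretely, for a first Text field shorter than 128 bytes, the length vint occupies a single byte whose value is the field length itself, so the whole field spans one length byte plus the string bytes. The arithmetic can be sketched in plain Java; the two helpers below are simplified, single-byte-only stand-ins for Hadoop's WritableUtils.decodeVIntSize() and WritableComparator.readVInt(), not the real implementations:

```java
// Plain-Java sketch of the firstL1 calculation; covers only single-byte vints
// (field lengths 0-127).
public class FirstFieldLength {
    // Size in bytes of the vint whose first byte is 'first' (simplified stand-in).
    static int decodeVIntSize(byte first) {
        if (first >= -112) return 1; // single-byte encoding
        throw new UnsupportedOperationException("multi-byte vints not sketched");
    }

    // Value of the vint starting at 'start' (simplified stand-in, single-byte case).
    static int readVInt(byte[] bytes, int start) {
        if (bytes[start] >= -112) return bytes[start];
        throw new UnsupportedOperationException("multi-byte vints not sketched");
    }

    public static void main(String[] args) {
        // Serialized pair ("ab", "xyz"): [2, 'a', 'b', 3, 'x', 'y', 'z']
        byte[] b1 = {2, 'a', 'b', 3, 'x', 'y', 'z'};
        // Length of the first field: 1 length byte + 2 string bytes.
        int firstL1 = decodeVIntSize(b1[0]) + readVInt(b1, 0);
        System.out.println(firstL1); // 3
    }
}
```

Adding the vint's own size to the value it encodes is what lets the comparator skip past the first field to find the second one without deserializing anything.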
The static block registers the raw comparator so that whenever MapReduce sees the TextPair class, it knows to use the raw comparator as its default comparator.
Custom comparators
As you can see with TextPair, writing raw comparators takes some care because you have to deal with details at the byte level. It is worth looking at some of the implementations of Writable in the org.apache.hadoop.io package for further ideas if you need to write your own. The utility methods on WritableUtils are very handy, too.
Custom comparators should also be written to be RawComparators, if possible. These are comparators that implement a different sort order from the natural sort order defined by the default comparator. Example 5-9 shows such a comparator for TextPair, FirstComparator, which considers only the first string of the pair. Note that we override the compare() method that takes objects so that both compare() methods have the same semantics.
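The requirement that the two compare() overloads agree can be illustrated with a self-contained, non-Hadoop analogue. The serialization format here, a one-byte length prefix per ASCII string, is a deliberately simplified stand-in for TextPair's vint-prefixed layout:

```java
import java.util.Arrays;

// Non-Hadoop analogue of a "first field only" comparator: the object-level
// and byte-level compare methods implement the same ordering.
public class FirstOnly {
    // Serialize a pair of ASCII strings as [len1, bytes1..., len2, bytes2...],
    // with each length held in a single byte (lengths < 128).
    static byte[] serialize(String first, String second) {
        byte[] f = first.getBytes(), s = second.getBytes();
        byte[] out = new byte[2 + f.length + s.length];
        out[0] = (byte) f.length;
        System.arraycopy(f, 0, out, 1, f.length);
        out[1 + f.length] = (byte) s.length;
        System.arraycopy(s, 0, out, 2 + f.length, s.length);
        return out;
    }

    // Object-level compare: only the first element of the pair matters.
    static int compare(String[] a, String[] b) {
        return Integer.signum(a[0].compareTo(b[0]));
    }

    // Byte-level compare with the same semantics: read each one-byte length
    // prefix and compare just the first field's bytes, never deserializing.
    static int compareBytes(byte[] b1, byte[] b2) {
        int l1 = b1[0], l2 = b2[0];
        return Integer.signum(Arrays.compare(b1, 1, 1 + l1, b2, 1, 1 + l2));
    }

    public static void main(String[] args) {
        byte[] x = serialize("alpha", "zzz");
        byte[] y = serialize("beta", "aaa");
        // Both orderings agree: "alpha" < "beta"; the second fields are ignored.
        System.out.println(compare(new String[]{"alpha", "zzz"},
                                   new String[]{"beta", "aaa"})); // -1
        System.out.println(compareBytes(x, y));                   // -1
    }
}
```

Keeping the two overloads consistent matters because the framework may invoke either path: if the byte-level sort order diverged from the object-level one, results would depend on which code path happened to run.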
We will make use of this comparator in Chapter 9, when we look at joins and secondary sorting.

Example 5-9. A custom RawComparator for comparing the first field of TextPair byte representations