On the other hand, for some applications it's desirable that all output is sorted in total. Each output file (generated by one reducer) is already in sorted order; it would be nice to also have all the records in part-00000 be smaller than the records in part-00001, and those in part-00001 be smaller than those in part-00002, and so forth. The key to doing this is the partitioner operation in the framework.
The job of the partitioner is to deterministically assign a reducer to each key. All records with the same key are grouped and processed together in the reduce stage. An important design requirement of the partitioner is to balance load across reducers: no single reducer should be given many more keys than the others. Without any prior information about the distribution of keys, the default partitioner uses a hashing function to uniformly assign keys to reducers. This often works well in distributing work evenly across reducers, but the assignment is intentionally arbitrary and not in any order. If we have prior knowledge that the keys are approximately uniformly distributed, we can use a partitioner that assigns key ranges to each reducer and still be certain that the reducers' loads are fairly balanced, as the sketch below illustrates.
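As a concrete illustration, the following is a minimal sketch of such a range partitioner, written against the org.apache.hadoop.mapreduce API. It assumes Text keys beginning with letters spread roughly uniformly over 'a'..'z'; the class name AlphabetRangePartitioner and that distribution assumption are ours for illustration, not part of Hadoop.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: splits the alphabet into equal ranges, assuming
// keys are words roughly uniformly distributed over 'a'..'z'. Because
// the ranges are assigned in order, reducer i receives only keys that
// sort before every key sent to reducer i+1.
public class AlphabetRangePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (first < 'a' || first > 'z') {
            return 0;  // send non-alphabetic keys to the first reducer
        }
        // Map the 26 letters proportionally onto the available reducers.
        return (first - 'a') * numPartitions / 26;
    }
}

A job would register it with job.setPartitionerClass(AlphabetRangePartitioner.class). Unlike the default hash partitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numPartitions and therefore scatters adjacent keys, this assignment preserves key order across partitions.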
TIP The hash partitioner can also fail to distribute work evenly if certain keys take much more time to process than others. For example, in highly skewed data sets, a significant number of records may have the same key. If possible, you should use a combiner to lessen the load on the reduce phase by doing as much preprocessing as possible in the map phase; a word-count-style combiner is sketched below. In addition, you can also choose to write a special partitioner that distributes keys unevenly in such a way that it balances out the inherent skew of the data and its processing.
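For instance, in a word-count-style job where the reduce function simply sums counts, the same summing logic can also run as a combiner on the map side. The following is a minimal sketch (the class name SumCombiner is ours); it is safe to use as a combiner because summation is commutative and associative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a combiner: partial counts for a key are summed on the map
// side, so a heavily repeated (skewed) key reaches the reducer as one
// record per map task rather than one record per occurrence.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

It is registered with job.setCombinerClass(SumCombiner.class). Keep in mind that Hadoop may run the combiner zero, one, or several times per key, so it must not change the job's semantics.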
The TotalOrderPartitioner is a partitioner that ensures the output is sorted between partitions, not only within them. Sorting of large-scale data (e.g., the TeraSort benchmark) originally used a similar version of this class. The class reads a sequence file containing a sorted partition keyset and uses those keys as boundaries to route each key range to its own reducer; a driver sketch follows.
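Below is a minimal driver sketch showing how TotalOrderPartitioner is typically wired together with InputSampler, which samples the input keys and writes the sorted partition keyset for you. It uses the newer org.apache.hadoop.mapreduce API; the paths, reducer count, and sampler parameters are illustrative, and the input is assumed to be a sequence file of Text/Text pairs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch of a totally ordered sort job. No mapper or reducer is set,
// so the identity Mapper and Reducer defaults apply and the job only
// moves records into globally sorted partitions.
public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total sort");
        job.setJarByClass(TotalSortDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);  // illustrative reducer count

        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Point the partitioner at the file that will hold the sorted
        // partition keyset, then let InputSampler build that file by
        // sampling the input: 10% sampling probability, at most 10,000
        // samples, drawn from at most 10 input splits.
        Path partitionFile = new Path(args[1] + "_partitions");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With n reducers, the partition file holds n-1 boundary keys, and each reducer's output file is both internally sorted and entirely smaller than the next reducer's output.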
7.6 Summary
This chapter discussed many tools and techniques to make your Hadoop job more user-friendly or make it interface better with other components of your data processing infrastructure. The full extent of the capabilities available in a Hadoop job is documented in the Hadoop API: http://hadoop.apache.org/common/docs/current/api/index.html. You may also want to check out additional abstractions such as Pig and Hive to simplify your programming. We'll cover these tools in chapters 10 and 11. If your role involves administering a Hadoop cluster, you will find the tips on managing a Hadoop cluster in the next chapter useful.
 