On the other hand, for some applications it's desirable that all output is sorted in total. Each output file (generated by one reducer) is already in sorted order; it would be nice to also have all the records in part-00000 be smaller than the records in part-00001, and those in part-00001 be smaller than those in part-00002, and so forth. The key to doing this is the partitioner operation in the framework.
The job of the partitioner is to deterministically assign a reducer to each key. All records with the same key are grouped and processed together in the reduce stage. An important design requirement of the partitioner is to balance load across reducers: no single reducer should be given many more keys than the others. Without any prior information about the distribution of keys, the default partitioner uses a hashing function to uniformly assign keys to reducers. This often works well in distributing work evenly across reducers, but the assignment is intentionally arbitrary and not in any order. If we have prior knowledge that the keys are approximately uniformly distributed, we can use a partitioner that assigns key ranges to each reducer and still be certain that the reducers' loads are fairly balanced, as the sketch below illustrates.
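As a concrete illustration, the following is a minimal sketch of such a range partitioner, written against the org.apache.hadoop.mapreduce API. It assumes Text keys beginning with letters spread roughly uniformly over 'a'..'z'; the class name AlphabetRangePartitioner and that distribution assumption are ours for illustration, not part of Hadoop.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: splits the alphabet into equal ranges, assuming
// keys are words roughly uniformly distributed over 'a'..'z'. Because
// the ranges are assigned in order, reducer i receives only keys that
// sort before every key sent to reducer i+1.
public class AlphabetRangePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (first < 'a' || first > 'z') {
            return 0;  // send non-alphabetic keys to the first reducer
        }
        // Map the 26 letters proportionally onto the available reducers.
        return (first - 'a') * numPartitions / 26;
    }
}

A job would register it with job.setPartitionerClass(AlphabetRangePartitioner.class). Unlike the default hash partitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numPartitions and therefore scatters adjacent keys, this assignment preserves key order across partitions.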
TIP The hash partitioner can also fail to distribute work evenly if certain keys take much more time to process than others. For example, in highly skewed data sets, a significant number of records may have the same key. If possible, you should use a combiner to lessen the load on the reduce phase by doing as much preprocessing as possible in the map phase; a word-count-style combiner is sketched below. In addition, you can also choose to write a special partitioner that distributes keys unevenly in such a way that it balances out the inherent skew of the data and its processing.
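For instance, in a word-count-style job where the reduce function simply sums counts, the same summing logic can also run as a combiner on the map side. The following is a minimal sketch (the class name SumCombiner is ours); it is safe to use as a combiner because summation is commutative and associative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a combiner: partial counts for a key are summed on the map
// side, so a heavily repeated (skewed) key reaches the reducer as one
// record per map task rather than one record per occurrence.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

It is registered with job.setCombinerClass(SumCombiner.class). Keep in mind that Hadoop may run the combiner zero, one, or several times per key, so it must not change the job's semantics.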
The TotalOrderPartitioner is a partitioner that ensures the output is sorted between partitions, not only within them. Sorting of large-scale data (e.g., the TeraSort benchmark) originally used a similar version of this class. The class reads a sequence file containing a sorted partition keyset and uses those keys as boundaries to route each key range to its own reducer; a driver sketch follows.
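Below is a minimal driver sketch showing how TotalOrderPartitioner is typically wired together with InputSampler, which samples the input keys and writes the sorted partition keyset for you. It uses the newer org.apache.hadoop.mapreduce API; the paths, reducer count, and sampler parameters are illustrative, and the input is assumed to be a sequence file of Text/Text pairs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch of a totally ordered sort job. No mapper or reducer is set,
// so the identity Mapper and Reducer defaults apply and the job only
// moves records into globally sorted partitions.
public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total sort");
        job.setJarByClass(TotalSortDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);  // illustrative reducer count

        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Point the partitioner at the file that will hold the sorted
        // partition keyset, then let InputSampler build that file by
        // sampling the input: 10% sampling probability, at most 10,000
        // samples, drawn from at most 10 input splits.
        Path partitionFile = new Path(args[1] + "_partitions");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With n reducers, the partition file holds n-1 boundary keys, and each reducer's output file is both internally sorted and entirely smaller than the next reducer's output.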
7.6 Summary
This chapter discussed many tools and techniques to make your Hadoop job more user-friendly or make it interface better with other components of your data processing infrastructure. The full extent of the capabilities available in a Hadoop job is documented in the Hadoop API: http://hadoop.apache.org/common/docs/current/api/index.html. You may also want to check out additional abstractions such as Pig and Hive to simplify your programming. We'll cover these tools in chapters 10 and 11. If your role involves administering a Hadoop cluster, you will find the tips on managing a Hadoop cluster in the next chapter useful.
 