This command produces 30 output files, each of which is sorted. However, there is no
easy way to combine the files (by concatenation, for example, in the case of plain-text
files) to produce a globally sorted file.
For many applications, this doesn't matter. For example, having a partially sorted set of
files is fine when you want to do lookups by key. The SortByTemperatureToMapFile
and LookupRecordsByTemperature classes in this topic's example code explore this
idea. By using a map file instead of a sequence file, it's possible to first find the relevant
partition that a key belongs in (using the partitioner) and then do an efficient lookup of
the record within the map file partition.
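As a sketch of how such a lookup can work (the class name and argument handling here are illustrative, not the example code itself), Hadoop's MapFileOutputFormat can open one reader per partition and use the same partitioner the job used to route a key to the right map file, assuming the sort job wrote one map file per reducer with the default HashPartitioner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class LookupByKeySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outputDir = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));
    Text value = new Text();

    // One reader per map file partition in the job's output directory
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(outputDir, conf);
    // The same partitioner the job used routes the key to its partition,
    // then the lookup is a single seek within that map file
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Writable entry =
        MapFileOutputFormat.getEntry(readers, partitioner, key, value);
    if (entry != null) {
      System.out.println(key + "\t" + value);
    }
    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}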
Total Sort
How can you produce a globally sorted file using Hadoop? The naive answer is to use a
single partition.[63] But this is incredibly inefficient for large files, because one machine
has to process all of the output, so you are throwing away the benefits of the parallel
architecture that MapReduce provides.
Instead, it is possible to produce a set of sorted files that, if concatenated, would form a
globally sorted file. The secret to doing this is to use a partitioner that respects the total
order of the output. For example, if we had four partitions, we could put keys for
temperatures less than -10°C in the first partition, those between -10°C and 0°C in the
second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.
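As a sketch, a partitioner implementing this fixed-range scheme might look like the following, assuming IntWritable keys holding temperatures in °C (the class name and key type are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class TemperatureRangePartitioner
    extends Partitioner<IntWritable, Object> {
  @Override
  public int getPartition(IntWritable key, Object value, int numPartitions) {
    // Assumes the job runs with four reducers, one per range
    int temp = key.get();
    if (temp < -10) return 0;       // < -10°C
    if (temp < 0)   return 1;       // [-10°C, 0°C)
    if (temp < 10)  return 2;       // [0°C, 10°C)
    return 3;                       // >= 10°C
  }
}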
Although this approach works, you have to choose your partition sizes carefully to ensure
that they are fairly even, so job times aren't dominated by a single reducer. For the
partitioning scheme just described, the relative sizes of the partitions are as follows:
Temperature range      < -10°C   [-10°C, 0°C)   [0°C, 10°C)   >= 10°C
Proportion of records  11%       13%            17%           59%
These partitions are not very even. To construct more even partitions, we need to have a
better understanding of the temperature distribution for the whole dataset. It's fairly easy
to write a MapReduce job to count the number of records that fall into a collection of
temperature buckets. For example, Figure 9-1 shows the distribution for buckets of size
1°C, where each point on the plot corresponds to one bucket.
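A sketch of such a counting job's map and reduce functions follows, under the illustrative assumption that each input line is simply an integer temperature in °C (real records would need proper parsing). Because the buckets are 1°C wide, the bucket id is just the temperature itself:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TemperatureHistogramSketch {
  public static class BucketMapper
      extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final IntWritable bucket = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit (bucket, 1) for each record
      bucket.set(Integer.parseInt(line.toString().trim()));
      context.write(bucket, ONE);
    }
  }

  public static class SumReducer
      extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<LongWritable> counts,
        Context context) throws IOException, InterruptedException {
      // Total the counts for this 1°C bucket
      long total = 0;
      for (LongWritable count : counts) {
        total += count.get();
      }
      context.write(bucket, new LongWritable(total));
    }
  }
}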
Although we could use this information to construct a very even set of partitions, the fact
that we needed to run a job that used the entire dataset to construct them is not ideal. It's
possible to get a fairly even set of partitions by sampling the key space. The idea behind
sampling is that you look at a small subset of the keys to approximate the key distribution,
which is then used to construct partitions.
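Hadoop supports this with InputSampler and TotalOrderPartitioner. A minimal configuration sketch follows; the partition file path, sampling parameters, and reducer count are illustrative, and the job's input/output setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "total sort sketch");
    // ... input/output formats, paths, and map output types elided ...
    job.setNumReduceTasks(30);  // cut points are computed per reducer
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("_partitions"));  // illustrative location

    // Sample the keys: look at 10% of records, keeping at most 10,000
    // samples drawn from at most 10 input splits
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);
    // The job can now be submitted; each reducer receives a contiguous
    // key range, so concatenating its outputs yields a totally sorted file
  }
}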