In the Hadoop MapReduce framework, a default partitioner is defined. The default partitioner is
HashPartitioner, which uses the hashCode value of the key. Therefore, a key/value pair is assigned to partition

hashCode(key) mod n

where n is the number of partitions across which the key/value pairs are distributed.
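To make the computation concrete, here is a minimal sketch of that logic in Java; the method name partitionFor is illustrative, and the bitwise AND with Integer.MAX_VALUE (also used by Hadoop's HashPartitioner) clears the sign bit so the modulo result is never negative:

// Assign a key to one of numPartitions buckets by its hashCode
static int partitionFor(Object key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}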
Hadoop uses an interface called Partitioner to determine which partition a key/value pair emitted by
a map task goes to. The number of partitions, and therefore the number of reduce tasks, is known
when a job starts. The Partitioner interface is as follows:
public interface Partitioner<K, V> extends JobConfigurable {
    int getPartition(K key, V value, int numPartitions);
}
The getPartition method takes the key, the value, and the number of partitions as arguments and
returns the partition number, which identifies the partition the key/value pair is sent to. For any two keys,
k1 and k2, if k1 and k2 are equal, then the partition number returned by getPartition is the same.
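As an illustration of that contract, the following sketch implements the Partitioner interface shown above. The class name, the Text key/value types, and the first-character routing rule are all hypothetical choices; the essential property is that equal keys always return the same partition number:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: route keys by their first character,
// so equal keys always land in the same partition.
public class FirstCharPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask the sign bit before taking the modulo
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
        // No job configuration is needed for this example
    }
}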
If partitioning of the emitted key/value pairs is not balanced, the result can be load imbalance or
over-partitioning, neither of which is efficient. Load imbalance occurs when a few reducers take on
most of the load while others remain idle, and it leads to increased latency. Machines and disks
under full load also tend to slow down and hit boundary conditions where efficiency is reduced.
Load imbalance pushes some reducers into these saturated states.
You know from the earlier illustration of Amdahl's Law that any parallel process optimization is
limited by its longest serial task. In partitioned MapReduce processing, a single long-running serial
execution can become a bottleneck. It can also cause sequential waits, because the reduce and grouping
tasks complete the overall process only when all constituent key/value pairs have been processed.
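To see why the longest task dominates, recall Amdahl's Law in its standard form: if a fraction P of the work is parallelizable across N workers, the overall speedup is bounded by

speedup <= 1 / ((1 - P) + P / N)

For example, with P = 0.95 and N = 100 reducers, the bound is roughly 17x; a single straggling reducer effectively shrinks P and lowers that bound further.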
SCHEDULING IN HETEROGENEOUS ENVIRONMENTS
Hadoop's default scheduling algorithm is simple: it compares each task's progress to the average progress
when deciding whether to launch speculative tasks. The default scheduler assumes the following:

- Nodes perform work at roughly the same rate.
- Tasks progress at a constant rate throughout their execution.
In heterogeneous environments, this default speculative algorithm does not perform
optimally. Therefore, improvements have been made specifically to address the problems that arise in
heterogeneous environments.
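Speculative execution can also be tuned or disabled per job. Here is a minimal sketch using the old org.apache.hadoop.mapred API; whether to disable reduce-side speculation, as shown, is a judgment call for your particular workload:

import org.apache.hadoop.mapred.JobConf;

public class SpeculationConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Launch speculative (backup) copies of slow map tasks
        conf.setMapSpeculativeExecution(true);
        // Skip speculative copies of reduce tasks
        conf.setReduceSpeculativeExecution(false);
    }
}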
The Longest Approximate Time to End (LATE) scheduler is an improvement on the default
Hadoop scheduler. The LATE scheduler launches speculative tasks only on fast nodes. It also imposes
a speculative cap, which limits the number of tasks that can be speculated at once, and a slow task
threshold, which determines whether a task is slow enough to be speculated.
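The estimate behind the scheduler's name can be sketched in a few lines. The formula, taken from the LATE paper, treats a task's progress rate as its progress score divided by its elapsed running time; the method and variable names here are illustrative:

// Sketch of LATE's time-left estimate for a running task.
// progressScore is in [0, 1]; elapsedSeconds is the task's runtime so far.
static double estimatedTimeLeft(double progressScore, double elapsedSeconds) {
    double progressRate = progressScore / elapsedSeconds; // progress per second
    return (1.0 - progressScore) / progressRate;          // estimated seconds remaining
}

LATE speculatively re-executes the task whose estimated time to end is longest, subject to the speculative cap and thresholds described above.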
Although the LATE scheduler is an improvement on the default Hadoop scheduler, both of these
schedulers compute the progress of tasks in a static manner. SAMR, a self-adaptive MapReduce
scheduling algorithm, outperforms both the default and LATE schedulers in heterogeneous
environments. You can read more about SAMR in the paper titled "SAMR: A Self-adaptive MapReduce
Scheduling Algorithm in Heterogeneous Environment" by Quan Chen, Daqiang Zhang, et al.