Figure 10.5 Using a custom partitioner
As a more practical example, a custom partitioner could be used to write the output for each calendar year to a separate file for subsequent analysis. A partitioner can also ensure that the workload is evenly distributed across the reducers. For example, if a few keys are known to account for the majority of the data, it may be useful to send those keys to separate reducers to achieve better overall performance. Otherwise, one reducer might be assigned the majority of the data, and the MapReduce job will not complete until that one long-running reduce task completes.
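As a concrete sketch of the calendar-year case, a custom partitioner might look like the following. This is illustrative only: YearPartitioner is a hypothetical class name, and the code assumes each key is a Text value that begins with a four-digit year.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that routes records to reducers by the
// calendar year embedded in the key.
// Assumption: keys are Text values beginning with a four-digit year,
// such as "2013-07-04".
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int year = Integer.parseInt(key.toString().substring(0, 4));
        // Mask the sign bit so the partition index is always non-negative.
        return (year & Integer.MAX_VALUE) % numPartitions;
    }
}

The driver would register this class with job.setPartitionerClass(YearPartitioner.class). If the number of reduce tasks is set to at least the number of distinct years, each year's records land in their own reducer and therefore in their own output file.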
Developing and Executing a Hadoop MapReduce Program
A common approach to developing a Hadoop MapReduce program is to write Java code in an Integrated Development Environment (IDE) such as Eclipse [17]. Compared to a plain-text editor or a command-line interface (CLI), an IDE offers a better experience for writing, compiling, testing, and debugging code. A typical MapReduce program consists of three Java files: one each for the driver code, the map code, and the reduce code. Additional Java files can be written for the combiner or the custom partitioner, if applicable. The Java code is compiled and stored as a Java Archive (JAR) file, which is then executed against the specified HDFS input files.
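To illustrate the three-file structure, the following is a minimal driver sketch for a word-count job. The class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical; the mapper and reducer are assumed to be defined in their own Java files, per the structure described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configures the job and wires together the map,
// combine, and reduce classes, which live in separate Java files.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map code
        job.setCombinerClass(WordCountReducer.class); // optional combiner
        job.setReducerClass(WordCountReducer.class);  // reduce code
        // job.setPartitionerClass(YearPartitioner.class); // optional custom partitioner
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged, the job might be submitted with a command along the lines of hadoop jar wordcount.jar WordCountDriver input_dir output_dir, where the JAR name and HDFS paths are illustrative.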
Beyond learning the mechanics of submitting a MapReduce job, the three key challenges for a new Hadoop developer are expressing the program's logic in the MapReduce paradigm; learning the Apache Hadoop Java classes, methods, and interfaces; and implementing the driver, map, and reduce functionality in Java.
Some prior experience with Java makes it easier for a new Hadoop developer to
focus on learning Hadoop and writing the MapReduce job.