Figure 10.5 Using a custom partitioner
As a more practical example, a custom partitioner could be used to write the output for each calendar year to a separate file for subsequent analysis. A partitioner can also ensure that the workload is evenly distributed across the reducers. For example, if a few keys are known to account for the majority of the data, it may be useful to send those keys to separate reducers to achieve better overall performance. Otherwise, one reducer might be assigned the majority of the data, and the MapReduce job will not complete until that one long-running reduce task completes.
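As a concrete sketch of the calendar-year case, a custom partitioner might look like the following. This is illustrative only: YearPartitioner is a hypothetical class name, and the code assumes each key is a Text value that begins with a four-digit year.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that routes records to reducers by the
// calendar year embedded in the key.
// Assumption: keys are Text values beginning with a four-digit year,
// such as "2013-07-04".
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int year = Integer.parseInt(key.toString().substring(0, 4));
        // Mask the sign bit so the partition index is always non-negative.
        return (year & Integer.MAX_VALUE) % numPartitions;
    }
}

The driver would register this class with job.setPartitionerClass(YearPartitioner.class). If the number of reduce tasks is set to at least the number of distinct years, each year's records land in their own reducer and therefore in their own output file.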
Developing and Executing a Hadoop MapReduce Program
A common approach to developing a Hadoop MapReduce program is to write Java code in an Integrated Development Environment (IDE) such as Eclipse [17]. Compared to a plain-text editor or a command-line interface (CLI), an IDE offers a better experience for writing, compiling, testing, and debugging code. A typical MapReduce program consists of three Java files: one each for the driver code, the map code, and the reduce code. Additional Java files can be written for the combiner or the custom partitioner, if applicable. The Java code is compiled and stored as a Java Archive (JAR) file, which is then executed against the specified HDFS input files.
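To illustrate the three-file structure, the following is a minimal driver sketch for a word-count job. The class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical; the mapper and reducer are assumed to be defined in their own Java files, per the structure described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configures the job and wires together the map,
// combine, and reduce classes, which live in separate Java files.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map code
        job.setCombinerClass(WordCountReducer.class); // optional combiner
        job.setReducerClass(WordCountReducer.class);  // reduce code
        // job.setPartitionerClass(YearPartitioner.class); // optional custom partitioner
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged, the job might be submitted with a command along the lines of hadoop jar wordcount.jar WordCountDriver input_dir output_dir, where the JAR name and HDFS paths are illustrative.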
Beyond learning the mechanics of submitting a MapReduce job, the three key challenges for a new Hadoop developer are expressing the program's logic in the MapReduce paradigm; learning the Apache Hadoop Java classes, methods, and interfaces; and implementing the driver, map, and reduce functionality in Java.
Some prior experience with Java makes it easier for a new Hadoop developer to
focus on learning Hadoop and writing the MapReduce job.