Looking on HDFS, you will find a results_wc directory under /user/hadoop/perl that contains the output of the word-count task. As in previous examples, it is the part file that contains the result. Dumping the part file to the session with the Hadoop file system cat command, piped through the Linux tail command to limit the display to the last 10 lines, gives the following data:
[hadoop@hc1nn perl]$ ./wc_output.sh
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-06-20 13:36 /user/hadoop/perl/results_wc/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-06-20 13:35 /user/hadoop/perl/results_wc/_logs
-rw-r--r-- 1 hadoop supergroup 249441 2014-06-20 13:36 /user/hadoop/perl/results_wc/part-00000
zephyr,1
zero,1
zigzag,2
zimmermann,5
zipped,1
zoar,1
zoilus,3
zone,1
zones,1
zoophytes,1
The words have been sorted and their values totaled, and many of the unwanted characters have been removed from the words. This last example shows the wide-ranging possibilities of using scripts for Map Reduce jobs with Hadoop streaming. No Java code was needed, and no code was compiled.
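The wrapper script wc_output.sh itself is not reproduced here; a minimal sketch of what such a script might contain, assuming the results_wc path shown above and the hadoop fs cat plus tail pipeline just described, is as follows (the actual script may differ):

#!/bin/bash
# Sketch of a wrapper script along the lines of wc_output.sh.
# It lists the word-count results directory and then displays the
# last 10 lines of the part file that holds the results.

RESULTS=/user/hadoop/perl/results_wc

# List the _SUCCESS flag, the _logs directory, and the part file
hadoop fs -ls $RESULTS

# Dump the part file and limit the display to the final 10 lines
hadoop fs -cat $RESULTS/part-00000 | tail -10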
Summary
In this chapter you have investigated Map Reduce programming by using one example implemented in several ways. Using a single algorithm makes it easier to compare the different approaches.
A Java-based approach, for example, gives low-level control over Map Reduce development. The downside is that code volumes are large, so costs and the potential for errors increase. On the positive side, the low-level Hadoop APIs give a wide range of functionality for your data processing.
In contrast, the Apache Pig examples involved Pig's high-level native language. This resulted in a lower code volume and therefore lower costs and shorter development times. You can also extend the functionality of Pig by writing user-defined functions (UDFs) in Java. Pig can be a vehicle for processing HDFS-based data, and although there was no time to cover it here, it can also load data to Hive by using a product called HCatalog.
A word-count example was then attempted using Hive, the Hadoop data warehouse. A file was imported into a
table and a count of words was created in Hive QL, an SQL-like language. While this is a functional language and quite
easy to use, it may not offer the full range of functions that are available when using Pig and UDFs. Although it was
quick to implement and needed very little code, choosing this technique depends on the complexity of your task.
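As a rough illustration only, the Hive route can be reduced to a single query of the kind sketched below, run here from the shell with the hive command; the table name raw_lines and the column name line are hypothetical and are not necessarily those used in the chapter:

#!/bin/bash
# Hypothetical sketch of a Hive QL word count, assuming the raw text has
# already been loaded into a table raw_lines with a single string column
# named line.
hive -e "
SELECT word, COUNT(*) AS total
FROM (SELECT explode(split(line, ' ')) AS word FROM raw_lines) w
GROUP BY word
ORDER BY word;
"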
Lastly, word count was coded in Perl and called via the Hadoop streaming library. This showed that a third-party language like Python or Perl can be used to create Map Reduce jobs. In this example, unstructured text was employed for the streaming job, making it possible to create user-defined input and output formats. See the Hadoop streaming documentation for further details.
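A streaming job of this kind is typically launched with a command along the lines of the sketch below; the mapper and reducer script names (wc_mapper.pl and wc_reducer.pl), the input path, and the location of the streaming jar are illustrative assumptions rather than the chapter's exact values:

#!/bin/bash
# Sketch of a Hadoop streaming invocation that uses Perl scripts as the
# mapper and reducer; script names and paths are assumptions.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input   /user/hadoop/perl/input \
    -output  /user/hadoop/perl/results_wc \
    -mapper  wc_mapper.pl \
    -reducer wc_reducer.pl \
    -file    wc_mapper.pl \
    -file    wc_reducer.pl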
These different Map Reduce methods offer the ability to create simple ETL building blocks that can be used to
build complex ETL chains. Later chapters will discuss this concept in relation to products like Oozie, Talend, and
Pentaho. Therefore, the reader should consider this chapter in conjunction with Chapter 10, which will present big-
data visual ETL tools such as Talend and Pentaho; these offer a highly functional approach to ETL job creation using
object drag and drop.