Looking on HDFS, you will find a results_wc directory under /user/hadoop/perl that contains the output of the word-count task. As in previous examples, it is the part file that contains the result. Dumping the part file to the session with the Hadoop file system cat command, piped through the Linux tail command to limit the display to the last 10 lines, gives the following data:
[hadoop@hc1nn perl]$ ./wc_output.sh
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-06-20 13:36 /user/hadoop/perl/results_wc/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-06-20 13:35 /user/hadoop/perl/results_wc/_logs
-rw-r--r-- 1 hadoop supergroup 249441 2014-06-20 13:36 /user/hadoop/perl/results_wc/part-00000
zephyr,1
zero,1
zigzag,2
zimmermann,5
zipped,1
zoar,1
zoilus,3
zone,1
zones,1
zoophytes,1
The words have been sorted and their values totaled, and many of the unwanted characters have been removed from the words. This last example shows the wide-ranging possibilities of using scripts for Map Reduce jobs with Hadoop streaming. No Java code was needed, and no code was compiled.
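The wrapper script wc_output.sh itself is not reproduced here; a minimal sketch of what such a script might contain, assuming the results_wc path shown above and the hadoop fs cat plus tail pipeline just described, is as follows (the actual script may differ):

#!/bin/bash
# Sketch of a wrapper script along the lines of wc_output.sh.
# It lists the word-count results directory and then displays the
# last 10 lines of the part file that holds the results.

RESULTS=/user/hadoop/perl/results_wc

# List the _SUCCESS flag, the _logs directory, and the part file
hadoop fs -ls $RESULTS

# Dump the part file and limit the display to the final 10 lines
hadoop fs -cat $RESULTS/part-00000 | tail -10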
Summary
In this chapter you have investigated Map Reduce programming by using one example implemented in several ways. Using a single algorithm makes it easier to compare the different approaches.
A Java-based approach, for example, gives low-level control over Map Reduce development. The downside is that code volumes are large, so costs and the potential for errors increase. On the positive side, the low-level Hadoop APIs give a wide range of functionality for your data processing.
In contrast, the Apache Pig examples involved Pig's high-level native language. This resulted in a lower code volume and therefore lower costs and shorter development times. You can also extend the functionality of Pig by writing user-defined functions (UDFs) in Java. Pig can be a vehicle for processing HDFS-based data, and although there was no time to cover it here, it can also load data to Hive by using a product called HCatalog.
A word-count example was then attempted using Hive, the Hadoop data warehouse. A file was imported into a
table and a count of words was created in Hive QL, an SQL-like language. While this is a functional language and quite
easy to use, it may not offer the full range of functions that are available when using Pig and UDFs. Although it was
quick to implement and needed very little code, choosing this technique depends on the complexity of your task.
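As a rough illustration only, the Hive route can be reduced to a single query of the kind sketched below, run here from the shell with the hive command; the table name raw_lines and the column name line are hypothetical and are not necessarily those used in the chapter:

#!/bin/bash
# Hypothetical sketch of a Hive QL word count, assuming the raw text has
# already been loaded into a table raw_lines with a single string column
# named line.
hive -e "
SELECT word, COUNT(*) AS total
FROM (SELECT explode(split(line, ' ')) AS word FROM raw_lines) w
GROUP BY word
ORDER BY word;
"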
Lastly, word count was coded in Perl and called via the Hadoop streaming library. This showed that a third-party language like Python or Perl can be used to create Map Reduce jobs. In this example, unstructured text was employed for the streaming job, making it possible to create user-defined input and output formats. See the Hadoop streaming documentation for further details.
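A streaming job of this kind is typically launched with a command along the lines of the sketch below; the mapper and reducer script names (wc_mapper.pl and wc_reducer.pl), the input path, and the location of the streaming jar are illustrative assumptions rather than the chapter's exact values:

#!/bin/bash
# Sketch of a Hadoop streaming invocation that uses Perl scripts as the
# mapper and reducer; script names and paths are assumptions.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input   /user/hadoop/perl/input \
    -output  /user/hadoop/perl/results_wc \
    -mapper  wc_mapper.pl \
    -reducer wc_reducer.pl \
    -file    wc_mapper.pl \
    -file    wc_reducer.pl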
These different Map Reduce methods offer the ability to create simple ETL building blocks that can be used to
build complex ETL chains. Later chapters will discuss this concept in relation to products like Oozie, Talend, and
Pentaho. Therefore, the reader should consider this chapter in conjunction with Chapter 10, which will present big-
data visual ETL tools such as Talend and Pentaho; these offer a highly functional approach to ETL job creation using
object drag and drop.