Checking the last 10 lines of the results part file using the Hadoop file system cat command and the Linux tail
command gives a sorted word count with any unwanted characters removed:
[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000 | tail -10
zanthe 1
zeal 2
zeboin 1
zelo 1
zephyr 1
zimmermann 1
zipped 1
zoar 1
zoilus 3
zone 1
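The same cat-and-tail inspection pattern works on any sorted key-value output. As a quick local illustration (the file name and sample data here are invented for the sketch), the Linux tools alone behave identically on an ordinary file:

```shell
# Create a small sample file in the same tab-separated
# "word<TAB>count" format that the Hadoop job writes (sample data assumed).
printf 'zeal\t2\nzephyr\t1\nzoilus\t3\nzone\t1\n' > part-00000.local

# Exactly as with the HDFS output, tail shows the final lines of the
# already-sorted word counts.
tail -2 part-00000.local
```

Running this prints the last two words and their counts (`zoilus 3`, `zone 1`), mirroring what the `hadoop dfs -cat ... | tail` pipeline shows for the real results file.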
Comparing the Examples
The two native Map Reduce word-count examples (wc-ex1.java and wc-ex2.java) show that raw Java code for Map
Reduce can deliver complex functionality. It is not, however, the most efficient approach. Consider the effect
achieved when some simple pattern-filtering options were added. The listing grew from 70 lines in example 1 to 167
lines in example 2. As the code volume increases, so do the cost, complexity, and time to implement. Now, imagine
the effect on a more complex algorithm; the resulting code could quickly become even more unwieldy.
The good news is that alternatives are available. In the next sections, I will introduce some other Map Reduce
coding tools that offer the ability to code these tasks at a higher level and so reduce code volume. Generally, it is more
efficient and cheaper to use less code to achieve your objective. You should drop down to lower-level Java only
when higher-level systems such as native Pig (including UDFs), described in the next section, do not offer the
functionality you need.
So, next you will learn to source, install, and use Apache Pig. You will also code the same word-count algorithm
in Pig.
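As a preview of how compact that can be, the classic word count fits in a handful of Pig Latin statements. This is only a sketch: the input and output paths are assumptions, and the full walk-through follows later in the chapter.

```pig
-- Load the raw text, one line per record (input path assumed).
lines  = LOAD '/user/hadoop/edgar' AS (line:chararray);
-- Split each line into words and flatten to one word per record.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group.
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words);
-- Write the results back to HDFS (output path assumed).
STORE counts INTO '/user/hadoop/edgar-pig-results';
```

Five statements replace the 70 to 167 lines of Java discussed above, which is exactly the reduction in code volume the comparison is making.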
Map Reduce with Pig
Pig is a higher-level programming language for processing large data sets, and it can run in either interactive
or batch mode. You will see that far fewer lines of code are needed to carry out the same word-count example.
Apache Pig can be downloaded from pig.apache.org.
As Pig is a higher-level language, you can concentrate more on the logical flow of data processing and less on the
lower-level coding to achieve that processing. Also, Pig integrates well with the visual-object-based ETL and reporting
tools for big data that are introduced in Chapters 10 and 11 of this guide. This means that you have a quicker and
easier path into the world of data processing using Map Reduce. As will be explained later, tools like Talend even
help to abstract Map Reduce through their predefined Pig-based functionality.
Installing Pig
For this topic's examples, I chose to download Pig release 0.12.1 from pig.apache.org/releases.html because it is
compatible with the version of Hadoop I have been using up to this point (1.x). The download and installation are
straightforward. From the download page, you select the "Pig 0.8 and later" download link. The Pig website then
suggests a mirror site from which you can download (in my case, it was www.carfab.com). After clicking that link, you're
 