Checking the last 10 lines of the results part file using the Hadoop file system cat command and the Linux tail
command gives a sorted word count with any unwanted characters removed:
[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000 | tail -10
zanthe 1
zeal 2
zeboin 1
zelo 1
zephyr 1
zimmermann 1
zipped 1
zoar 1
zoilus 3
zone 1
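The same cat-and-tail inspection pattern works on any sorted key-value output. As a quick local illustration (the file name and sample data here are invented for the sketch), the Linux tools alone behave identically on an ordinary file:

```shell
# Create a small sample file in the same tab-separated
# "word<TAB>count" format that the Hadoop job writes (sample data assumed).
printf 'zeal\t2\nzephyr\t1\nzoilus\t3\nzone\t1\n' > part-00000.local

# Exactly as with the HDFS output, tail shows the final lines of the
# already-sorted word counts.
tail -2 part-00000.local
```

Running this prints the last two words and their counts (`zoilus 3`, `zone 1`), mirroring what the `hadoop dfs -cat ... | tail` pipeline shows for the real results file.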
Comparing the Examples
The two native Map Reduce word-count examples (wc-ex1.java and wc-ex2.java) show that raw Java code for Map
Reduce can deliver complex functionality. It is not, however, the most efficient approach. Consider the effect
achieved when some simple pattern-filtering options were added. The listing grew from 70 lines in example 1 to 167
lines in example 2. As the code volume increases, so do the cost, complexity, and time to implement. Now, imagine
the effect on a more complex algorithm; the resulting code could quickly become even more unwieldy.
The good news is that alternatives are available. In the next sections, I will introduce some other Map Reduce
coding tools that offer the ability to code these tasks at a higher level and so reduce code volume. Generally, it is more
efficient and cheaper to use less code to achieve your objective. You should drop down to lower-level Java only
when higher-level systems such as native Pig (including UDFs), described in the next section, do not offer the
functionality you need.
So, next you will learn to source, install, and use Apache Pig. You will also code the same word-count algorithm
in Pig.
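As a preview of how compact that can be, the classic word count fits in a handful of Pig Latin statements. This is only a sketch: the input and output paths are assumptions, and the full walk-through follows later in the chapter.

```pig
-- Load the raw text, one line per record (input path assumed).
lines  = LOAD '/user/hadoop/edgar' AS (line:chararray);
-- Split each line into words and flatten to one word per record.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group.
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words);
-- Write the results back to HDFS (output path assumed).
STORE counts INTO '/user/hadoop/edgar-pig-results';
```

Five statements replace the 70 to 167 lines of Java discussed above, which is exactly the reduction in code volume the comparison is making.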
Map Reduce with Pig
Pig is a higher-level programming language for processing large data sets, and it can run in either interactive
or batch mode. You will see that far fewer lines of code are needed to carry out the same word-count example.
Apache Pig can be downloaded from pig.apache.org.
As Pig is a higher-level language, you can concentrate more on the logical flow of data processing and less on the
lower-level coding to achieve that processing. Also, Pig integrates well with the visual-object-based ETL and reporting
tools for big data that are introduced in Chapters 10 and 11 of this guide. This means that you have a quicker and
easier path into the world of data processing using Map Reduce. As will be explained later, tools like Talend even
help to abstract Map Reduce through their predefined Pig-based functionality.
Installing Pig
For this topic's examples, I chose to download Pig release 0.12.1 from pig.apache.org/releases.html because it is
compatible with the version of Hadoop I have been using up to this point (1.x). The download and installation are
straightforward. From the download page, you select the "Pig 0.8 and later" download link. The Pig website then
suggests a mirror site from which you can download (in my case, it was www.carfab.com). After clicking that link, you're
 