he 1585
for 1614
at 1621
his 1795
had 1839
it 1918
my 1921
as 1950
with 2563
that 2726
was 3119
I 4532
in 5149
a 5649
to 6230
and 7826
of 10538
the 18128
This is an SQL-like example that uses HiveQL to run a word count. Hive is easy to install and use, and its HiveQL interface provides powerful access to the data. Not including the COUNT(*) line, the word-count job took just three
statements, and each statement issued to the Hive CLI was passed on to Hadoop as a Map Reduce task.
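For illustration, those three statements might look something like the sketch below; the table name rawtext and the input path are placeholders, not the exact names used in the example:
CREATE TABLE rawtext (line STRING);
LOAD DATA INPATH '/user/hadoop/hive/wc_input' INTO TABLE rawtext;
SELECT word, COUNT(*) AS wcount
FROM rawtext LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word
ORDER BY wcount;
The LATERAL VIEW explode(split(...)) construct turns each line of text into one row per word, so that a simple GROUP BY can count the words.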
Whether you employ Hive for Map Reduce will depend on the data you are using, its type, and the relationships
between it and any other data streams you might wish to incorporate. You have to view your use of HiveQL in terms of
your ETL chains, that is, the sequence of steps that will transform your data. You might find that HiveQL doesn't offer
the functionality needed to process your data; in that case, you would choose either Pig Latin or Java instead.
Map Reduce with Perl
Additionally, you can use the streaming functionality provided with Hadoop. The important point
to note is that this approach allows you to process streams of data using the Hadoop libraries. With the Hadoop streaming
functionality, you can create Map Reduce jobs from many kinds of executable script, including Perl, Python, and Bash. It is
best used for textual data, as it streams the data between Hadoop and the external scripts.
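In general form, a streaming job is submitted by passing the streaming jar to the hadoop command, together with the HDFS input and output paths and the executables to act as mapper and reducer. The sketch below shows that general form only; the paths and script names are placeholders:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /user/hadoop/perl/input \
  -output /user/hadoop/perl/output \
  -mapper mapper.pl \
  -reducer reducer.pl \
  -file mapper.pl \
  -file reducer.pl
The -file options ship the scripts out to the cluster nodes so that each Map and Reduce task can execute them locally.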
In this example, you will run a Perl-based word-count task. (I present this example in Perl simply because I am
familiar with that language.) The streaming library is shipped within the Hadoop release as a jar file; it is this
library that allows users to write their own scripts and have Hadoop use them
for Map Reduce:
[hadoop@hc1nn hadoop]$ pwd
/usr/local/hadoop
[hadoop@hc1nn hadoop]$ ls -l contrib/streaming/hadoop-*streaming*.jar
-rw-rw-r--. 1 hadoop hadoop 107399 Jul 23 2013 contrib/streaming/hadoop-streaming-1.2.1.jar
First, we need a Perl working directory on HDFS, called /user/hadoop/perl, which will be used for the result data from
the Map Reduce run:
[hadoop@hc1nn python]$ hadoop dfs -mkdir /user/hadoop/perl
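The job also needs mapper and reducer scripts. The sketch below shows what a minimal pair might look like; the names mapper.pl and reducer.pl and the exact word-splitting logic are illustrative assumptions, not the precise scripts used here:
#!/usr/bin/perl
# mapper.pl - read text from STDIN, emit "word <TAB> 1" for every word
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    foreach my $word ( split /\s+/, $line ) {
        print "$word\t1\n" if length $word;
    }
}

#!/usr/bin/perl
# reducer.pl - input arrives sorted by word; sum the counts for each word
use strict;
use warnings;

my $current;      # the word currently being totaled
my $count = 0;

while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $word, $num ) = split /\t/, $line;
    if ( defined $current && $word ne $current ) {
        print "$current\t$count\n";    # word changed: output the finished total
        $count = 0;
    }
    $current = $word;
    $count  += $num;
}
print "$current\t$count\n" if defined $current;    # output the last word
Because Hadoop streaming talks to the scripts only through standard input and standard output, this is all that is required of them; no Perl-specific Hadoop API is involved.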
 