Chapter 5
Submitting Jobs to Your HDInsight Cluster
Apart from the cluster-management operations you saw in the previous chapter, you can use the .NET SDK and the
Windows PowerShell cmdlets to control job submission and execution in your HDInsight cluster. The jobs are
typically MapReduce jobs, because MapReduce is the only processing model Hadoop natively understands. You can
write your MapReduce jobs in .NET, and you can also use supporting projects (such as Hive, Pig, and so forth) to
avoid coding MapReduce programs by hand, which can often be tedious and time consuming.
In all the samples I have shown so far, I used the command-line consoles. That does not need to be the case,
however; you can also use PowerShell. The console application used to submit the MapReduce jobs calls a .NET
job-submission API, so you can call that .NET API directly from within PowerShell, just as you did for the
cluster-management operations. You will reuse the console application you created in the previous chapter and
add the functions for job submission. In this chapter, you will learn how to implement a custom MapReduce program
in .NET and execute it as a Hadoop job. You will also take a look at how to execute the sample wordcount MapReduce
job and a Hive query using .NET and PowerShell.
Using the Hadoop .NET SDK
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run
MapReduce jobs with any executable or script as the mapper and/or the reducer. It is essentially a Hadoop interface
to MapReduce that allows you to write the map and reduce functions in languages other than Java (.NET, Perl, Python,
and so on). Hadoop streaming uses the standard input and output streams as the interface between Hadoop and your
program, so you can use any language that can read from standard input and write to standard output to write the
MapReduce program. This line-oriented interface makes streaming naturally suited for text processing. In this
chapter, I focus only on .NET as the language for leveraging Hadoop streaming.
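To make the streaming contract concrete, here is a minimal sketch (not part of the chapter's sample code) of what a standalone streaming mapper could look like in C#. It reads lines from standard input and writes tab-separated key/value pairs to standard output, which is all Hadoop streaming requires of a mapper executable:

    using System;

    // A bare-bones streaming mapper. Hadoop pipes each input split to stdin
    // and collects everything written to stdout as intermediate key/value pairs.
    public class StreamingWordCountMapper
    {
        public static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                foreach (string word in line.Split(' ', '\t'))
                {
                    if (word.Length > 0)
                    {
                        // Streaming convention: the text before the first tab
                        // is the key; the remainder of the line is the value.
                        Console.WriteLine("{0}\t{1}", word, 1);
                    }
                }
            }
        }
    }

A matching reducer executable would read the sorted key/value lines from standard input and aggregate consecutive lines that share the same key. The .NET SDK spares you this plumbing by wrapping it in base classes, as described next.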
When you submit a streaming job through the .NET SDK, the mapper and reducer parameters are .NET types that derive
from the SDK's abstract mapper and reducer base classes. The input, output, and file options are analogous to those
of a standard Hadoop streaming submission, while the mapper and reducer options let you specify the .NET types you
derived from the appropriate abstract base classes.
The objective in defining these base classes was not only to support creating .NET mapper and reducer classes,
but also to provide a means for Setup and Cleanup operations in support of in-place Mapper/Combiner/Reducer
optimizations, to let all classes publish data through IEnumerable sequences, and finally to provide a submission
mechanism as simple as submitting Java-based jobs.
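As a sketch of what deriving from these base classes looks like, the following wordcount mapper and reducer assume the Microsoft .NET SDK for Hadoop (the Microsoft.Hadoop.MapReduce package), whose base classes are MapperBase and ReducerCombinerBase; if your SDK version differs, the member names may vary:

    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    // The mapper is invoked once per input line and emits (word, "1") pairs.
    public class WordCountMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            foreach (string word in inputLine.Split(' '))
            {
                context.EmitKeyValue(word, "1");
            }
        }
    }

    // The reducer receives each key with the sequence of values emitted for
    // it. Deriving from ReducerCombinerBase lets the same type also serve as
    // an in-place combiner on the mapper side.
    public class WordCountReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }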
The basic logic behind MapReduce is that Hadoop processes the text input and passes each input line into the
Map function, which parses and filters it into key/value pairs. Hadoop then sorts and merges the mapped values
by key. The processed, mapped data is passed into the Reduce function as a key and its corresponding sequence
of string values, from which the Reduce function produces the output values. One important thing to keep in mind
is that Hadoop streaming is based on text data. Thus, the inputs into the MapReduce operations are strings, or
UTF-8-encoded bytes. Strings are not always the most convenient representation for the data you operate on, but
whatever types the operations use must be representable as strings.
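To tie the pieces together, here is a minimal sketch of how a job built from the mapper and reducer above could be defined and submitted, again assuming the Microsoft .NET SDK for Hadoop; the input and output paths are hypothetical placeholders:

    using Microsoft.Hadoop.MapReduce;

    // Binds the mapper and reducer types together and supplies the job settings.
    public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/gutenberg";      // hypothetical input path
            config.OutputFolder = "output/wordcount";  // hypothetical output folder
            return config;
        }
    }

    public class Program
    {
        public static void Main()
        {
            // Connect to the cluster and execute the job; the SDK ships the
            // assemblies to the cluster and drives the streaming submission.
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<WordCountJob>();
        }
    }

Because everything here is ordinary .NET, the same ExecuteJob call can just as easily be invoked from PowerShell, which is the pattern the rest of this chapter builds on.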
 