Chapter 5
Submitting Jobs to Your HDInsight Cluster
Apart from the cluster-management operations you saw in the previous chapter, you can use the .NET SDK and the
Windows PowerShell cmdlets to control job submission and execution in your HDInsight cluster. The jobs are
typically MapReduce jobs, because MapReduce is the only processing model Hadoop natively understands. You can
write your MapReduce jobs in .NET, and you can also use supporting projects (such as Hive, Pig, and so forth) to
avoid coding MapReduce programs by hand, which can often be tedious and time consuming.
In all the samples I have shown so far, I used the command-line consoles. That does not need to be the case,
however; you can also use PowerShell. The console application used to submit the MapReduce jobs calls a .NET
job-submission API, so you can call that .NET API directly from within PowerShell, just as you did for the
cluster-management operations. You will reuse the console application you created in the previous chapter and
add the functions for job submission. In this chapter, you will learn how to implement a custom MapReduce program
in .NET and execute it as a Hadoop job. You will also take a look at how to execute the sample wordcount MapReduce
job and a Hive query using .NET and PowerShell.
Using the Hadoop .NET SDK
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run
MapReduce jobs with any executable or script as the mapper and/or the reducer. It is essentially a Hadoop interface
to MapReduce that allows you to write the map and reduce functions in languages other than Java (.NET, Perl, Python,
and so on). Hadoop streaming uses the standard input and output streams as the interface between Hadoop and your
program, so you can use any language that can read from standard input and write to standard output to write the
MapReduce program. This line-oriented interface makes streaming naturally suited for text processing. In this
chapter, I focus only on .NET as the language for leveraging Hadoop streaming.
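To make the streaming contract concrete, here is a minimal sketch (not part of the chapter's sample code) of what a standalone streaming mapper could look like in C#. It reads lines from standard input and writes tab-separated key/value pairs to standard output, which is all Hadoop streaming requires of a mapper executable:

    using System;

    // A bare-bones streaming mapper. Hadoop pipes each input split to stdin
    // and collects everything written to stdout as intermediate key/value pairs.
    public class StreamingWordCountMapper
    {
        public static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                foreach (string word in line.Split(' ', '\t'))
                {
                    if (word.Length > 0)
                    {
                        // Streaming convention: the text before the first tab
                        // is the key; the remainder of the line is the value.
                        Console.WriteLine("{0}\t{1}", word, 1);
                    }
                }
            }
        }
    }

A matching reducer executable would read the sorted key/value lines from standard input and aggregate consecutive lines that share the same key. The .NET SDK spares you this plumbing by wrapping it in base classes, as described next.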
When you submit a streaming job through the .NET SDK, the mapper and reducer parameters are .NET types that derive
from the SDK's abstract mapper and reducer base classes. The input, output, and file options are analogous to those
of a standard Hadoop streaming submission, while the mapper and reducer options let you specify the .NET types you
derived from the appropriate abstract base classes.
The objective in defining these base classes was not only to support creating .NET mapper and reducer classes,
but also to provide a means for Setup and Cleanup operations in support of in-place Mapper/Combiner/Reducer
optimizations, to let all classes publish data through IEnumerable sequences, and finally to provide a submission
mechanism as simple as submitting Java-based jobs.
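As a sketch of what deriving from these base classes looks like, the following wordcount mapper and reducer assume the Microsoft .NET SDK for Hadoop (the Microsoft.Hadoop.MapReduce package), whose base classes are MapperBase and ReducerCombinerBase; if your SDK version differs, the member names may vary:

    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    // The mapper is invoked once per input line and emits (word, "1") pairs.
    public class WordCountMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            foreach (string word in inputLine.Split(' '))
            {
                context.EmitKeyValue(word, "1");
            }
        }
    }

    // The reducer receives each key with the sequence of values emitted for
    // it. Deriving from ReducerCombinerBase lets the same type also serve as
    // an in-place combiner on the mapper side.
    public class WordCountReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }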
The basic logic behind MapReduce is that Hadoop processes the text input and passes each input line into the
Map function, which parses and filters it into key/value pairs. Hadoop then sorts and merges the mapped values
by key. The processed, mapped data is passed into the Reduce function as a key and its corresponding sequence
of string values, from which the Reduce function produces the output values. One important thing to keep in mind
is that Hadoop streaming is based on text data. Thus, the inputs into the MapReduce operations are strings, or
UTF-8-encoded bytes. Strings are not always the most convenient representation for the data you operate on, but
whatever types the operations use must be representable as strings.
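To tie the pieces together, here is a minimal sketch of how a job built from the mapper and reducer above could be defined and submitted, again assuming the Microsoft .NET SDK for Hadoop; the input and output paths are hypothetical placeholders:

    using Microsoft.Hadoop.MapReduce;

    // Binds the mapper and reducer types together and supplies the job settings.
    public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/gutenberg";      // hypothetical input path
            config.OutputFolder = "output/wordcount";  // hypothetical output folder
            return config;
        }
    }

    public class Program
    {
        public static void Main()
        {
            // Connect to the cluster and execute the job; the SDK ships the
            // assemblies to the cluster and drives the streaming submission.
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<WordCountJob>();
        }
    }

Because everything here is ordinary .NET, the same ExecuteJob call can just as easily be invoked from PowerShell, which is the pattern the rest of this chapter builds on.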
 