Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java. Hadoop Streaming uses Unix standard streams as
the interface between Hadoop and your program, so you can use any language that can read
standard input and write to standard output to write your MapReduce program. [22]
Streaming is naturally suited for text processing. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format (a tab-separated key-value pair) passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.
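For example, a map output pair of (1950, +0022) travels between the two processes as a single line consisting of 1950, a tab character, and +0022; the receiving side splits the line apart again at the tab. The fragment below is a minimal sketch of that contract in Ruby (it is not part of the book's example code):
on_the_wire = "1950\t+0022"                    # what a map task writes to standard output
key, value = on_the_wire.chomp.split("\t", 2)  # how a reduce task recovers the pair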
Let's illustrate this by rewriting our MapReduce program for finding maximum temperatures by year in Streaming.
Ruby
The map function can be expressed in Ruby as shown in Example 2-7.
Example 2-7. Map function for maximum temperature in Ruby
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15, 4], val[87, 5], val[92, 1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program iterates over lines from standard input by executing a block for each line from STDIN (a global constant of type IO). The block pulls out the relevant fields from each input line and, if the temperature is valid, writes the year and the temperature separated by a tab character, \t, to standard output (using puts).
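The reduce function is not shown in this excerpt, but the contract described earlier (sorted, tab-separated key-value lines arriving on standard input) is enough to sketch one. The script below is a minimal sketch rather than the book's own reduce example: it assumes the framework has already sorted the lines by year, and it keeps a running maximum until the year changes.
#!/usr/bin/env ruby
last_key, max_val = nil, nil
STDIN.each_line do |line|
  key, val = line.chomp.split("\t")
  if last_key && last_key != key
    # A new year has started, so emit the previous year's maximum
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key = key
    max_val = max_val ? [max_val, val.to_i].max : val.to_i
  end
end
puts "#{last_key}\t#{max_val}" if last_key   # emit the final year
Because both scripts simply read standard input and write standard output, the whole pipeline can be tried out locally with a Unix pipe, feeding a few sample records through the map script, sort, and then the reduce script, before running it on Hadoop.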