Working with Imported Data
Once data has been imported to HDFS, it is ready for processing by custom MapReduce
programs. Text-based imports can easily be used in scripts run with Hadoop Streaming or
in MapReduce jobs run with the default TextInputFormat.
To use individual fields of an imported record, though, the field delimiters (and any
escape/enclosing characters) must be parsed and the field values extracted and converted to
the appropriate data types. For example, the ID of the “sprocket” widget is represented as
the string "1" in the text file, but should be parsed into an Integer or int variable in Java.
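To make the manual parsing step concrete, here is a minimal sketch in plain Java, with no Sqoop or Hadoop classes involved. It assumes comma-delimited fields with no escape or enclosing characters, and the class name, field names, and Java types are illustrative guesses based on the sample record shown in this section, not the actual table schema.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical hand-written parser; this is NOT the class Sqoop generates.
class ManualWidgetParsing {

    // Simple holder for the converted field values (names are assumptions).
    static final class ParsedWidget {
        final int id;
        final String name;
        final BigDecimal price;
        final LocalDate designDate;
        final int version;
        final String comment;

        ParsedWidget(int id, String name, BigDecimal price,
                     LocalDate designDate, int version, String comment) {
            this.id = id;
            this.name = name;
            this.price = price;
            this.designDate = designDate;
            this.version = version;
            this.comment = comment;
        }
    }

    // Splits one line of a text import and converts each field by hand.
    // A plain split breaks as soon as a value contains an escaped or
    // enclosed comma, which is part of why hand parsing is tedious.
    static ParsedWidget parse(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        return new ParsedWidget(
                Integer.parseInt(fields[0]),   // "3" becomes the int 3
                fields[1],
                new BigDecimal(fields[2]),
                LocalDate.parse(fields[3]),    // ISO yyyy-MM-dd date
                Integer.parseInt(fields[4]),
                fields[5]);
    }

    public static void main(String[] args) {
        ParsedWidget w = parse("3,gadget,99.99,1983-08-13,13,Our flagship product");
        System.out.println(w.id + " " + w.name + " " + w.price);
    }
}
```

Every column needs its own conversion, and the delimiter handling has to match the options used at import time.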
The generated table class provided by Sqoop can automate this process, allowing you to
focus on the actual MapReduce job to run. Each autogenerated class has several overloaded
methods named parse() that operate on the data represented as Text,
CharSequence, char[], or other common types.
The MapReduce application called MaxWidgetId (available in the example code) will
find the widget with the highest ID. The class can be compiled into a JAR file along with
Widget.java using the Maven POM that comes with the example code. The JAR file is
called sqoop-examples.jar, and is executed like so:
% HADOOP_CLASSPATH=$SQOOP_HOME/sqoop-version.jar hadoop jar \
> sqoop-examples.jar MaxWidgetId -libjars $SQOOP_HOME/sqoop-version.jar
This command line ensures that Sqoop is on the classpath locally (via
$HADOOP_CLASSPATH) when running the MaxWidgetId.run() method, as well as
when map tasks are running on the cluster (via the -libjars argument).
After the job has run, the maxwidget path in HDFS will contain a file named part-r-00000
with the following expected result:
3,gadget,99.99,1983-08-13,13,Our flagship product
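The selection that produces this result can be sketched without Hadoop at all. The following is only an illustration of the "keep the record with the highest ID" comparison, written as a plain Java loop over text lines; it is not the MaxWidgetId source, which does this work across a mapper and a reducer. The class and method names are hypothetical, and the first two sample rows are illustrative values (only the gadget row appears in the text above).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch of the max-by-ID comparison.
class MaxIdSketch {

    // Reads the leading ID field of a comma-delimited record.
    static int idOf(String line) {
        return Integer.parseInt(line.substring(0, line.indexOf(',')));
    }

    // Returns the record with the highest ID, or null for an empty list.
    static String maxById(List<String> lines) {
        String best = null;
        for (String line : lines) {
            if (best == null || idOf(line) > idOf(best)) {
                best = line;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,sprocket,0.25,2010-02-10,1,Connects two gizmos", // illustrative row
                "2,gizmo,4.00,2009-11-30,4,null",                   // illustrative row
                "3,gadget,99.99,1983-08-13,13,Our flagship product");
        System.out.println(maxById(lines));
    }
}
```

In a MapReduce job, the same comparison would typically run once per mapper over its input split and once more in a single reducer over the per-mapper winners.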
It is worth noting that in this example MapReduce program, a Widget object was emitted
from the mapper to the reducer; the autogenerated Widget class implements the Writable
interface provided by Hadoop, which allows the object to be sent via Hadoop's
serialization mechanism, as well as written to and read from SequenceFiles.
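Hadoop's Writable interface consists of two methods, write(DataOutput) and readFields(DataInput). The sketch below shows the round trip they enable, using only java.io so that it runs without Hadoop on the classpath. WidgetLike is a hypothetical two-field stand-in: it mirrors the two method signatures but does not actually implement Writable, and it is not the class Sqoop generates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that follows the shape of Hadoop's Writable contract.
class WidgetLike {
    int id;
    String name;

    // Serializes the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    // Restores the fields in the same order they were written.
    void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }

    // Writes the object to a byte buffer and reads it back into a fresh
    // instance, much as a value is rebuilt on the reducer side.
    static WidgetLike roundTrip(WidgetLike original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WidgetLike copy = new WidgetLike();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WidgetLike w = new WidgetLike();
        w.id = 3;
        w.name = "gadget";
        WidgetLike copy = roundTrip(w);
        System.out.println(copy.id + "," + copy.name);
    }
}
```

Because the reading side populates an empty instance, a class used this way needs a no-argument constructor and must read its fields in exactly the order it wrote them.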
The MaxWidgetId example is built on the new MapReduce API. MapReduce applications
that rely on Sqoop-generated code can be built on the new or old APIs, though some
advanced features (such as working with large objects) are more convenient to use in the
new API.