Avro-based imports can be processed using the APIs described in Avro MapReduce. With the Generic Avro mapping, the MapReduce program does not need to use schema-specific generated code (although this is an option, too, by using Avro's Specific compiler; Sqoop does not do the code generation in this case). The example code includes a program called MaxWidgetIdGenericAvro, which finds the widget with the highest ID and writes out the result in an Avro datafile.
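To give a flavor of the Generic mapping, here is a minimal sketch of a program in the same spirit (it is not the MaxWidgetIdGenericAvro example itself). It assumes the import produced Avro datafiles whose records carry the primary key in a field named id, and it simply reports the maximum ID as text rather than writing the winning record back out as an Avro datafile:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxWidgetIdSketch extends Configured implements Tool {

  // Reads Avro records generically; no generated Widget class is needed.
  public static class MaxIdMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable, NullWritable, LongWritable> {
    private long maxId = Long.MIN_VALUE;

    @Override
    protected void map(AvroKey<GenericRecord> record, NullWritable ignore, Context context) {
      // "id" is assumed to be the name of the imported primary-key column.
      maxId = Math.max(maxId, ((Number) record.datum().get("id")).longValue());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Emit only the mapper's local maximum.
      if (maxId != Long.MIN_VALUE) {
        context.write(NullWritable.get(), new LongWritable(maxId));
      }
    }
  }

  // A single reducer picks the global maximum from the per-mapper maxima.
  public static class MaxIdReducer
      extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<LongWritable> ids, Context context)
        throws IOException, InterruptedException {
      long max = Long.MIN_VALUE;
      for (LongWritable id : ids) {
        max = Math.max(max, id.get());
      }
      context.write(NullWritable.get(), new LongWritable(max));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "Max widget ID (generic Avro)");
    job.setJarByClass(MaxWidgetIdSketch.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // Sqoop's Avro output directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // No explicit reader schema is set, so the records are read with the
    // writer schema stored in the datafiles (the schema Sqoop generated).
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setMapperClass(MaxIdMapper.class);
    job.setReducerClass(MaxIdReducer.class);
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(LongWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MaxWidgetIdSketch(), args));
  }
}

A single reducer is sufficient here because each mapper emits only its local maximum, so very little data crosses the shuffle.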
Imported Data and Hive
As we'll see in Chapter 17, for many types of analysis, using a system such as Hive to handle relational operations can dramatically ease the development of the analytic pipeline. Especially for data originally from a relational data source, using Hive makes a lot of sense. Hive and Sqoop together form a powerful toolchain for performing analysis.
Suppose we had another log of data in our system, coming from a web-based widget purchasing system. This might return logfiles containing a widget ID, a quantity, a shipping address, and an order date.
Here is a snippet from an example log of this type:
1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18
By using Hadoop to analyze this purchase log, we can gain insight into our sales operation. By combining this data with the data extracted from our relational data source (the widgets table), we can do better. In this example session, we will compute which zip code is responsible for the most sales dollars, so we can better focus our sales team's operations. Doing this requires data from both the sales log and the widgets table.
The data shown in the previous snippet should be saved in a local file named sales.log for this to work.
First, let's load the sales data into Hive:
hive> CREATE TABLE sales(widget_id INT, qty INT,
> street STRING, city STRING, state STRING,
> zip INT, sale_date STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 5.248 seconds
hive> LOAD DATA LOCAL INPATH "ch15-sqoop/sales.log" INTO TABLE sales;
...
Loading data to table default.sales
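With the sales data in place, the zip-code analysis itself can be expressed as a join between the sales table and the widgets table, once the latter has also been imported into Hive (for example, with Sqoop's --hive-import option). A query along these lines would rank zip codes by sales dollars; the widgets column names used here (id and price) are assumptions about the imported schema rather than output from the session above:

hive> SELECT s.zip, SUM(w.price * s.qty) AS sales_vol
    > FROM sales s JOIN widgets w ON (s.widget_id = w.id)
    > GROUP BY s.zip
    > ORDER BY sales_vol DESC;

The first row of the result identifies the zip code responsible for the most sales dollars.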