scale out use of Apache Pig for a large organization; however, that will not typically be
the case in many Enterprise verticals.
Comparing with Apache Hive
Now let's take a look at Apache Hive. You'll need to install Hive according to the documentation, in particular the “Getting Started” page in the wiki. Unpack the download, set the HIVE_HOME environment variable, and include the Hive binary in your PATH as well.
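As a minimal sketch of that setup, assuming the release was unpacked to /opt/hive (a hypothetical path; substitute wherever you actually unpacked the tarball):

```shell
# Hypothetical install location; adjust to wherever you unpacked Hive.
export HIVE_HOME=/opt/hive

# Put the hive launcher script on the command line search path.
export PATH=$PATH:$HIVE_HOME/bin
```

Adding these lines to your shell profile keeps them in effect across sessions.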
Starting from the source code directory that you cloned in Git, change into the part4 subdirectory. The file src/scripts/wc.q shows source for an Apache Hive script that approximates the Cascading code in “Example 4: Replicated Joins”. To run this:
$ rm -rf derby.log metastore_db/
$ hive -hiveconf hive.metastore.warehouse.dir=/tmp < src/scripts/wc.q
The first line clears out any metadata left over from a previous run; otherwise the jobs would fail. For larger apps, Hive requires a metadata store in a relational database. However, the Hive examples here can use the embedded Derby metastore.
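For reference, the metastore connection is configured in hive-site.xml; the embedded Derby arrangement corresponds to settings like the following (these are Hive's stock Derby connection properties, not anything specific to this example):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
```

For a shared deployment you would point the ConnectionURL at a standalone relational database instead.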
For the sake of space, we don't show all the output from Hive. An example is shown in the GitHub gist for “Example 4: Replicated Joins”.
Looking at that Hive source code, first we prepare the data definition language (DDL)
for loading the raw data:
CREATE TABLE raw_docs (doc_id STRING, text STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE raw_stop (stop STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE raw_docs;

LOAD DATA LOCAL INPATH 'data/en.stop'
OVERWRITE INTO TABLE raw_stop;
Next, we strip off the headers from the TSV files (anybody know a better approach for
this?):
CREATE TABLE docs (doc_id STRING, text STRING);

INSERT OVERWRITE TABLE docs