scale out use of Apache Pig for a large organization; however, that will not typically be
the case in many Enterprise verticals.
Comparing with Apache Hive
Now let's take a look at Apache Hive. You'll need to install Hive according to the documentation, and in particular the “Getting Started” page in the wiki. Unpack the download, set the HIVE_HOME environment variable, and include the Hive binary in your PATH as well.
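For reference, that environment setup might look like the following — a minimal sketch that assumes Hive was unpacked under /opt/hive (adjust the path to wherever your download landed):

```shell
# hypothetical install location -- adjust to wherever you unpacked the Hive download
export HIVE_HOME=/opt/hive
# put the Hive launcher scripts on the PATH
export PATH=$HIVE_HOME/bin:$PATH
```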
Starting from the source code directory that you cloned in Git, connect into the part4 subdirectory. The script in src/scripts/wc.q approximates the Cascading code in “Example 4: Replicated Joins”. To run this:
$ rm -rf derby.log metastore_db/
$ hive -hiveconf hive.metastore.warehouse.dir=/tmp < src/scripts/wc.q
The first line clears out any metadata left over from a previous run; otherwise the jobs would fail. For larger apps, Hive requires a metadata store in some relational database. However, the Hive examples here can use an embedded metastore.
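As a sketch of what the relational-backed setup looks like, a hive-site.xml pointing the metastore at MySQL would carry properties along these lines — the host, database name, and credentials here are placeholders:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

The examples in this chapter skip all of that and fall back to the embedded Derby metastore, which is why clearing derby.log and metastore_db/ between runs is enough.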
For the sake of space, we don't show all the output from Hive. An example is shown in the GitHub gist for “Example 4: Replicated Joins”.
Looking at that Hive source code, first we prepare the data definition language (DDL)
for loading the raw data:
CREATE TABLE raw_docs (doc_id STRING, text STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
;

CREATE TABLE raw_stop (stop STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
;

LOAD DATA LOCAL INPATH 'data/rain.txt'
  OVERWRITE INTO TABLE raw_docs
;

LOAD DATA LOCAL INPATH 'data/en.stop'
  OVERWRITE INTO TABLE raw_stop
;
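Before moving on, a quick sanity check can confirm that the loads worked. This is just a sketch against the two tables defined above; the counts should match the line counts of the input files, headers included at this point:

```sql
-- rough sanity check: row counts should match the input files
SELECT COUNT(*) FROM raw_docs;
-- spot-check a few stop words
SELECT * FROM raw_stop LIMIT 5;
```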
Next, we strip off the headers from the TSV files (anybody know a better approach for
this?):
CREATE TABLE docs (doc_id STRING, text STRING);

INSERT OVERWRITE TABLE docs