scale out use of Apache Pig for a large organization; however, that will not typically be
the case in many Enterprise verticals.
Comparing with Apache Hive
Now let's take a look at Apache Hive. You'll need to install Hive according to the documentation, and in particular the “Getting Started” page in the wiki. Unpack the download, set the HIVE_HOME environment variable, and include the Hive binary in your PATH as well.
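For reference, that environment setup might look like the following — a minimal sketch that assumes Hive was unpacked under /opt/hive (adjust the path to wherever your download landed):

```shell
# hypothetical install location -- adjust to wherever you unpacked the Hive download
export HIVE_HOME=/opt/hive
# put the Hive launcher scripts on the PATH
export PATH=$HIVE_HOME/bin:$PATH
```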
Starting from the source code directory that you cloned in Git, connect into the part4 subdirectory. The script in src/scripts/wc.q approximates the Cascading code in “Example 4: Replicated Joins”. To run this:
$ rm -rf derby.log metastore_db/
$ hive -hiveconf hive.metastore.warehouse.dir=/tmp < src/scripts/wc.q
The first line clears out any metadata left over from a previous run; otherwise the jobs would fail. For larger apps, Hive requires a metadata store in some relational database. However, the Hive examples here can use an embedded metastore.
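As a sketch of what the relational-backed setup looks like, a hive-site.xml pointing the metastore at MySQL would carry properties along these lines — the host, database name, and credentials here are placeholders:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

The examples in this chapter skip all of that and fall back to the embedded Derby metastore, which is why clearing derby.log and metastore_db/ between runs is enough.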
For the sake of space, we don't show all the output from Hive. An example is shown in the GitHub gist for “Example 4: Replicated Joins”.
Looking at that Hive source code, first we prepare the data definition language (DDL)
for loading the raw data:
CREATE TABLE raw_docs (doc_id STRING, text STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
;

CREATE TABLE raw_stop (stop STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
;

LOAD DATA LOCAL INPATH 'data/rain.txt'
  OVERWRITE INTO TABLE raw_docs
;

LOAD DATA LOCAL INPATH 'data/en.stop'
  OVERWRITE INTO TABLE raw_stop
;
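Before moving on, a quick sanity check can confirm that the loads worked. This is just a sketch against the two tables defined above; the counts should match the line counts of the input files, headers included at this point:

```sql
-- rough sanity check: row counts should match the input files
SELECT COUNT(*) FROM raw_docs;
-- spot-check a few stop words
SELECT * FROM raw_stop LIMIT 5;
```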
Next, we strip off the headers from the TSV files (anybody know a better approach for
this?):
CREATE TABLE docs (doc_id STRING, text STRING);

INSERT OVERWRITE TABLE docs