> rating INT,
> tstamp STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '#'
> STORED AS TEXTFILE;
OK
Time taken: 0.169 seconds
hive_movielens.txt
Hive includes utilities to load data sets from flat files using the LOAD DATA command. The source
can be the local filesystem or an HDFS volume. The command signature is as follows:
LOAD DATA LOCAL INPATH <'path/to/flat/file'> OVERWRITE INTO TABLE <table name>;
No validation is performed at load time, so it is the developer's responsibility to ensure that
the flat file's data format matches the table schema. The syntax allows you to specify the source
as either the local filesystem or HDFS: including LOCAL after LOAD DATA tells the command
that the source is on the local filesystem, while omitting LOCAL means the data is in HDFS.
When the flat file is already in HDFS, the data is only moved into the Hive HDFS namespace.
Because the operation is an HDFS file move, it is much faster than a data load from the local
filesystem. The data loading command also lets you either overwrite the data in an existing table
or append to it: including OVERWRITE in the command overwrites the table's contents, while
omitting it appends the new data.
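As a sketch of the variants just described (the file paths here are hypothetical, for illustration only):

hive> -- Append from the local filesystem (no OVERWRITE, so existing rows are kept)
hive> LOAD DATA LOCAL INPATH '/path/to/more_ratings.dat'
    > INTO TABLE ratings;
hive> -- Load from HDFS (no LOCAL): a fast file move within HDFS
hive> LOAD DATA INPATH '/user/hadoop/ratings.dat'
    > OVERWRITE INTO TABLE ratings;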
The MovieLens data is downloaded to the local filesystem. A slightly modified copy of the data is
prepared by replacing the delimiter :: with #. The prepared data set is then loaded into the Hive
HDFS namespace. The command for data loading is as follows:
hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
> OVERWRITE INTO TABLE ratings;
Copying data from file:/path/to/ratings.dat.hash_delimited
Loading data to table ratings
OK
Time taken: 0.803 seconds
Available for download on Wrox.com: hive_movielens.txt
The MovieLens ratings data that was just loaded into the Hive table contains over a million records.
You can verify that using the familiar SELECT COUNT idiom as follows:
hive> SELECT COUNT(*) FROM ratings;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
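The hints printed above can be applied as session settings before running the query. For example, to cap the number of reducers (the value here is illustrative, not a recommendation):

hive> set hive.exec.reducers.max=4;
hive> SELECT COUNT(*) FROM ratings;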