In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201102211022_0012, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201102211022_0012
Kill Command = /Users/tshanky/Applications/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201102211022_0012
2011-02-21 15:36:50,627 Stage-1 map = 0%, reduce = 0%
2011-02-21 15:36:56,819 Stage-1 map = 100%, reduce = 0%
2011-02-21 15:37:01,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201102211022_0012
OK
1000209
Time taken: 21.355 seconds
hive_movielens.txt
The output confirms that more than a million ratings records are in the table. The query also
confirms that the familiar SQL way of counting works in Hive. In the counting example, I
included the entire console output of the SELECT COUNT command to draw your attention to
two important points:
- Hive operations translate to MapReduce jobs.
- The latency of Hive operation responses is relatively high. It took 21.355 seconds to run a
  count. An immediate re-run does no better. It again takes about the same time, because no
  query caching mechanisms are in place.
Hive supports an extensive set of filter and aggregation queries. You can filter data sets using
the WHERE clause. Results can be grouped using the GROUP BY clause. Distinct values can be
listed with the help of the DISTINCT keyword, and two tables can be combined using the JOIN
operation. In addition, you can write custom scripts to manipulate data and plug them into the
map and reduce phases of a query.
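The clauses just mentioned can be sketched in a few short queries. This is only an illustrative sketch: the ratings column names (userid, movieid, rating) are assumptions, because the ratings table definition is not shown in this section, and the JOIN assumes the movies table created later in this section has already been loaded.

```sql
-- Assumed schema (not shown in this section): ratings(userid, movieid, rating, ...)

-- WHERE: keep only the rows that match a condition
SELECT * FROM ratings WHERE rating = 5 LIMIT 10;

-- GROUP BY: count how many ratings each movie received
SELECT movieid, COUNT(*) AS num_ratings
FROM ratings
GROUP BY movieid;

-- DISTINCT: list the different rating values that occur
SELECT DISTINCT rating FROM ratings;

-- JOIN: combine ratings with movie titles on the shared movieid column
SELECT m.title, r.rating
FROM ratings r JOIN movies m ON (r.movieid = m.movieid)
LIMIT 10;
```

As with the count, each of these queries is likely to run as one or more MapReduce jobs, so expect similar latency even for small results.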
To learn more about Hive's capabilities and its powerful query mechanisms, let's also load the
movies and users data sets from the MovieLens data set into corresponding tables. This
provides a good sample set for trying out Hive features. Each row in the movies data set is in
the following format: MovieID::Title::Genres. MovieID is an integer, and Title and Genres are
strings. The Genres string contains multiple values in a pipe-delimited format. In the first
pass, you create a movies table as follows:
As with the ratings data, the original delimiter in movies.dat is changed from :: to #.
hive> CREATE TABLE movies(
> movieid INT,
> title STRING,
> genres STRING)
> ROW FORMAT DELIMITED