Analyzing Big Data with Hive - Professional NoSQL - page 241


In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201102211022_0012, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201102211022_0012
Kill Command = /Users/tshanky/Applications/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201102211022_0012
2011-02-21 15:36:50,627 Stage-1 map = 0%, reduce = 0%
2011-02-21 15:36:56,819 Stage-1 map = 100%, reduce = 0%
2011-02-21 15:37:01,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201102211022_0012
OK
1000209
Time taken: 21.355 seconds

hive_movielens.txt

The output confirms that the table holds more than a million ratings records (1,000,209, to be exact). The query also confirms that the familiar SQL way of counting works in Hive. In the counting example, I included the entire console output of the SELECT COUNT command to draw your attention to two important points:

➤ Hive operations translate to MapReduce jobs.

➤ The latency of Hive operation responses is relatively high. It took 21.355 seconds to run a count. An immediate re-run does no better. It again takes about the same time, because no query caching mechanisms are in place.
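One way to see the first point for yourself is to ask Hive for the query plan instead of the result. The following is a minimal sketch; it assumes the ratings table counted above is named ratings. EXPLAIN prints the stages Hive plans to run, and for a count like this one the plan would be expected to include a map-reduce stage. The exact plan text varies by Hive version.

```sql
-- Print the execution plan rather than running the query.
-- Assumption: the ratings table from the counting example is named "ratings".
EXPLAIN
SELECT COUNT(*) FROM ratings;
```

Reading the plan does not launch a job, so it returns quickly, unlike the 21-second count itself.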

Hive supports a broad set of filter and aggregation queries. You can filter data sets using the WHERE clause. Results can be grouped using the GROUP BY clause. Distinct values can be listed with the help of the DISTINCT keyword, and two tables can be combined using the JOIN operation. In addition, you can write custom scripts to manipulate data and pass it on to your map and reduce functions.
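The clauses named above can be sketched against the ratings data as follows. The column names (userid, movieid, rating) are assumptions based on the MovieLens ratings format, and the join presumes a movies table with movieid and title columns; adjust the names to match your own schema.

```sql
-- Filter: only the top-rated entries.
-- Column names here are assumed, not taken from the text above.
SELECT userid, movieid, rating
FROM ratings
WHERE rating = 5
LIMIT 10;

-- Aggregate: how many ratings were given for each rating value.
SELECT rating, COUNT(*) AS num_ratings
FROM ratings
GROUP BY rating;

-- Distinct: list the different rating values that occur.
SELECT DISTINCT rating
FROM ratings;

-- Join: attach movie titles to ratings.
-- Assumes a movies table with movieid and title columns.
SELECT m.title, r.rating
FROM ratings r
JOIN movies m ON (r.movieid = m.movieid)
LIMIT 10;
```

As with the count, each of these statements that scans the table is likely to run as a MapReduce job, so expect similar latency even for small result sets.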

To learn more about Hive's capabilities and its powerful query mechanisms, let's also load the movies and users data sets from the MovieLens data set into corresponding tables. This provides a good sample set for exploring Hive features by trying them out. Each row in the movies data set is in the following format: MovieID::Title::Genres. MovieID is an integer, and Title and Genres are strings. The Genres string contains multiple values in a pipe-delimited format. In the first pass, you create a movies table as follows:

As with the ratings data, the original delimiter in movies.dat is changed from :: to #.

hive> CREATE TABLE movies(
    > movieid INT,
    > title STRING,
    > genres STRING)
    > ROW FORMAT DELIMITED
