Analyzing Big Data with Hive - Professional NoSQL - page 248

Databases Reference

In-Depth Information

4 835

5 820

Time taken: 24.908 seconds

You can include multiple aggregation functions, like count, sum, and average, in a single query as

long as they all operate on the same column. You are not allowed to run aggregation functions on

multiple columns in the same query.

To run aggregation at the map level, you could set hive.map.aggr to true and run a count query

as follows:

set hive.map.aggr=true;

SELECT COUNT(*) FROM ratings;

Hive QL also supports ordering of result sets in ascending and descending order using the ORDER BY

clause. To get all records from the movies tables ordered by movieid in descending order, you can

query as follows:

hive> SELECT * FROM movies

> ORDER BY movieid DESC;

Available for

download on

Wrox.com

hive_movielens.txt

Hive has another ordering facility. It's SORT BY , which is similar to ORDER BY in that it orders

records in ascending or descending order. However, unlike ORDER BY , SORT BY applies ordering on

a per-reducer basis. This means the fi nal result set may be partially ordered. All records managed by

the same reducer will be ordered but records across reducers will not be ordered.

Hive allows partitioning of data sets on the basis of a virtual column. You can distribute partitioned

data to separate reducers by using the DISTRIBUTE BY method. Data distributed to different

reducers can be sorted on a per-reducer basis. Shorthand for DISTRIBUTE BY and ORDER BY together

is CLUSTER BY .

Hive QL's SQL-like syntax and semantics is very inviting for developers who are familiar with

RDBMS and SQL and want to explore the world of large data processing with Hadoop using

familiar tools. SQL developers who start exploring Hive soon start craving their power tool: the

SQL join. Hive doesn't disappoint even in this facility. Hive QL supports joins.

JOIN(S) IN HIVE QL

Hive supports equality joins, outer joins, and left semi-joins. To get a list of movie ratings with

movie titles you can obtain the result set by joining the ratings and the movies tables. You can

query as follows:

hive> SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title

> FROM ratings JOIN movies

> ON (ratings.movieid = movies.movieid)

> LIMIT 5;

Next Page

Professional NoSQL

Search WWH ::

Custom Search

Home