Databases Reference
In-Depth Information
4 835
5 820
Time taken: 24.908 seconds
You can include multiple aggregation functions, like count, sum, and average, in a single query as
long as they all operate on the same column. You are not allowed to run aggregation functions on
multiple columns in the same query.
To run aggregation at the map level, you could set
hive.map.aggr
to
true
and run a count query
as follows:
set hive.map.aggr=true;
SELECT COUNT(*) FROM ratings;
Hive QL also supports ordering of result sets in ascending and descending order using the
ORDER BY
clause. To get all records from the
movies
tables ordered by
movieid
in descending order, you can
query as follows:
hive> SELECT * FROM movies
> ORDER BY movieid DESC;
Available for
download on
Wrox.com
hive_movielens.txt
Hive has another ordering facility. It's
SORT BY
, which is similar to
ORDER BY
in that it orders
records in ascending or descending order. However, unlike
ORDER BY
,
SORT BY
applies ordering on
a per-reducer basis. This means the fi nal result set may be partially ordered. All records managed by
the same reducer will be ordered but records across reducers will not be ordered.
Hive allows partitioning of data sets on the basis of a virtual column. You can distribute partitioned
data to separate reducers by using the
DISTRIBUTE BY
method. Data distributed to different
reducers can be sorted on a per-reducer basis. Shorthand for
DISTRIBUTE BY
and
ORDER BY
together
is
CLUSTER BY
.
Hive QL's SQL-like syntax and semantics is very inviting for developers who are familiar with
RDBMS and SQL and want to explore the world of large data processing with Hadoop using
familiar tools. SQL developers who start exploring Hive soon start craving their power tool: the
SQL join. Hive doesn't disappoint even in this facility. Hive QL supports joins.
JOIN(S) IN HIVE QL
Hive supports equality joins, outer joins, and left semi-joins. To get a list of movie ratings with
movie titles you can obtain the result set by joining the
ratings
and the
movies
tables. You can
query as follows:
hive> SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title
> FROM ratings JOIN movies
> ON (ratings.movieid = movies.movieid)
> LIMIT 5;