Database Reference
In-Depth Information
lar, a join of two tables that are bucketed on the same columns — which include the join
columns — can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working
with large datasets, it is very convenient to try out queries on a fraction of your dataset
while you are in the process of developing or refining them. We will see how to do effi-
cient sampling at the end of this section.
First, let's see how to tell Hive that a table should be bucketed. We use the CLUSTERED
BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing the
value and reducing modulo the number of buckets), so any particular bucket will effect-
ively have a random set of users in it.
In the map-side join case, where the two tables are bucketed in the same way, a mapper
processing a bucket of the left table knows that the matching rows in the right table are in
its corresponding bucket, so it need only retrieve that bucket (which is a small fraction of
all the data stored in the right table) to effect the join. This optimization also works when
the number of buckets in the two tables are multiples of each other; they do not have to
have exactly the same number of buckets. The HiveQL for joining two bucketed tables is
shown in Map joins .
The data within a bucket may additionally be sorted by one or more columns. This allows
even more efficient map-side joins, since the join of each bucket becomes an efficient
merge sort. The syntax for declaring that a table has sorted buckets is:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
How can we make sure the data in our table is bucketed? Although it's possible to load
data generated outside Hive into a bucketed table, it's often easier to get Hive to do the
bucketing, usually from an existing table.
WARNING
Hive does not check that the buckets in the datafiles on disk are consistent with the buckets in the table
definition (either in number or on the basis of bucketing columns). If there is a mismatch, you may get
an error or undefined behavior at query time. For this reason, it is advisable to get Hive to perform the
bucketing.
Search WWH ::




Custom Search