Take an unbucketed users table:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
To populate the bucketed table, we need to set the hive.enforce.bucketing property to true so that Hive knows to create the number of buckets declared in the table definition. Then it is just a matter of using the INSERT command:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
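For reference, a bucketed table matching this example would have been declared with a CLUSTERED BY clause. The column names below are assumptions inferred from the sample data; the bucket count of four matches the output files listed next:

```sql
-- Assumed schema: an INT id and a STRING name (inferred from the sample rows).
-- CLUSTERED BY (id) INTO 4 BUCKETS matches the four files Hive produces.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Make Hive honor the declared bucket count when populating the table:
SET hive.enforce.bucketing = true;
```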
Physically, each bucket is just a file in the table (or partition) directory. The filename is not important, but bucket n is the nth file when arranged in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job will produce as many buckets (output files) as reduce tasks. We can see this by looking at the layout of the bucketed_users table we just created. Running this command:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
shows that four files were created, with the following names (the names are generated by
Hive):
000000_0
000001_0
000002_0
000003_0
The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the integer itself, and the value is reduced modulo the number of buckets (four, in this case):
hive> dfs -cat /user/hive/warehouse/bucketed_users/000000_0;
0Nat
4Ann
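Since for an INT column the hash is the value itself, the bucket assignment can be checked with ordinary arithmetic; a sketch of such a query (not from the original text):

```sql
-- For INT columns, hash(id) = id, so the bucket is simply id % 4:
--   id 0 -> bucket 0, id 4 -> bucket 0, id 2 -> bucket 2, id 3 -> bucket 3.
SELECT id, id % 4 AS bucket FROM users;
```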
We can see the same thing by sampling the table using the TABLESAMPLE clause, which
restricts the query to a fraction of the buckets in the table rather than the whole table:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
4 Ann
0 Nat
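Bucket numbering in TABLESAMPLE is 1-based, and the denominator sets the fraction sampled: BUCKET x OUT OF y reads bucket x and then every yth bucket after it. As a sketch (this query and its behavior are an assumption based on the clause's documented semantics, not output from the original text), sampling half the buckets of this four-bucket table would read buckets 1 and 3:

```sql
-- BUCKET 1 OUT OF 2 on a table with four buckets reads every second
-- bucket starting from the first, i.e. physical buckets 1 and 3.
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
```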