Take an unbucketed users table:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
To populate the bucketed table, we need to set the hive.enforce.bucketing property to true so that Hive knows to create the number of buckets declared in the table definition. Then it is just a matter of using the INSERT command:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
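For reference, a bucketed table matching this example would have been declared with a CLUSTERED BY clause. The column names below are assumptions inferred from the sample data; the bucket count of four matches the output files listed next:

```sql
-- Assumed schema: an INT id and a STRING name (inferred from the sample rows).
-- CLUSTERED BY (id) INTO 4 BUCKETS matches the four files Hive produces.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Make Hive honor the declared bucket count when populating the table:
SET hive.enforce.bucketing = true;
```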
Physically, each bucket is just a file in the table (or partition) directory. The filename is not important, but bucket n is the nth file when arranged in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job will produce as many buckets (output files) as reduce tasks. We can see this by looking at the layout of the bucketed_users table we just created. Running this command:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
shows that four files were created, with the following names (the names are generated by
Hive):
000000_0
000001_0
000002_0
000003_0
The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the integer itself, and the value is reduced modulo the number of buckets (four, in this case):
hive> dfs -cat /user/hive/warehouse/bucketed_users/000000_0;
0Nat
4Ann
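Since for an INT column the hash is the value itself, the bucket assignment can be checked with ordinary arithmetic; a sketch of such a query (not from the original text):

```sql
-- For INT columns, hash(id) = id, so the bucket is simply id % 4:
--   id 0 -> bucket 0, id 4 -> bucket 0, id 2 -> bucket 2, id 3 -> bucket 3.
SELECT id, id % 4 AS bucket FROM users;
```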
We can see the same thing by sampling the table using the TABLESAMPLE clause, which
restricts the query to a fraction of the buckets in the table rather than the whole table:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
4 Ann
0 Nat
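Bucket numbering in TABLESAMPLE is 1-based, and the denominator sets the fraction sampled: BUCKET x OUT OF y reads bucket x and then every yth bucket after it. As a sketch (this query and its behavior are an assumption based on the clause's documented semantics, not output from the original text), sampling half the buckets of this four-bucket table would read buckets 1 and 3:

```sql
-- BUCKET 1 OUT OF 2 on a table with four buckets reads every second
-- bucket starting from the first, i.e. physical buckets 1 and 3.
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
```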