Database Reference
In-Depth Information
Partitions and Buckets
Hive organizes tables into partitions — a way of dividing a table into coarse-grained parts
based on the value of a partition column , such as a date. Using partitions can make it
faster to do queries on slices of the data.
Tables or partitions may be subdivided further into buckets to give extra structure to the
data that may be used for more efficient queries. For example, bucketing by user ID
means we can quickly evaluate a user-based query by running it on a randomized sample
of the total set of users.
Partitions
To take an example where partitions are commonly used, imagine logfiles where each re-
cord includes a timestamp. If we partition by date, then records for the same date will be
stored in the same partition. The advantage to this scheme is that queries that are restricted
to a particular date or set of dates can run much more efficiently, because they only need
to scan the files in the partitions that the query pertains to. Notice that partitioning doesn't
preclude more wide-ranging queries: it is still feasible to query the entire dataset across
many partitions.
A table may be partitioned in multiple dimensions. For example, in addition to partition-
ing logs by date, we might also subpartition each date partition by country to permit effi-
cient queries by location.
Partitions are defined at table creation time using the PARTITIONED BY clause, [ 112 ]
which takes a list of column definitions. For the hypothetical logfiles example, we might
define a table with records comprising a timestamp and the log line itself:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested subdirectories of the table directory.
After loading a few more files into the logs table, the directory structure might look like
this:
/user/hive/warehouse/logs
├── dt=2001-01-01/
├── country=GB/
Search WWH ::




Custom Search