Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Partitions and Buckets

Hive organizes tables into partitions — a way of dividing a table into coarse-grained parts

based on the value of a partition column , such as a date. Using partitions can make it

faster to do queries on slices of the data.

Tables or partitions may be subdivided further into buckets to give extra structure to the

data that may be used for more efficient queries. For example, bucketing by user ID

means we can quickly evaluate a user-based query by running it on a randomized sample

of the total set of users.

Partitions

To take an example where partitions are commonly used, imagine logfiles where each re-

cord includes a timestamp. If we partition by date, then records for the same date will be

stored in the same partition. The advantage to this scheme is that queries that are restricted

to a particular date or set of dates can run much more efficiently, because they only need

to scan the files in the partitions that the query pertains to. Notice that partitioning doesn't

preclude more wide-ranging queries: it is still feasible to query the entire dataset across

many partitions.

A table may be partitioned in multiple dimensions. For example, in addition to partition-

ing logs by date, we might also subpartition each date partition by country to permit effi-

cient queries by location.

Partitions are defined at table creation time using the PARTITIONED BY clause, [ 112 ]

which takes a list of column definitions. For the hypothetical logfiles example, we might

define a table with records comprising a timestamp and the log line itself:

CREATE TABLE logs (ts BIGINT, line STRING)

PARTITIONED BY (dt STRING, country STRING);

When we load data into a partitioned table, the partition values are specified explicitly:

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'

INTO TABLE logs

PARTITION (dt='2001-01-01', country='GB');

At the filesystem level, partitions are simply nested subdirectories of the table directory.

After loading a few more files into the logs table, the directory structure might look like

this:

/user/hive/warehouse/logs

├── dt=2001-01-01/

│

├── country=GB/

Search WWH ::

Custom Search

Home