Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

│ │ ├── file1

│ │ └── file2

│ └── country=US/

│ └── file3

└── dt=2001-01-02/

├── country=GB/

│ └── file4

└── country=US/

├── file5

└── file6

The logs table has two date partitions ( 2001-01-01 and 2001-01-02 , correspond-

ing to subdirectories called dt=2001-01-01 and dt=2001-01-02 ); and two country subpar-

titions ( GB and US , corresponding to nested subdirectories called country=GB and coun-

try=US ). The datafiles reside in the leaf directories.

We can ask Hive for the partitions in a table using SHOW PARTITIONS :

hive> SHOW PARTITIONS logs;

dt=2001-01-01/country=GB

dt=2001-01-01/country=US

dt=2001-01-02/country=GB

dt=2001-01-02/country=US

One thing to bear in mind is that the column definitions in the PARTITIONED BY clause

are full-fledged table columns, called partition columns; however, the datafiles do not

contain values for these columns, since they are derived from the directory names.

You can use partition columns in SELECT statements in the usual way. Hive performs in-

put pruning to scan only the relevant partitions. For example:

SELECT ts, dt, line

FROM logs

WHERE country='GB';

will only scan file1 , file2 , and file4 . Notice, too, that the query returns the values of the dt

partition column, which Hive reads from the directory names since they are not in the

datafiles.

Buckets

There are two reasons why you might want to organize your tables (or partitions) into

buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on

the table, which Hive can take advantage of when performing certain queries. In particu-

Search WWH ::

Custom Search

Home