Database Reference
In-Depth Information
│ │ ├── file1
│ │ └── file2
│ └── country=US/
│ └── file3
└── dt=2001-01-02/
├── country=GB/
│ └── file4
└── country=US/
├── file5
└── file6
The
logs
table has two date partitions (
2001-01-01
and
2001-01-02
, correspond-
ing to subdirectories called
dt=2001-01-01
and
dt=2001-01-02
); and two country subpar-
titions (
GB
and
US
, corresponding to nested subdirectories called
country=GB
and
coun-
try=US
). The datafiles reside in the leaf directories.
We can ask Hive for the partitions in a table using
SHOW PARTITIONS
:
hive>
SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
One thing to bear in mind is that the column definitions in the
PARTITIONED BY
clause
are full-fledged table columns, called partition columns; however, the datafiles do not
contain values for these columns, since they are derived from the directory names.
You can use partition columns in
SELECT
statements in the usual way. Hive performs
in-
put pruning
to scan only the relevant partitions. For example:
SELECT ts, dt, line
FROM logs
WHERE country='GB';
will only scan
file1
,
file2
, and
file4
. Notice, too, that the query returns the values of the
dt
partition column, which Hive reads from the directory names since they are not in the
datafiles.
Buckets
There are two reasons why you might want to organize your tables (or partitions) into
buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on
the table, which Hive can take advantage of when performing certain queries. In particu-