Database Reference
In-Depth Information
│ │ ├── file1
│ │ └── file2
│ └── country=US/
│ └── file3
└── dt=2001-01-02/
├── country=GB/
│ └── file4
└── country=US/
├── file5
└── file6
The logs table has two date partitions ( 2001-01-01 and 2001-01-02 , correspond-
ing to subdirectories called dt=2001-01-01 and dt=2001-01-02 ); and two country subpar-
titions ( GB and US , corresponding to nested subdirectories called country=GB and coun-
try=US ). The datafiles reside in the leaf directories.
We can ask Hive for the partitions in a table using SHOW PARTITIONS :
hive> SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
One thing to bear in mind is that the column definitions in the PARTITIONED BY clause
are full-fledged table columns, called partition columns; however, the datafiles do not
contain values for these columns, since they are derived from the directory names.
You can use partition columns in SELECT statements in the usual way. Hive performs in-
put pruning to scan only the relevant partitions. For example:
SELECT ts, dt, line
FROM logs
WHERE country='GB';
will only scan file1 , file2 , and file4 . Notice, too, that the query returns the values of the dt
partition column, which Hive reads from the directory names since they are not in the
datafiles.
Buckets
There are two reasons why you might want to organize your tables (or partitions) into
buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on
the table, which Hive can take advantage of when performing certain queries. In particu-
Search WWH ::




Custom Search