FROM MsBigData.customer c
DISTRIBUTE BY c.state
SORT BY c.state, c.postalCode;
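Written out in full, a statement of this shape could look like the sketch below. The SELECT list is an assumption chosen for illustration; the name, state, and postalCode columns are taken from the customer table used in this chapter.

```sql
-- Sketch only: the selected columns are assumed for illustration.
-- DISTRIBUTE BY sends all rows with the same state to the same reducer.
-- SORT BY then orders the rows within each reducer (not globally).
SELECT c.name, c.state, c.postalCode
FROM MsBigData.customer c
DISTRIBUTE BY c.state
SORT BY c.state, c.postalCode;
```

Note that only the final clause ends with a semicolon; a semicolon after DISTRIBUTE BY would end the statement before the SORT BY clause.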
Now that you have explored the basic operations in Hive, the next section
will address the more advanced features, like partitioning, views, and
indexes.
Using Advanced Data Structures with Hive
Hive offers several advanced features, aimed mainly at improving
performance and ease of use. This section covers the most common ones.
Setting Up Partitioned Tables
Like most relational databases, Hive supports partitioning, though the
implementation is different. Partitioned tables improve performance
because they let Hive reduce the amount of data it needs to process
to respond to queries.
The columns used for partitioning must not also appear in the table's
regular column list. For example, using the customer table example from
earlier, a logical partition choice would be the state column. To partition the
table by state, the state column would be removed from the column list and
added to the PARTITIONED BY clause:
CREATE TABLE MsBigData.customer (
name STRING,
city STRING,
postalCode STRING,
purchases MAP<STRING, DECIMAL>
)
PARTITIONED BY (state STRING);
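As a minimal sketch of how such a table might be used, the statements below load one partition and then query it. The staging table MsBigData.customer_staging and the 'WA' value are hypothetical and serve only as an illustration; they do not come from the text.

```sql
-- Assumption: MsBigData.customer_staging is a hypothetical unpartitioned
-- table that holds the same columns plus a regular state column.

-- Load the rows for one state into the matching partition. The state
-- value comes from the PARTITION clause, so it is not in the SELECT list.
INSERT INTO TABLE MsBigData.customer PARTITION (state = 'WA')
SELECT name, city, postalCode, purchases
FROM MsBigData.customer_staging
WHERE state = 'WA';

-- Filtering on the partition column lets Hive read only that partition.
SELECT name, city, postalCode
FROM MsBigData.customer
WHERE state = 'WA';

-- List the partitions that currently exist for the table.
USE MsBigData;
SHOW PARTITIONS customer;
```

Although state is not part of the main column list, it can still be referenced in queries like any other column.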
There can be multiple partition columns, but a column in the
PARTITIONED BY list cannot be repeated in the main body of the table;
Hive would consider it an ambiguous column. The restriction exists because
Hive stores the partition column values separately from the data in the
files. As discussed previously, Hive creates a directory to store the files
for managed tables. When a managed table is partitioned, Hive creates a