Adding Structure with Hive - Microsoft Big Data Solutions

FROM MsBigData.customer c
DISTRIBUTE BY c.state
SORT BY c.state, c.postalCode;

Now that you have explored the basic operations in Hive, the next section addresses more advanced features, such as partitioning, views, and indexes.

Using Advanced Data Structures with Hive

Hive has a number of advanced features, used primarily to improve performance and ease of use. This section covers the most common ones.

Setting Up Partitioned Tables

Like most relational databases, Hive supports partitioning, though the implementation is different. Partitioned tables improve performance because they let Hive narrow down the amount of data it needs to process to answer a query.

Columns used for partitioning must not also appear in the table's regular column list. For example, in the customer table from earlier, the state column is a logical partition choice. To partition the table by state, remove the state column from the column list and add it to the PARTITIONED BY clause:

CREATE TABLE MsBigData.customer (
    name STRING,
    city STRING,
    postalCode STRING,
    purchases MAP<STRING, DECIMAL>
)
PARTITIONED BY (state STRING);
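To illustrate how a table partitioned this way might be loaded and queried, here is a minimal sketch. It assumes the partitioned MsBigData.customer table defined above already exists. The customer_staging table is hypothetical (an unpartitioned table that still carries state as a regular column) and is introduced only for this example, as is the 'WA' value.

```sql
-- Sketch only. customer_staging is a hypothetical, unpartitioned source table
-- with columns name, city, state, postalCode, and purchases.
USE MsBigData;

-- Load a single partition. The partition value is supplied in the PARTITION
-- clause, so state does not appear in the SELECT list.
INSERT OVERWRITE TABLE customer PARTITION (state = 'WA')
SELECT s.name, s.city, s.postalCode, s.purchases
FROM customer_staging s
WHERE s.state = 'WA';

-- List the partitions Hive has registered for the table.
SHOW PARTITIONS customer;

-- Although state is not in the column list, it can be queried like any other
-- column. Filtering on it lets Hive read only the matching partition.
SELECT c.name, c.city, c.state
FROM customer c
WHERE c.state = 'WA';
```

The final query shows the performance benefit described earlier: a filter on the partition column limits the data Hive has to scan.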

A table can have multiple partition columns. Columns in the PARTITIONED BY list cannot be repeated in the main body of the table, because Hive considers those to be ambiguous columns: it stores the partition column values separately from the data in the files. As discussed previously, Hive creates a directory to store the files for managed tables. When a managed table is partitioned, Hive creates a
