INT — 4-byte integer
BIGINT — 8-byte integer
DOUBLE — double-precision floating point
STRING — sequence of characters
Noticeably missing is the Boolean type, which is usually handled as TINYINT. Hive also has complex types, such as structs, maps, and arrays, which can be nested. But they're currently not well supported in the language and are considered advanced topics.
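To make the type system concrete, here is a sketch of a table declaration mixing the primitive types above with nested complex types. The table and column names are hypothetical, invented for illustration:

```sql
-- Hypothetical table: primitive types alongside nested complex types.
CREATE TABLE user_profile (
    userid     BIGINT,
    age        INT,
    score      DOUBLE,
    name       STRING,
    is_active  TINYINT,                    -- Boolean handled as TINYINT (0/1)
    properties MAP<STRING, STRING>,        -- complex type: key/value pairs
    addresses  ARRAY<STRUCT<city:STRING, zip:STRING>>  -- nested complex type
);
```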
We can attach a descriptive comment to each column, as was done here for the ip column. In addition, we can attach a descriptive comment to the table itself:
COMMENT 'This is the page view table'
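Put together, the column comment and the table comment sit in the CREATE TABLE statement like this (a trimmed sketch of the kind of statement being discussed; partitioning and bucketing clauses are omitted here):

```sql
CREATE TABLE page_view (
    viewTime     INT,
    userid       BIGINT,
    page_url     STRING,
    referrer_url STRING,
    ip           STRING COMMENT 'IP Address of the User'  -- column comment
)
COMMENT 'This is the page view table';                    -- table comment
```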
The next part of the CREATE TABLE statement specifies the partition columns:
PARTITIONED BY (dt STRING, country STRING)
As we've discussed previously, partition columns are optimized for querying. They're distinct from the data columns of viewTime, userid, page_url, referrer_url, and ip. The value of a partition column for a particular row is not explicitly stored with the row; it's implied by the directory path under which the row's file sits. Even so, there's no syntactic difference between querying partition columns and querying data columns.
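For example, a query can filter on the partition columns dt and country exactly as it filters on a data column (the literal values here are hypothetical):

```sql
-- Filtering on partition columns looks identical to filtering on data
-- columns, but constraining dt and country lets Hive prune entire
-- partition directories instead of scanning every row.
SELECT page_url, ip
FROM   page_view
WHERE  dt = '2008-06-08'              -- partition column: prunes directories
  AND  country = 'US'                 -- partition column: prunes directories
  AND  referrer_url LIKE '%bbc.%';    -- data column: checked per row
```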
CLUSTERED BY (userid) INTO 32 BUCKETS
The CLUSTERED BY (...) INTO ... BUCKETS clause specifies the bucketing information: the column that random samples will be taken on and how many buckets to create. The choice of the number of buckets depends on two factors:
1. The size of your data under each partition
2. The size of the sample you intend to use
The first criterion is important because after you divide a partition into the specified
number of buckets, you wouldn't want each bucket file to be so small that it becomes
inefficient for Hadoop to handle. On the other hand, a bucket should be the same size
or smaller than your intended sample size. Bucketing by user into 32 buckets is a good
setting if your sample size will be about 3 percent (~1/32) of your user base.
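With that 32-bucket layout, a roughly 3 percent user sample can be drawn by reading just one bucket, using Hive's TABLESAMPLE clause:

```sql
-- Read 1 of the 32 buckets on userid: a ~3% (~1/32) sample of users.
-- Because the table is bucketed on userid, Hive can satisfy this by
-- scanning a single bucket file rather than the whole partition.
SELECT *
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```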
NOTE Unlike partitioning, Hive doesn't automatically enforce bucketing when
data is written to a table. Specifying bucketing information merely tells Hive
that you'll manually enforce the bucketing (sampling) criteria when data is
written to a table and that Hive can take advantage of it in processing queries.
To enforce the bucketing criteria you need to correctly set the number of
reducers when populating the table. More detail can be found at http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables.
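A load that enforces the bucketing criteria might look like the following sketch. The staging table name and partition values are hypothetical; the key points are setting the reducer count to match the bucket count and clustering on the bucketing column so each reducer writes one bucket file. (Later Hive versions can do this automatically via SET hive.enforce.bucketing = true;.)

```sql
-- One reducer per bucket: 32 reducers for 32 buckets.
SET mapred.reduce.tasks = 32;

-- CLUSTER BY userid routes each user's rows to a single reducer,
-- so each reducer's output becomes one correctly formed bucket file.
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='US')   -- hypothetical partition
SELECT viewTime, userid, page_url, referrer_url, ip
FROM   raw_page_view_stage                    -- hypothetical staging table
CLUSTER BY userid;
```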
The ROW FORMAT clause tells Hive how the table data is stored per row. Without this clause, Hive defaults to the newline character as the row delimiter and an ASCII value of 001 (Ctrl-A) as the field delimiter.
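To override the defaults, the clause spells out the delimiters explicitly. A sketch for a tab-delimited text file (table name is hypothetical):

```sql
-- Explicit row format: fields separated by tab, rows by newline.
CREATE TABLE tab_delimited_example (
    userid BIGINT,
    ip     STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```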