INT — 4-byte integer
BIGINT — 8-byte integer
DOUBLE — double-precision floating point
STRING — sequence of characters
Noticeably missing is the Boolean type, which is usually handled as TINYINT. Hive also has complex types, such as structs, maps, and arrays, which can be nested. But they're currently not well supported in the language and are considered advanced topics.
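To make the type system concrete, here is a sketch of a table declaration mixing the primitive types above with nested complex types. The table and column names are hypothetical, invented for illustration:

```sql
-- Hypothetical table: primitive types alongside nested complex types.
CREATE TABLE user_profile (
    userid     BIGINT,
    age        INT,
    score      DOUBLE,
    name       STRING,
    is_active  TINYINT,                    -- Boolean handled as TINYINT (0/1)
    properties MAP<STRING, STRING>,        -- complex type: key/value pairs
    addresses  ARRAY<STRUCT<city:STRING, zip:STRING>>  -- nested complex type
);
```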
We can attach a descriptive comment to each column, as was done here for the ip column. In addition, we can attach a descriptive comment to the table itself:
COMMENT 'This is the page view table'
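Put together, the column comment and the table comment sit in the CREATE TABLE statement like this (a trimmed sketch of the kind of statement being discussed; partitioning and bucketing clauses are omitted here):

```sql
CREATE TABLE page_view (
    viewTime     INT,
    userid       BIGINT,
    page_url     STRING,
    referrer_url STRING,
    ip           STRING COMMENT 'IP Address of the User'  -- column comment
)
COMMENT 'This is the page view table';                    -- table comment
```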
The next part of the CREATE TABLE statement specifies the partition columns:
PARTITIONED BY (dt STRING, country STRING)
As we've discussed previously, partition columns are optimized for querying. They're distinct from the data columns of viewTime, userid, page_url, referrer_url, and ip. The value of a partition column for a particular row is not explicitly stored with the row; it's implied by the directory path under which the row's file sits. Even so, there's no syntactic difference between querying partition columns and querying data columns.
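For example, a query can filter on the partition columns dt and country exactly as it filters on a data column (the literal values here are hypothetical):

```sql
-- Filtering on partition columns looks identical to filtering on data
-- columns, but constraining dt and country lets Hive prune entire
-- partition directories instead of scanning every row.
SELECT page_url, ip
FROM   page_view
WHERE  dt = '2008-06-08'              -- partition column: prunes directories
  AND  country = 'US'                 -- partition column: prunes directories
  AND  referrer_url LIKE '%bbc.%';    -- data column: checked per row
```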
CLUSTERED BY (userid) INTO 32 BUCKETS
The CLUSTERED BY (...) INTO ... BUCKETS clause specifies the bucketing information: the column that random samples will be taken on and how many buckets to create. The choice of the number of buckets depends on two factors:
1. The size of your data under each partition
2. The size of the sample you intend to use
The first criterion is important because after you divide a partition into the specified
number of buckets, you wouldn't want each bucket file to be so small that it becomes
inefficient for Hadoop to handle. On the other hand, a bucket should be the same size
or smaller than your intended sample size. Bucketing by user into 32 buckets is a good
setting if your sample size will be about 3 percent (~1/32) of your user base.
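With that 32-bucket layout, a roughly 3 percent user sample can be drawn by reading just one bucket, using Hive's TABLESAMPLE clause:

```sql
-- Read 1 of the 32 buckets on userid: a ~3% (~1/32) sample of users.
-- Because the table is bucketed on userid, Hive can satisfy this by
-- scanning a single bucket file rather than the whole partition.
SELECT *
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```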
NOTE Unlike partitioning, Hive doesn't automatically enforce bucketing when
data is written to a table. Specifying bucketing information merely tells Hive
that you'll manually enforce the bucketing (sampling) criteria when data is
written to a table and that Hive can take advantage of it in processing queries.
To enforce the bucketing criteria you need to correctly set the number of
reducers when populating the table. More detail can be found at http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables.
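A load that enforces the bucketing criteria might look like the following sketch. The staging table name and partition values are hypothetical; the key points are setting the reducer count to match the bucket count and clustering on the bucketing column so each reducer writes one bucket file. (Later Hive versions can do this automatically via SET hive.enforce.bucketing = true;.)

```sql
-- One reducer per bucket: 32 reducers for 32 buckets.
SET mapred.reduce.tasks = 32;

-- CLUSTER BY userid routes each user's rows to a single reducer,
-- so each reducer's output becomes one correctly formed bucket file.
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='US')   -- hypothetical partition
SELECT viewTime, userid, page_url, referrer_url, ip
FROM   raw_page_view_stage                    -- hypothetical staging table
CLUSTER BY userid;
```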
The ROW FORMAT clause tells Hive how the table data is stored per row. Without this clause, Hive defaults to the newline character as the row delimiter and an ASCII value of 001 (Ctrl-A) as the field delimiter.
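To override the defaults, the clause spells out the delimiters explicitly. A sketch for a tab-delimited text file (table name is hypothetical):

```sql
-- Explicit row format: fields separated by tab, rows by newline.
CREATE TABLE tab_delimited_example (
    userid BIGINT,
    ip     STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```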