Database Reference
In-Depth Information
Bucket numbering is 1-based, so this query retrieves all the users from the first of four
buckets. For a large, evenly distributed dataset, approximately one-quarter of the table's
rows would be returned. It's possible to sample a number of buckets by specifying a dif-
ferent proportion (which need not be an exact multiple of the number of buckets, as
sampling is not intended to be a precise operation). For example, this query returns half of
the buckets:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
4 Ann
0 Nat
2 Joe
Sampling a bucketed table is very efficient because the query only has to read the buckets
that match the TABLESAMPLE clause. Contrast this with sampling a nonbucketed table
using the rand() function, where the whole input dataset is scanned, even if only a very
small sample is needed:
hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe
Storage Formats
There are two dimensions that govern table storage in Hive: the row format and the file
format . The row format dictates how rows, and the fields in a particular row, are stored. In
Hive parlance, the row format is defined by a SerDe , a portmanteau word for a Serial-
izer-Deserializer.
When acting as a deserializer, which is the case when querying a table, a SerDe will
deserialize a row of data from the bytes in the file to objects used internally by Hive to op-
erate on that row of data. When used as a serializer, which is the case when performing an
INSERT or CTAS (see Importing Data ) , the table's SerDe will serialize Hive's internal
representation of a row of data into the bytes that are written to the output file.
The file format dictates the container format for fields in a row. The simplest format is a
plain-text file, but there are row-oriented and column-oriented binary formats available,
too.
The default storage format: Delimited text
When you create a table with no ROW FORMAT or STORED AS clauses, the default
format is delimited text with one row per line. [ 114 ]
Search WWH ::




Custom Search