Database Reference
In-Depth Information
Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format,
along with the line-oriented MapReduce text input and output formats we saw in
Chapter 8. The “lazy” prefix comes about because it deserializes fields lazily, only as
they are accessed. However, it is not a compact format because fields are stored in a
verbose textual format, so a Boolean value, for instance, is written as the literal string
true or false.
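For reference, the delimited text format can also be spelled out explicitly. The following
is a minimal sketch: the table name logs_text and its columns are hypothetical, and the
delimiters shown are Hive's defaults, so the statement should behave the same as one that
omits the ROW FORMAT and STORED AS clauses altogether.

```sql
-- Hypothetical table, for illustration only.
-- The delimiters below are Hive's defaults for delimited text.
CREATE TABLE logs_text (id INT, message STRING, is_error BOOLEAN)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- Control-A between fields
  COLLECTION ITEMS TERMINATED BY '\002'  -- Control-B between array/struct items
  MAP KEYS TERMINATED BY '\003'          -- Control-C between map keys and values
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

In a table like this, the is_error column is stored in each row as the text true or false
rather than as a single bit or byte, which is the verbosity described above.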
The simplicity of the format has a lot going for it, such as making it easy to process with
other tools, including MapReduce programs or Streaming, but there are more compact and
performant binary storage formats that you might consider using. These are discussed
next.
Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
Using a binary format is as simple as changing the STORED AS clause in the CREATE
TABLE statement. In this case, the ROW FORMAT is not specified, since the format is
controlled by the underlying binary file format.
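As a sketch of what this looks like in practice (the table name logs_seq and its columns
are hypothetical), a sequence file table might be declared as follows, with no ROW FORMAT
clause:

```sql
-- Hypothetical table, for illustration only.
-- Only the STORED AS clause selects the storage format.
CREATE TABLE logs_seq (id INT, message STRING, is_error BOOLEAN)
STORED AS SEQUENCEFILE;

-- Shows the input format, output format, and SerDe that Hive chose for the table.
DESCRIBE FORMATTED logs_seq;
```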
Binary formats can be divided into two categories: row-oriented formats and
column-oriented formats. Generally speaking, column-oriented formats work well when queries
access only a small number of columns in the table, whereas row-oriented formats are
appropriate when a large number of columns of a single row are needed for processing at the
same time.
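To make the distinction concrete, consider two queries against a hypothetical wide table
named events (the table and its country and event_id columns are assumptions made for
illustration). This is only a rough guide to which access pattern tends to suit which
layout, not a benchmark.

```sql
-- Reads a single column out of many: the kind of query that tends to
-- suit a column-oriented format, which can skip the unused columns.
SELECT country, COUNT(*) AS events_per_country
FROM events
GROUP BY country;

-- Needs every column of each matching row at once: the kind of access
-- for which a row-oriented format is usually appropriate.
SELECT *
FROM events
WHERE event_id = 42;
```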
The two row-oriented formats supported natively in Hive are Avro datafiles (see
Chapter 12) and sequence files (see SequenceFile). Both are general-purpose, splittable,
compressible formats; in addition, Avro supports schema evolution and multiple language
bindings. From Hive 0.14.0, a table can be stored in Avro format using:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
CREATE TABLE ... STORED AS AVRO;
Notice that compression is enabled for the table's data by setting the
hive.exec.compress.output and avro.output.codec properties.
Similarly, the declaration STORED AS SEQUENCEFILE can be used to store sequence
files in Hive. The properties for compression are listed in Using Compression in
MapReduce.
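A minimal sketch of a compressed sequence file table follows. It assumes a Hadoop 2 style
configuration, that the Snappy codec is available on the cluster, and that a source table
named logs already exists; the table names are hypothetical.

```sql
-- Compress the output of the statement that writes the table's data.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- BLOCK compresses groups of records together; RECORD compresses each one separately.
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Hypothetical tables: copy logs into a new sequence file table.
CREATE TABLE logs_seq_snappy
STORED AS SEQUENCEFILE
AS SELECT * FROM logs;
```

The SET commands apply to the current session, so they affect the data written by the
statement that follows rather than being recorded as part of the table definition.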
Hive has native support for the Parquet (see Chapter 13), RCFile, and ORCFile
column-oriented binary formats (see Other File Formats and Column-Oriented Formats). Here is