Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format, along with the line-oriented MapReduce text input and output formats we saw in
they are accessed. However, it is not a compact format because fields are stored in a verbose textual format, so a Boolean value, for instance, is written as the literal string true or false.
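As a rough illustration of the delimited text format described above, the following sketch spells out a text table explicitly. The table name text_example and its columns are hypothetical; the delimiters shown are the ones Hive uses by default when no ROW FORMAT is given, so the clauses could be dropped without changing the result.

```sql
-- Hypothetical table, used only for illustration.
-- The delimiters below are Hive's defaults, written out explicitly.
CREATE TABLE text_example (id INT, name STRING, active BOOLEAN)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

With this layout, the active column of each row is stored as the text true or false rather than as a single bit or byte, which is the verbosity the paragraph above refers to.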
The simplicity of the format has a lot going for it, such as making it easy to process with other tools, including MapReduce programs or Streaming, but there are more compact and performant binary storage formats that you might consider using. These are discussed next.
Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
Using a binary format is as simple as changing the STORED AS clause in the CREATE TABLE statement. In this case, the ROW FORMAT is not specified, since the format is controlled by the underlying binary file format.
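A minimal sketch of that difference, assuming two hypothetical tables, events_text and events_binary, with the same columns: the text table carries a ROW FORMAT clause, while the binary table only names its storage format. The final statement is one plausible way to copy existing rows into the binary table.

```sql
-- Hypothetical text table: the row format is spelled out.
CREATE TABLE events_text (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hypothetical binary table: no ROW FORMAT, only STORED AS.
CREATE TABLE events_binary (id INT, msg STRING)
STORED AS SEQUENCEFILE;

-- Rewrite the text rows into the binary table.
INSERT OVERWRITE TABLE events_binary
SELECT * FROM events_text;
```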
Binary formats can be divided into two categories: row-oriented formats and column-oriented formats. Generally speaking, column-oriented formats work well when queries access only a small number of columns in the table, whereas row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time.
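The two access patterns can be sketched with a pair of queries against a hypothetical wide table named readings (the table and its columns station_id, temperature, and reading_date are assumptions made for this example, not part of the original text).

```sql
-- Touches one column across many rows: the kind of query that
-- tends to favor a column-oriented format, since the other
-- columns need not be read.
SELECT AVG(temperature) FROM readings;

-- Needs every column of each matching row: the kind of query
-- that tends to favor a row-oriented format.
SELECT * FROM readings WHERE reading_date = '2014-01-01';
```

Which format performs better in practice depends on the data and the workload, so these queries are only meant to show the shape of each pattern.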
The two row-oriented formats supported natively in Hive are Avro datafiles (see Chapter 12) and sequence files (see SequenceFile). Both are general-purpose, splittable, compressible formats; in addition, Avro supports schema evolution and multiple language bindings. From Hive 0.14.0, a table can be stored in Avro format using:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
CREATE TABLE ... STORED AS AVRO;
Notice that compression is enabled on the table by setting the relevant properties.
Similarly, the declaration STORED AS SEQUENCEFILE can be used to store sequence files in Hive. The properties for compression are listed in Using Compression in MapRe-
Hive has native support for the Parquet (see Chapter 13), RCFile, and ORCFile column-oriented binary formats (see Other File Formats and Column-Oriented Formats). Here is