Unlike a text file, though, you can't open a binary file in your favorite
text editor and understand the data. Other applications can't understand
the data either, unless they have been built specifically to understand the
format. In some cases, though, the improved performance can offset the lack
of portability of the binary file formats.
Hive natively supports several binary file formats. One option is the SequenceFile format. Sequence files consist of binary-encoded key/value pairs. This is a standard file format for Hadoop, so it is usable by many other tools in the Hadoop ecosystem.
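As a minimal sketch, a table can be declared to use this format with Hive's STORED AS clause; the table and column names here are hypothetical:

    -- Hypothetical table stored in Hadoop's SequenceFile format
    CREATE TABLE page_views (
      user_id INT,
      url     STRING
    )
    STORED AS SEQUENCEFILE;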
Another option is the RCFile format. RCFile uses a columnar storage
approach, rather than the row-based approach familiar to users of relational
systems. In the columnar approach, the values in a column are stored together and compressed, so a value that repeats across many rows can be stored once rather than once per row. This can compress the data a great deal, particularly when column values repeat for many rows. RCFiles are readable through Hive, but not by most other Hadoop tools.
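Declaring an RCFile-backed table follows the same pattern; again, the table definition is only illustrative:

    -- Hypothetical table using the columnar RCFile format
    CREATE TABLE page_views_rc (
      user_id INT,
      url     STRING
    )
    STORED AS RCFILE;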
A variation on the RCFile is the Optimized Row Columnar (ORC) file format. This format includes additional metadata in the file itself, such as lightweight indexes and column-level statistics, which can vastly speed up the querying of Hive data. It was released as part of Hive 0.11.
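A minimal sketch of an ORC-backed table, assuming Hive 0.11 or later; the table definition is hypothetical:

    -- Hypothetical table using the ORC format (Hive 0.11+)
    CREATE TABLE page_views_orc (
      user_id INT,
      url     STRING
    )
    STORED AS ORC;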
NOTE
Compression is an option for your Hadoop data, and Hive can decompress the data as needed for processing. Hive and Hadoop have native support for compressing and decompressing files on demand using a variety of compression codecs, including common formats such as Gzip and BZip2. This can be an alternative that allows you to get the benefits of smaller data files while still keeping the data in a text format.
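As a sketch, output compression can be enabled through session settings; the choice of the Gzip codec here is just one common option:

    -- Compress files written by Hive queries using the Gzip codec
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;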
If the data is in a binary or text format that Hive doesn't understand, custom logic can be developed to support it. The next section discusses how such custom formats can be implemented.