Unlike a text file, though, you can't open a binary file in your favorite
text editor and understand the data. Other applications can't understand
the data either, unless they have been built specifically to understand the
format. In some cases, though, the improved performance can offset the lack
of portability of the binary file formats.
Hive natively supports several binary file formats. One option is the SequenceFile format. Sequence files consist of binary-encoded key/value pairs. This is a standard file format for Hadoop, so it is usable by many other tools in the Hadoop ecosystem.
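As a minimal sketch, a table can be declared to use this format with Hive's STORED AS clause; the table and column names here are hypothetical:

    -- Hypothetical table stored in Hadoop's SequenceFile format
    CREATE TABLE page_views (
      user_id INT,
      url     STRING
    )
    STORED AS SEQUENCEFILE;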
Another option is the RCFile format. RCFile uses a columnar storage
approach, rather than the row-based approach familiar to users of relational
systems. In the columnar approach, the values in a column are stored together and compressed, so a value that repeats across many rows can be stored once rather than once per row. This can compress the data a great deal, particularly when column values repeat for many rows. RCFiles are readable through Hive, but not by most other Hadoop tools.
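Declaring an RCFile-backed table follows the same pattern; again, the table definition is only illustrative:

    -- Hypothetical table using the columnar RCFile format
    CREATE TABLE page_views_rc (
      user_id INT,
      url     STRING
    )
    STORED AS RCFILE;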
A variation on the RCFile is the Optimized Row Columnar (ORC) file format. This format includes additional metadata in the file itself, such as lightweight indexes and column-level statistics, which can vastly speed up the querying of Hive data. It was released as part of Hive 0.11.
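A minimal sketch of an ORC-backed table, assuming Hive 0.11 or later; the table definition is hypothetical:

    -- Hypothetical table using the ORC format (Hive 0.11+)
    CREATE TABLE page_views_orc (
      user_id INT,
      url     STRING
    )
    STORED AS ORC;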
NOTE
Compression is an option for your Hadoop data, and Hive can decompress the data as needed for processing. Hive and Hadoop have native support for compressing and decompressing files on demand using a variety of compression codecs, including common formats such as Gzip and BZip2. This can be an alternative that allows you to get the benefits of smaller data files while still keeping the data in a text format.
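As a sketch, output compression can be enabled through session settings; the choice of the Gzip codec here is just one common option:

    -- Compress files written by Hive queries using the Gzip codec
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;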
If the data is in a binary or text format that Hive doesn't understand, custom logic can be developed to support it. The next section discusses how such custom formats can be implemented.