Custom File and Record Formats
Hive leverages Hadoop's ability to use custom logic for processing files. A
full discussion of how to implement that custom logic is beyond the
scope of this chapter, but this section covers the basics.
First, you need to understand that Hive (and Hadoop in general) makes a
distinction between the file format and the record format. The file format
determines how records are stored in the file, and the record format
determines how individual fields are extracted from each record.
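The two layers can be illustrated with a small Python sketch. This is not how Hive implements them (Hive's implementations are Java classes); the function names read_records and parse_fields are hypothetical and exist only for this illustration. The sketch assumes a text file with one record per line and uses Ctrl-A (\x01), the default field delimiter for Hive text tables.

```python
import io

# Ctrl-A is the default field delimiter for Hive text tables.
FIELD_DELIMITER = "\x01"


def read_records(stream):
    """File-format layer: decide where one record ends and the next begins.

    Here a record is simply one line of text.
    """
    for line in stream:
        yield line.rstrip("\n")


def parse_fields(record, delimiter=FIELD_DELIMITER):
    """Record-format layer: split a single record into its fields."""
    return record.split(delimiter)


# Two records, three fields each, held in memory instead of a real file.
sample = io.StringIO("1\x01alice\x01US\n2\x01bob\x01CA\n")
rows = [parse_fields(record) for record in read_records(sample)]
print(rows)
```

Keeping the two functions separate mirrors the distinction in the text: the record boundary logic could be swapped (for example, to fixed-length records) without touching the field-splitting logic, and vice versa.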
By default, Hive uses the TEXTFILE format for the file format. You can
override this for each Hive table by specifying a custom input format and a
custom output format. The input format controls how records are read from
the file, and the output format controls how records are written to the file.
If the file format doesn't match one of the natively supported
formats, you must provide an implementation of both the input format and
output format, or Hive will not be able to use the file. Implementing the
custom input and output formats is usually done in Java, although Microsoft
is providing support for .NET-based implementations as well.
The record format is the next aspect to consider. As discussed already,
the default record format is text with delimiters between fields. If the
record format requires custom processing, you must provide a reference
to a serializer/deserializer (or SerDe). A SerDe implements the logic for
serializing the fields in a record to a specific record format and for
deserializing that record format back to the individual fields.
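The serialize/deserialize round trip can be sketched in Python as follows. The DelimitedSerDe class below is a hypothetical stand-in written for this illustration; it is not Hive's SerDe interface, which is implemented in Java, and it assumes every field can be represented as plain text.

```python
class DelimitedSerDe:
    """Hypothetical stand-in for a SerDe; not Hive's actual Java interface."""

    def __init__(self, columns, delimiter="\x01"):
        self.columns = list(columns)
        self.delimiter = delimiter

    def serialize(self, row):
        """Turn a dict of column values into one delimited record string."""
        return self.delimiter.join(str(row[name]) for name in self.columns)

    def deserialize(self, record):
        """Turn one delimited record string back into a dict of column values."""
        values = record.split(self.delimiter)
        return dict(zip(self.columns, values))


# A pipe is used here as an example of a nonstandard delimiter.
serde = DelimitedSerDe(["id", "name", "country"], delimiter="|")
record = serde.serialize({"id": 1, "name": "alice", "country": "US"})
row = serde.deserialize(record)
print(record)
print(row)
```

Note that this sketch returns every value as a string after deserialization. A real SerDe would also convert each field to the column's declared type.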
Hive includes a couple of standard SerDes. The delimited record format is
the default SerDe, and it can be customized to use different delimiters, in the
event that a file uses a record format with nonstandard delimiters.
One of the other included SerDes handles regular expressions. The
RegexSerDe is useful when processing web logs and other text files where
the format can vary but values can be extracted using pattern matching.
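The idea of extracting fields by pattern matching can be sketched in Python with the standard re module. This only illustrates the technique and is not Hive's implementation; the pattern, the field names, the sample log line, and the deserialize_log_line helper are all assumptions made for this example.

```python
import re

# Assumed pattern for a common web-server access log layout.
# Each named group plays the role of one column.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def deserialize_log_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()


line = ('192.0.2.10 - - [10/Oct/2013:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326')
fields = deserialize_log_line(line)
print(fields)
```

Because the fields are located by the pattern rather than by a single delimiter, the quoted request and the bracketed timestamp can contain spaces without being split apart.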