Text and Binary File Formats
Sqoop is capable of importing into a few different file formats. Text files (the default) offer a human-readable representation of data, platform independence, and the simplest structure. However, they cannot hold binary fields (such as database columns of type VARBINARY), and distinguishing between null values and String-based fields containing the value "null" can be problematic (although using the --null-string import option allows you to control the representation of null values).
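The ambiguity described above can be illustrated with a small sketch. This is not Sqoop code: the rows, the to_text_line helper, and the comma delimiter are hypothetical stand-ins, and the \N marker is only one possible value you might pass to --null-string.

```python
# Hypothetical rows: the first holds a real NULL (None in Python),
# the second holds the literal four-character string "null".
rows = [(1, None), (2, "null")]


def to_text_line(row, null_string="null"):
    """Render a row as a comma-delimited line, writing NULL as null_string."""
    return ",".join(null_string if value is None else str(value) for value in row)


# With NULL written as "null", both rows end in the same text,
# so a reader of the file cannot tell them apart.
default_lines = [to_text_line(row) for row in rows]

# With a distinct marker (here \N, an assumed choice), the NULL row
# stays distinguishable from the row holding the string "null".
marked_lines = [to_text_line(row, null_string="\\N") for row in rows]

for line in default_lines + marked_lines:
    print(line)
```

Whatever marker is chosen only helps if it cannot occur as a legitimate value in the column, which is a property of the data rather than of the file format.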
To handle these conditions, Sqoop also supports SequenceFiles, Avro datafiles, and Parquet files. These binary formats provide the most precise representation possible of the imported data. They also allow data to be compressed while retaining MapReduce's ability to process different sections of the same file in parallel. However, current versions of Sqoop cannot load Avro datafiles or SequenceFiles into Hive (although you can load Avro into Hive manually, and Parquet can be loaded directly into Hive by Sqoop). Another disadvantage of SequenceFiles is that they are Java specific, whereas Avro and Parquet files can be processed by a wide range of languages.
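The file format is chosen per import with one of the --as-textfile, --as-sequencefile, --as-avrodatafile, or --as-parquetfile options. The following sketch only assembles the command line, it does not run Sqoop. The sqoop_import_command helper, the JDBC URL, the table name, and the target directory are all assumptions made for illustration.

```python
import shlex

# Sqoop import options that select the output file format.
FORMAT_FLAGS = {
    "text": "--as-textfile",
    "sequencefile": "--as-sequencefile",
    "avro": "--as-avrodatafile",
    "parquet": "--as-parquetfile",
}


def sqoop_import_command(connect, table, target_dir, file_format="text"):
    """Build the argument list for a sqoop import in the given format.

    This is a hypothetical helper, not part of Sqoop; it raises KeyError
    for a format name that is not listed in FORMAT_FLAGS.
    """
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        FORMAT_FLAGS[file_format],
    ]


# Placeholder connection string, table, and directory (assumed values).
cmd = sqoop_import_command(
    "jdbc:mysql://localhost/hadoopguide", "widgets", "widgets_avro", "avro"
)
print(shlex.join(cmd))
```

Building the arguments as a list keeps each value a separate token, so the same list can be passed to subprocess.run on a machine where Sqoop is installed without any shell quoting concerns.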