Databases Reference
In-Depth Information
have the concept of key-value pairs and thus uses an empty object called NullWritable
in place of the value . For the key, Sqoop uses the generated class. For convenience, this
generated class is copied to the directory where Sqoop is executed. You will need to
integrate this generated class to your application if you need to read a Sqoop-generated
SequenceFile .
Apache Avro is a generic data serialization system. Specifying the --asavrodatafile
parameter instructs Sqoop to use its compact and fast binary encoding format. Avro is
a very generic system that can store any arbitrary data structures. It uses a concept called
schema to describe what data structures are stored within the file. The schema is usually
encoded as a JSON string so that it's decipherable by the human eye. Sqoop will generate
the schema automatically based on the metadata information retrieved from the data‐
base server and will retain the schema in each generated file. Your application will need
to depend on Avro libraries in order to open and process data stored as Avro. You don't
need to import any special class, such as in the SequenceFile case, as all required
metadata is embedded in the imported files themselves.
2.6. Compressing Imported Data
Problem
You want to decrease the overall size occupied on HDFS by using compression for
generated files.
Solution
Use the parameter --compress to enable compression:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--table cities \
--compress
Discussion
Sqoop takes advantage of the inherent parallelism of Hadoop by leveraging Hadoop's
execution engine, MapReduce, to perform data transfers. As MapReduce already has
excellent support for compression, Sqoop simply reuses its powerful abilities to provide
compression options. By default, when using the --compress parameter, output files
will be compressed using the GZip codec, and all files will end up with a .gz extension.
You can choose any other codec using the --compression-codec parameter. The fol‐
lowing example uses the BZip2 codec instead of GZip (files on HDFS will end up having
the .bz2 extension):
Search WWH ::




Custom Search