Apache Sqoop
While analyzing data, data analysts often need to gather data from external sources, such as relational databases, and bring it into HDFS for processing. After the data has been processed in Hadoop, they may also need to send results from HDFS back to an external relational data store. Apache Sqoop is the tool for these requirements: it transfers data between HDFS and relational database systems such as MySQL and Oracle.
Sqoop expects the external database to define the schema for imports to HDFS; here, the schema refers to the metadata or structure of the data. Sqoop performs both imports and exports as MapReduce jobs, leveraging the parallelism and fault tolerance of MapReduce to carry out the transfers.
When importing data from an external relational database, Sqoop takes a table as input, reads it row by row, and generates output files that are placed in HDFS. Because the import runs as a parallel MapReduce job, a single input table typically produces several output files.
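As an illustration, an import of a hypothetical MySQL table named employees might look like the following sketch; the host, database, credentials, and target directory are placeholders:

    sqoop import \
        --connect jdbc:mysql://dbserver/sales \
        --username analyst --password secret \
        --table employees \
        --target-dir /user/hadoop/employees \
        --num-mappers 4

The --num-mappers option controls how many parallel map tasks are used for the import, and therefore how many output files are produced for the table.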
The following diagram shows the two-way flow of data between an RDBMS and HDFS:
Once the data is in HDFS, analysts process it, producing new output files. If required, these results can be exported back to an external relational database system using Sqoop: Sqoop reads delimited files from HDFS, constructs database records from them, and inserts those records into the target table.
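For instance, exporting processed results into a hypothetical sales_summary table might look like the following sketch; the connection details, table name, and export directory are illustrative, and the files in HDFS are assumed to be comma-delimited:

    sqoop export \
        --connect jdbc:mysql://dbserver/sales \
        --username analyst --password secret \
        --table sales_summary \
        --export-dir /user/hadoop/sales_summary \
        --input-fields-terminated-by ','

The --input-fields-terminated-by option tells Sqoop how the fields in the HDFS files are delimited so that it can construct database records correctly.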
Sqoop is a highly configurable tool: you can, for example, restrict an import or export to specific columns. All operations in Sqoop are driven from the command-line interface. Sqoop 2, the newer version of Sqoop, additionally provides a web user interface for performing imports and exports.
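As an illustration of this configurability, the --columns option limits an import to selected columns; the table and column names here are hypothetical:

    sqoop import \
        --connect jdbc:mysql://dbserver/sales \
        --username analyst --password secret \
        --table employees \
        --columns "id,name,salary" \
        --target-dir /user/hadoop/employee_salaries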