Imports: A Deeper Look
As mentioned earlier, Sqoop imports a table from a database by running a MapReduce job that extracts rows from the table and writes the records to HDFS. How does MapReduce read the rows? This section explains how Sqoop works under the hood.
At a high level, Figure 15-1 demonstrates how Sqoop interacts with both the database
source and Hadoop. Like Hadoop itself, Sqoop is written in Java. Java provides an API
called Java Database Connectivity, or JDBC, that allows applications to access data stored
in an RDBMS as well as to inspect the nature of this data. Most database vendors provide a
JDBC driver that implements the JDBC API and contains the necessary code to connect to
their database servers.
NOTE
Based on the URL in the connect string used to access the database, Sqoop attempts to predict which
driver it should load. You still need to download the JDBC driver itself and install it on your Sqoop client.
For cases where Sqoop does not know which JDBC driver is appropriate, users can specify the JDBC
driver explicitly with the --driver argument. This capability allows Sqoop to work with a wide variety
of database platforms.
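To make the driver-prediction step concrete, the sketch below shows one way scheme-based inference could work: parse the subprotocol out of the connect string (for example, "mysql" from "jdbc:mysql://...") and look it up in a table of driver class names. This is an illustrative sketch only; the class name, the lookup table, and the exact URLs are assumptions, not Sqoop's actual implementation.

```java
import java.util.Map;

// Hypothetical sketch of scheme-based JDBC driver inference, similar in
// spirit to what Sqoop does with the connect string. The URL-to-driver
// mappings below are illustrative, not Sqoop's real table.
public class DriverInference {
    private static final Map<String, String> DRIVERS = Map.of(
        "mysql", "com.mysql.jdbc.Driver",
        "postgresql", "org.postgresql.Driver",
        "oracle", "oracle.jdbc.OracleDriver");

    // Extract the subprotocol from a JDBC URL, e.g. "mysql" from
    // "jdbc:mysql://db.example.com/corp", and look up a driver class.
    // Returns null when no driver is known, which is the case where a
    // user would supply --driver explicitly.
    public static String inferDriver(String connectString) {
        String[] parts = connectString.split(":");
        if (parts.length < 2 || !parts[0].equals("jdbc")) {
            return null; // not a JDBC URL
        }
        return DRIVERS.get(parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(inferDriver("jdbc:mysql://db.example.com/corp"));
        System.out.println(inferDriver("jdbc:unknown://host/db"));
    }
}
```

The null case corresponds to the situation described above: when the subprotocol is unrecognized, the user must name the driver class explicitly with --driver.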
Before the import can start, Sqoop uses JDBC to examine the table it is to import. It retrieves a list of all the columns and their SQL data types. These SQL types (VARCHAR, INTEGER, etc.) can then be mapped to Java data types (String, Integer, etc.), which will hold the field values in MapReduce applications. Sqoop's code generator will use this information to create a table-specific class to hold a record extracted from the table.
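The SQL-to-Java mapping can be sketched with the constants in java.sql.Types, which is what JDBC reports for each column. The mapping below covers only a few common types and is an assumption for illustration; Sqoop's real mapping is more complete.

```java
import java.sql.Types;

// Illustrative sketch of the SQL-to-Java type mapping a code generator
// like Sqoop's relies on. Given the JDBC type code for a column, return
// the name of the Java type that will hold the field value.
public class TypeMapping {
    public static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return "String";
            case Types.INTEGER:
                return "Integer";
            case Types.BIGINT:
                return "Long";
            case Types.FLOAT:
            case Types.DOUBLE:
                return "Double";
            case Types.DATE:
                return "java.sql.Date";
            case Types.TIMESTAMP:
                return "java.sql.Timestamp";
            default:
                return null; // type not covered by this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(javaTypeFor(Types.VARCHAR));
        System.out.println(javaTypeFor(Types.INTEGER));
    }
}
```

A generated record class would then declare one field of the mapped Java type per column, giving MapReduce a typed container for each extracted row.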